
Learning Multimodal Representations with Factorized Deep Generative Models
Yao-Hung Hubert Tsai*†, Paul Pu Liang*†, Amir Zadeh‡, Louis-Philippe Morency‡, and Ruslan Salakhutdinov†

{†Machine Learning Department, ‡Language Technologies Institute}, Carnegie Mellon University
*Equal contributions. {yaohungt,pliang,abagherz,morency,rsalakhu}@cs.cmu.edu

Multimodal Factorization Model
• Bayesian Network

[Figure: Bayesian network diagrams of the generative network and the inference network.]

• Notations
  – $X_{1:M}$: multimodal data from $M$ modalities; $Y$: labels
  – $\hat{X}_{1:M}$: generated multimodal data; $\hat{Y}$: generated labels
  – $Z_{a_\cdot}$: modality-specific latent variables; $F_\cdot$: factors

• Summary
  – Joint generative-discriminative objective for multimodal data.
  – Factorize representation into independent sets of factors:
    ∗ Multimodal discriminative factors
    ∗ Modality-specific generative factors

• Neural Architecture
  – Encoder $Q(Z_y \mid X_{1:M})$ can be parametrized by any model that performs multimodal fusion (see the sketch below).
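As one illustration, a minimal sketch (in PyTorch, with hypothetical module names and dimensions, not the poster's exact architecture) of this plug-in design: any fusion model that maps per-modality features to a single vector can parametrize the $Z_y$ encoder.

```python
import torch
import torch.nn as nn

class ZyEncoder(nn.Module):
    """Wraps an arbitrary multimodal fusion module to parametrize Q(Z_y | X_{1:M})."""
    def __init__(self, fusion: nn.Module, fused_dim: int, zy_dim: int):
        super().__init__()
        self.fusion = fusion              # any model mapping a list of modality features to one vector
        self.head = nn.Linear(fused_dim, zy_dim)

    def forward(self, xs):
        # xs: list of M tensors, one per modality, each of shape [batch, feature_dim_i]
        return self.head(self.fusion(xs))

# Example stand-in: concatenation + MLP fusion; a stronger fusion model could be swapped in.
class ConcatFusion(nn.Module):
    def __init__(self, in_dims, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(sum(in_dims), hidden_dim), nn.ReLU())

    def forward(self, xs):
        return self.net(torch.cat(xs, dim=-1))
```

Here `fused_dim` must match whatever dimensionality the chosen fusion module outputs; the input dimensions are placeholders.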

• Contributions
  – SOTA performance on six multimodal datasets.
  – Flexible generative capabilities via independent factors.
  – Ability to reconstruct missing modalities.
  – Interpretation of multimodal interactions.

Generation, Inference, and Learning
• Generation
  – Factorization over the joint distribution:
$$
P(\hat{X}_{1:M}, \hat{Y}) = \int_{F, Z} P(\hat{X}_{1:M}, \hat{Y} \mid F)\, P(F \mid Z)\, P(Z)\, dF\, dZ
$$
$$
= \int_{F_y, F_{a_{1:M}},\, Z_y, Z_{a_{1:M}}} \Big( P(\hat{Y} \mid F_y) \prod_{i=1}^{M} P(\hat{X}_i \mid F_{a_i}, F_y) \Big) \Big( P(F_y \mid Z_y) \prod_{i=1}^{M} P(F_{a_i} \mid Z_{a_i}) \Big) \Big( P(Z_y) \prod_{i=1}^{M} P(Z_{a_i}) \Big)\, dF\, dZ,
$$
  with $dF = dF_y \prod_{i=1}^{M} dF_{a_i}$ and $dZ = dZ_y \prod_{i=1}^{M} dZ_{a_i}$.
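A minimal sketch of this factorized generative path (PyTorch; all layer shapes and module names are hypothetical): modality-specific factors $F_{a_i} = G_{a_i}(Z_{a_i})$ and the shared discriminative factor $F_y = G_y(Z_y)$ feed modality decoders $F_i(F_{a_i}, F_y)$ and a label decoder $D(F_y)$.

```python
import torch
import torch.nn as nn

class MFMGenerator(nn.Module):
    """Sketch of P(X_hat_i | F_{a_i}, F_y) and P(Y_hat | F_y) with factorized latents."""
    def __init__(self, z_dim, f_dim, x_dims, y_dim):
        super().__init__()
        M = len(x_dims)
        # G_{a_i}: Z_{a_i} -> F_{a_i}, one per modality; G_y: Z_y -> F_y
        self.G_a = nn.ModuleList([nn.Sequential(nn.Linear(z_dim, f_dim), nn.ReLU()) for _ in range(M)])
        self.G_y = nn.Sequential(nn.Linear(z_dim, f_dim), nn.ReLU())
        # F_i: (F_{a_i}, F_y) -> X_hat_i; D: F_y -> Y_hat
        self.F = nn.ModuleList([nn.Linear(2 * f_dim, d) for d in x_dims])
        self.D = nn.Linear(f_dim, y_dim)

    def forward(self, z_a, z_y):
        # z_a: list of M tensors [batch, z_dim]; z_y: tensor [batch, z_dim]
        f_y = self.G_y(z_y)
        f_a = [G(z) for G, z in zip(self.G_a, z_a)]
        x_hat = [F(torch.cat([f, f_y], dim=-1)) for F, f in zip(self.F, f_a)]
        y_hat = self.D(f_y)
        return x_hat, y_hat
```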

• Inference
  – Joint-distribution Wasserstein distance
  – Approximation for intractable exact inference
  – Proposition 1: For any functions $G_y : Z_y \to F_y$, $G_{a_{1:M}} : Z_{a_{1:M}} \to F_{a_{1:M}}$, $D : F_y \to \hat{Y}$, and $F_{1:M} : F_{a_{1:M}}, F_y \to \hat{X}_{1:M}$, the joint-distribution Wasserstein distance satisfies
$$
W_c(P_{X_{1:M}, Y}, P_{\hat{X}_{1:M}, \hat{Y}}) = \inf_{Q_Z = P_Z} \mathbb{E}_{P_{X_{1:M}, Y}}\, \mathbb{E}_{Q(Z \mid X_{1:M}, Y)} \Big[ \sum_{i=1}^{M} c_{X_i}\big(X_i,\, F_i(G_{a_i}(Z_{a_i}), G_y(Z_y))\big) + c_Y\big(Y,\, D(G_y(Z_y))\big) \Big],
$$
  where $P_Z$ is the prior over $Z = [Z_y, Z_{a_{1:M}}]$ and $Q_Z$ is the aggregated posterior of the proposed approximate inference distribution $Q(Z \mid X_{1:M}, Y)$.
  – Generalized mean-field assumption:
$$
Q(Z \mid X_{1:M}, Y) := Q(Z \mid X_{1:M}) := Q(Z_y \mid X_{1:M}) \prod_{i=1}^{M} Q(Z_{a_i} \mid X_i).
$$
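A sketch of the inference network implied by this mean-field factorization (PyTorch; deterministic encoders are used as a simplification, and all module names are hypothetical): one encoder $Q(Z_{a_i} \mid X_i)$ per modality plus a fusion encoder for $Q(Z_y \mid X_{1:M})$, e.g. the `ZyEncoder` sketched earlier.

```python
import torch.nn as nn

class MFMEncoder(nn.Module):
    """Sketch of Q(Z | X_{1:M}) = Q(Z_y | X_{1:M}) * prod_i Q(Z_{a_i} | X_i)."""
    def __init__(self, x_dims, hidden_dim, z_dim, zy_encoder: nn.Module):
        super().__init__()
        # Modality-specific encoders Q(Z_{a_i} | X_i)
        self.enc_a = nn.ModuleList([
            nn.Sequential(nn.Linear(d, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, z_dim))
            for d in x_dims
        ])
        # Fusion encoder Q(Z_y | X_{1:M}), e.g. the ZyEncoder sketched earlier
        self.enc_y = zy_encoder

    def forward(self, xs):
        # xs: list of M tensors, one per modality, each [batch, x_dims[i]]
        z_a = [enc(x) for enc, x in zip(self.enc_a, xs)]
        z_y = self.enc_y(xs)
        return z_a, z_y
```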

• Relaxed Objective
$$
\min_{F, G_{a_{1:M}}, G_y, D}\; \inf_{Q(Z \mid \cdot) \in \mathcal{Q}}\; \mathbb{E}_{P_{X_{1:M}, Y}}\, \mathbb{E}_{Q(Z_{a_1} \mid X_1)} \cdots \mathbb{E}_{Q(Z_{a_M} \mid X_M)}\, \mathbb{E}_{Q(Z_y \mid X_{1:M})} \Big[ \sum_{i=1}^{M} c_{X_i}\big(X_i,\, F_i(G_{a_i}(Z_{a_i}), G_y(Z_y))\big) + c_Y\big(Y,\, D(G_y(Z_y))\big) \Big] + \lambda\, \mathrm{MMD}(Q_Z, P_Z),
$$
  with $P_Z$ the centered isotropic Gaussian $\mathcal{N}(0, I)$ over $Z = [Z_y, Z_{a_{1:M}}]$.
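A sketch of one way to implement this relaxed objective (PyTorch; the squared-error and cross-entropy costs, the RBF kernel bandwidth, and $\lambda$ are illustrative choices rather than the poster's settings): per-modality reconstruction costs plus the label cost plus an MMD penalty between samples of the aggregated posterior and the $\mathcal{N}(0, I)$ prior.

```python
import torch
import torch.nn.functional as nnF

def rbf_mmd(z_q, z_p, sigma=1.0):
    """Biased RBF-kernel MMD^2 estimate between posterior and prior samples."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(z_q, z_q).mean() + k(z_p, z_p).mean() - 2 * k(z_q, z_p).mean()

def mfm_loss(encoder, generator, xs, y, lam=1.0):
    # Mean-field inference: per-modality z_{a_i} plus fused z_y
    z_a, z_y = encoder(xs)
    x_hat, y_hat = generator(z_a, z_y)
    # Reconstruction costs c_{X_i} and prediction cost c_Y (illustrative choices)
    recon = sum(nnF.mse_loss(xh, x) for xh, x in zip(x_hat, xs))
    pred = nnF.cross_entropy(y_hat, y)
    # MMD between samples of the aggregated posterior Q_Z and the prior P_Z = N(0, I)
    z_q = torch.cat(z_a + [z_y], dim=-1)
    z_p = torch.randn_like(z_q)
    return recon + pred + lam * rbf_mmd(z_q, z_p)
```

The per-modality codes and $z_y$ are concatenated so that a single MMD term matches the prior over the full latent $Z = [Z_y, Z_{a_{1:M}}]$.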

• Surrogate Inference for Missing Modalities
  – $\Phi$: surrogate inference network,
$$
\Phi^{*} = \operatorname*{argmin}_{\Phi}\; \mathbb{E}_{P_{X_{2:M}, \hat{X}_1}} \big( -\log P_{\Phi}(\hat{X}_1 \mid X_{2:M}) \big),
$$
  with
$$
P_{\Phi}(\hat{X}_1 \mid X_{2:M}) := \int P(\hat{X}_1 \mid Z_{a_1}, Z_y)\, Q_{\Phi}(Z_{a_1} \mid X_{2:M})\, Q_{\Phi}(Z_y \mid X_{2:M})\, dZ_{a_1}\, dZ_y.
$$
  – Deterministic mappings in $Q_{\Phi}(\cdot \mid \cdot)$
  – $P_{\Phi}(\hat{Y} \mid X_{2:M}) := \int P(\hat{Y} \mid Z_y)\, Q_{\Phi}(Z_y \mid X_{2:M})\, dZ_y$
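A sketch of training the surrogate network $\Phi$ for a missing first modality (PyTorch; a squared-error reconstruction loss stands in for $-\log P_\Phi$, and the module names follow the earlier hypothetical generator sketch): $\Phi$ maps the observed modalities $X_{2:M}$ to deterministic estimates of $Z_{a_1}$ and $Z_y$, which are pushed through the frozen generative network to reconstruct the missing modality.

```python
import torch
import torch.nn as nn

class SurrogatePhi(nn.Module):
    """Deterministic surrogate encoders Q_Phi(Z_{a_1} | X_{2:M}) and Q_Phi(Z_y | X_{2:M})."""
    def __init__(self, obs_dims, hidden_dim, z_dim):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(sum(obs_dims), hidden_dim), nn.ReLU())
        self.to_za1 = nn.Linear(hidden_dim, z_dim)
        self.to_zy = nn.Linear(hidden_dim, z_dim)

    def forward(self, xs_obs):
        # xs_obs: list of the observed modalities X_{2:M}
        h = self.trunk(torch.cat(xs_obs, dim=-1))
        return self.to_za1(h), self.to_zy(h)

def surrogate_step(phi, generator, xs_obs, x1, optimizer):
    """One training step: reconstruct the missing modality X_1 from X_{2:M}.

    The optimizer should contain only phi's parameters; the generative network stays frozen.
    """
    z_a1, z_y = phi(xs_obs)
    f_a1 = generator.G_a[0](z_a1)              # reuse the pretrained factor maps and decoder
    f_y = generator.G_y(z_y)
    x1_hat = generator.F[0](torch.cat([f_a1, f_y], dim=-1))
    loss = nn.functional.mse_loss(x1_hat, x1)  # stands in for -log P_Phi(X_hat_1 | X_{2:M})
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```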

Controllable Generation
• Digits Dataset
  – Handwritten digits (MNIST [3]) + Street View House Numbers (SVHN [5])
• Results

Multimodal Time Series Dataset
• Datasets in Human Multimodal Language
  – Multimodal Personality Trait Recognition
    ∗ Movie Reviews (POM [6])
  – Multimodal Sentiment Analysis
    ∗ Monologue Opinions (CMU-MOSI [10])
    ∗ Online Social Reviews (ICT-MMMO [8])
    ∗ Product Reviews and Opinions (MOUD [7] / YouTube [4])
  – Multimodal Emotion Recognition
    ∗ Recorded Dyadic Dialogues (IEMOCAP [1])
• Multimodal Features
  – Language: pre-trained GloVe word embeddings
  – Visual: facial action units from Facet
  – Acoustic: MFCCs from COVAREP
  – Modalities aligned with P2FA

• Results

Dataset: POM Personality Traits (metric: r)
Task     Con    Pas    Voi    Dom    Cre    Viv    Exp    Ent    Res    Tru    Rel    Out    Tho    Ner    Per    Hum
SOTA2    0.359† 0.425† 0.166‡ 0.235‡ 0.358† 0.417† 0.450† 0.378‡ 0.295� 0.237� 0.215‡ 0.238� 0.363† 0.258� 0.344† 0.319†
SOTA1    0.395# 0.428# 0.193# 0.313# 0.367# 0.431# 0.452# 0.395# 0.333# 0.296# 0.255# 0.259# 0.381# 0.318# 0.377# 0.386#
MFM      0.431  0.450  0.197  0.411  0.380  0.448  0.467  0.452  0.368  0.212  0.309  0.333  0.404  0.333  0.334  0.408

Dataset  CMU-MOSI (Sentiment)                      ICT-MMMO (Sentiment)   YouTube (Sentiment)   MOUD (Sentiment)
Metric   Acc7   Acc2   F1     MAE     r            Acc2    F1             Acc3    F1            Acc2    F1
SOTA2    34.1#  77.1‡  77.0‡  0.968‡  0.625‡       72.5∗   72.6∗          48.3‡   45.1†         81.1#   80.4#
SOTA1    34.7‡  77.4#  77.3#  0.965#  0.632#       73.8#   73.1#          51.7#   51.6#         81.1‡   81.2‡
MFM      36.2   78.1   78.1   0.951   0.662        81.3    79.2           53.3    52.4          82.1    81.7

Dataset: IEMOCAP Emotions
Task     Happy          Sad            Angry          Frustrated     Excited        Neutral
Metric   Acc2   F1      Acc2   F1      Acc2   F1      Acc2   F1      Acc2   F1      Acc2   F1
SOTA2    86.7‡  84.2§   83.4∗  81.7†   85.1�  84.5§   79.5‡  76.6‡   89.6‡  86.3#   68.8§  67.1§
SOTA1    90.1#  85.3#   85.8#  82.8∗   87.0#  86.0#   80.3#  76.8#   89.8#  87.1‡   71.8#  68.5§
MFM      90.2   85.8    88.4   86.1    87.5   86.7    80.4   74.5    90.0   87.1    72.1   68.1

• Ablation Study
  – On CMU-MOSI
• Missing Modalities
  – On CMU-MOSI

Task                   $\hat{X}$ Reconstruction        $\hat{Y}$ Prediction
Metric                 MSE (ℓ)  MSE (a)  MSE (v)       Acc7   Acc2   F1     MAE     r
Purely Generative and Discriminative Baselines
ℓ(anguage) missing     0.0411   -        -             19.4   59.6   59.7   1.386   0.225
a(udio) missing        -        0.0533   -             34.0   73.5   73.4   1.024   0.615
v(isual) missing       -        -        0.0220        33.7   75.4   75.4   0.996   0.634
Multimodal Factorization Model (MFM)
ℓ(anguage) missing     0.0403   -        -             21.7   62.0   61.7   1.313   0.236
a(udio) missing        -        0.0468   -             35.4   74.3   74.3   1.011   0.603
v(isual) missing       -        -        0.0215        35.0   76.4   76.3   0.990   0.635
all present            0.0391   0.0384   0.0182        36.2   78.1   78.1   0.951   0.662

Analyzing Multimodal Representations
• Information-Based Interpretation
  – Analysis of overall trends
  – Hilbert-Schmidt Independence Criterion [2, 9]:
$$
\mathrm{MI}(F_\cdot, \hat{X}_i) = \mathrm{HSIC}_{\mathrm{norm}}(F_\cdot, \hat{X}_i) = \frac{\mathrm{tr}(K_{F_\cdot} H K_{\hat{X}_i} H)}{\|H K_{F_\cdot} H\|_F\, \|H K_{\hat{X}_i} H\|_F}
$$
  – Normalized ratios $r_i = \mathrm{MI}(F_y, \hat{X}_i) / \mathrm{MI}(F_{a_i}, \hat{X}_i)$:

Ratio       r_ℓ     r_v     r_a
CMU-MOSI    0.307   0.030   0.107
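A sketch of the normalized HSIC estimator behind these ratios (NumPy/Python; the Gaussian kernel bandwidth is an assumed choice): Gram matrices of factor samples and generated-modality samples are double-centered with $H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$ and combined as in the formula above.

```python
import numpy as np

def normalized_hsic(f, x, sigma=1.0):
    """HSIC_norm(F, X_hat) = tr(K_F H K_X H) / (||H K_F H||_F * ||H K_X H||_F).

    f: [n, d_f] samples of a factor; x: [n, d_x] samples of a generated modality.
    """
    def gram(a):
        d2 = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    n = f.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kf, Kx = gram(f), gram(x)
    num = np.trace(Kf @ H @ Kx @ H)
    den = np.linalg.norm(H @ Kf @ H, "fro") * np.linalg.norm(H @ Kx @ H, "fro")
    return num / den

# Normalized ratio for modality i (f_y, f_a_i, x_hat_i are sample matrices):
# r_i = normalized_hsic(f_y, x_hat_i) / normalized_hsic(f_a_i, x_hat_i)
```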

• Gradient-Based Interpretation
  – Fine-grained analysis
  – Generated data $\hat{x}_i = [\hat{x}_i^1, \cdots, \hat{x}_i^t, \cdots, \hat{x}_i^T]$, with
$$
\hat{x}_i = F_i(f_{a_i}, f_y),\quad f_{a_i} = G_{a_i}(z_{a_i}),\quad f_y = G_y(z_y),\quad z_{a_i} \sim Q(Z_{a_i} \mid X_i = x_i),\quad z_y \sim Q(Z_y \mid X_{1:M} = x_{1:M})
$$
  – Gradient flow:
$$
\nabla_{f_y}(\hat{x}_i) := \big[\, \|\nabla_{f_y} \hat{x}_i^1\|_F^2,\; \|\nabla_{f_y} \hat{x}_i^2\|_F^2,\; \cdots,\; \|\nabla_{f_y} \hat{x}_i^T\|_F^2 \,\big].
$$
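A sketch of computing this gradient flow with autograd (PyTorch; `decode_i` stands for a hypothetical sequence decoder $F_i$ that returns $\hat{x}_i$ of shape $[T, d_i]$ for a single example): the full Jacobian of the generated sequence with respect to $f_y$ is computed once, then reduced to a squared Frobenius norm per time step.

```python
import torch

def gradient_flow(decode_i, f_ai, f_y):
    """Per-time-step gradient flow [||nabla_{f_y} x_hat_i^t||_F^2 for t = 1..T].

    decode_i: callable F_i mapping (f_{a_i}, f_y) to x_hat_i of shape [T, d_i].
    f_ai, f_y: factor vectors for one example; f_{a_i} is held fixed.
    """
    f_ai = f_ai.detach()
    f_y = f_y.detach()

    def as_function_of_fy(fy):
        # Re-decode modality i as a function of f_y only.
        return decode_i(f_ai, fy)

    # Full Jacobian d x_hat_i / d f_y, shape [T, d_i, dim(f_y)].
    J = torch.autograd.functional.jacobian(as_function_of_fy, f_y)
    # Squared Frobenius norm of each time step's Jacobian block.
    return J.reshape(J.shape[0], -1).pow(2).sum(dim=-1)
```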

[Figure: per-time-step gradient flows $\nabla_{f_y}(\hat{x}_\ell)$, $\nabla_{f_y}(\hat{x}_v)$, and $\nabla_{f_y}(\hat{x}_a)$ for the language, visual, and acoustic modalities over $t = 1$ to $t = 20$, for the utterance "Umm, in a way, a lot of the themes in 'never let me go', which were very profound and deep." Annotated segments include hesitancy, emphasis, and neutral regions in language, a slight smile in the visual modality, and uninformative regions.]

References
[1] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Journal of Language Resources and Evaluation, 2008.
[2] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory, pages 63–77. Springer, 2005.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278–2324, 1998.
[4] L.-P. Morency, R. Mihalcea, and P. Doshi. Towards multimodal sentiment analysis: Harvesting opinions from the web. In ICMI, pages 169–176. ACM, 2011.
[5] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[6] S. Park, H. S. Shim, M. Chatterjee, K. Sagae, and L.-P. Morency. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In ICMI '14, 2014.
[7] V. Perez-Rosas, R. Mihalcea, and L.-P. Morency. Utterance-level multimodal sentiment analysis. In Association for Computational Linguistics (ACL), Aug. 2013.
[8] M. Wöllmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L.-P. Morency. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46–53, 2013.
[9] D. Wu, Y. Zhao, Y.-H. H. Tsai, M. Yamada, and R. Salakhutdinov. "Dependency bottleneck" in auto-encoding architectures: An empirical study. arXiv preprint arXiv:1802.05408, 2018.
[10] A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems, 2016.