
Imperial College London

Department of Computing

Independent Study Option (CO-512)

Emotion Generation and Recognition: A StarGAN Approach

Author: Aritra Banerjee (CID: 01539378)

Supervisor: Dimitrios Kollias

A report submitted in fulfillment of the degree

MSc Computing (Machine Learning)

October 25, 2019

arXiv:1910.11090v1 [cs.CV] 12 Oct 2019


Abstract

The main idea of this ISO is to use StarGAN (a type of GAN model) to perform training and testing on an emotion dataset, resulting in emotion recognition driven by the valence-arousal scores of the 7 basic expressions. We have created an entirely new dataset consisting of 4K videos. The dataset covers the 7 basic emotions:

• Happy

• Sad

• Angry

• Surprised

• Fear

• Disgust

• Neutral

We performed face detection and alignment, and then manually annotated the frames/images in the dataset with basic valence-arousal values according to their emotions. The existing StarGAN model was then trained on our dataset, after which a few subjects were chosen manually to test the performance of the trained StarGAN model.


Acknowledgements

I would like to thank everyone who provided particularly useful assistance, technical or otherwise, during my Independent Study Option (ISO). I would especially like to thank my supervisor, Dimitrios Kollias, for being an outstanding mentor and guide throughout the entire project. He guided me through every step and ensured that I worked on this ISO to my full potential within the given deadline.


Contents

1 Introduction
  1.1 Generative Adversarial Networks (GAN)
  1.2 StarGAN

2 Background

3 Collecting and Preprocessing the Dataset
  3.1 Collection of Dataset
  3.2 Preprocessing the dataset
    3.2.1 Face Detection and Alignment
    3.2.2 Annotating the dataset

4 Method Implementation and Results
  4.1 Model Architecture
  4.2 Training the dataset
  4.3 Testing the dataset

5 Conclusion

Bibliography


List of Figures

1.1 General Overview of a GAN consisting of Discriminator (D) and Generator (G)
1.2 General Overview of a StarGAN consisting of Discriminator (D) and Generator (G)
2.1 General Overview of a DIAT model
2.2 An IcGAN model consisting of Encoder and a conditional GAN generator
2.3 General Overview of a DiscoGAN consisting of Discriminator (D) and Generator (G)
3.1 The three step pipeline for Face Detection and Alignment
3.2 Valence and Arousal metrics in 2D space
4.1 Generator Network Architecture
4.2 Discriminator Network Architecture
4.3 Training Samples depicting the 7 different expressions
4.4 Test Results for Subject 1
4.5 Test Results for Subject 2


Chapter 1

Introduction

In this project we use a type of Generative Adversarial Network (GAN) [1], StarGAN [2], for emotion recognition using valence-arousal scores. StarGAN is essentially a multi-task GAN: the generator produces fake images, while the discriminator distinguishes real images from fake ones and additionally performs emotion recognition on the basis of attributes, i.e. the valence-arousal score in our ISO. Emotion recognition has applications in many fields, such as medicine [3, 4, 5] and marketing.

1.1 Generative Adversarial Networks (GAN)

A GAN is a form of unsupervised learning that simultaneously trains two models: a generative model G that captures the data distribution using a latent noise vector [6], and a discriminative model D that estimates the probability that a sample came from the training data rather than from G [1]. The training procedure drives G to maximize the probability of D making a mistake, so the framework corresponds to a minimax two-player game: the two networks contest each other in a zero-sum setting. D is trained to maximize the probability of assigning the correct label to both training examples and samples from G, while G is simultaneously trained to minimize log(1 − D(G(z))). The minimax game between D and G with value function V (D, G) can thus be written as:

minG maxD V (D, G) = Ex∼pdata(x)[log(D(x))] + Ez∼p(z)[log(1 − D(G(z)))]   (1.1)
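As a minimal PyTorch sketch of one alternating update under this objective (the module and optimizer names are illustrative, not from the report's code):

import torch

def gan_step(G, D, opt_G, opt_D, real, z):
    # One alternating update of D and G for the minimax objective (1.1).
    # G and D are torch.nn.Module instances; D is assumed to output a
    # probability in (0, 1), e.g. via a final sigmoid.
    fake = G(z).detach()  # stop gradients from flowing into G
    # D maximizes log D(x) + log(1 - D(G(z))), i.e. minimizes the negation
    loss_D = -(torch.log(D(real)).mean() + torch.log(1 - D(fake)).mean())
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()
    # G minimizes log(1 - D(G(z))); the equivalent non-saturating form
    # (maximize log D(G(z))) is used here, as is common in practice
    loss_G = -torch.log(D(G(z))).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()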

A typical GAN model is structured as follows:

Figure 1.1: General Overview of a GAN consisting of Discriminator (D) and Generator (G)


In the adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample comes from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

1.2 StarGAN

StarGAN [2] is a type of GAN that solves the problem of multi-domain image-to-image translation. Existing approaches lose robustness and scalability when applied to multiple domains, whereas with StarGAN we do not have to create a different network for each pair of image domains. The task of image-to-image translation is to change a given aspect of an image to another, e.g. changing the facial expression of a person from neutral to happy. The emotion dataset we created has 7 basic expressions, which serve as the 7 labels we use [7, 8]. The basic structure of StarGAN is shown below:

Figure 1.2: General Overview of a StarGAN consisting of Discriminator (D) and Generator (G)

As the attribute, we annotated the videos in the dataset with their valence and arousal scores. As the domain, we take the set of videos pertaining to the same emotion, i.e. sharing the same attribute value (angry, sad, etc.).

Conditional GAN-based image generation has been actively studied. Prior studies have provided both the discriminator and the generator with class information in order to generate samples conditioned on the class [9, 10]. This idea of conditional image generation has been successfully applied to domain transfer [11], which is a building block for this project. Recent works have achieved significant results in image-to-image translation [11, 12]. For instance, pix2pix [12] learns this task in a supervised manner using conditional GANs [9]; it combines an adversarial loss with an L1 penalty loss and thus requires paired data samples.

The basic idea of the StarGAN approach is to train a single generator G that learns mappings among multiple domains; in our case these are the 7 basic expressions. To achieve this, we train G to translate [13] an input image x into an output image y conditioned on the target domain label c, G(x, c) → y. We randomly generate the target domain label c, e.g. from the valence or arousal score, so that the generator learns to flexibly translate the input image. The StarGAN approach also introduces an auxiliary classifier [14] that allows a single discriminator to control multiple domains: the discriminator produces probability distributions over both source labels and domain labels, D : x → Dsrc(x), Dcls(x).
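In code, this conditioning can be sketched as follows (a minimal PyTorch illustration; the spatial replication of the label follows the public StarGAN reference implementation, and all names are illustrative):

import torch

def translate(G, x, c):
    # x: (N, 3, H, W) input images; c: (N, num_domains) one-hot target labels.
    # The label vector is replicated spatially and concatenated to the image
    # channels, so G sees 3 + num_domains input channels: G(x, c) -> y.
    c_map = c.view(c.size(0), c.size(1), 1, 1).repeat(1, 1, x.size(2), x.size(3))
    return G(torch.cat([x, c_map], dim=1))

def discriminate(D, x):
    # D returns both a real/fake score map and domain-classification logits,
    # matching D: x -> (Dsrc(x), Dcls(x)).
    out_src, out_cls = D(x)
    return out_src, out_cls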


Chapter 2

Background

In this chapter we compare StarGAN against recent facial expression studies that take different approaches. These are the baseline models that perform image-to-image translation [15, 16] across different domains, and we discuss how StarGAN has improved upon them. We describe four baseline models here:

• DIAT [17], or Deep Identity-Aware Transfer, uses an adversarial loss to learn the mapping from i ∈ I to j ∈ J, where i and j are face images in two different domains I and J, respectively. This method adds a regularization term on the mapping, ‖i − F(G(i))‖1, to preserve identity features of the source image, where F is a feature extractor pretrained on a face recognition task.

Figure 2.1: General Overview of a DIAT model

• CycleGAN [18] uses an adversarial loss to learn the mapping between two different domains I and J. This method regularizes the mapping via cycle consistency losses, ‖i − GJI(GIJ(i))‖1 and ‖j − GIJ(GJI(j))‖1 (a sketch of these losses is given after this list). Because every ordered pair of domains needs its own generator and discriminator, handling n domains requires n(n−1) generators; in our case of 7 expression domains, this would mean training a separate network for every pair of expressions.

• IcGAN [19], or Invertible Conditional GAN, combines an encoder with a cGAN [9] model. The cGAN learns the mapping G : {z, c} → i, which generates an image i conditioned on both the latent vector z and the conditional vector c. On top of that, IcGAN introduces an encoder to learn the inverse mappings of the cGAN, Ez : i → z and Ec : i → c. This allows IcGAN to synthesize images by changing only the conditional vector while preserving the latent vector.

Figure 2.2: An IcGAN model consisting of Encoder and a conditional GAN generator

Figure 2.2: An IcGAN model consisting of Encoder and a conditional GAN generator

• DiscoGAN [11] is the foundational baseline of StarGAN, as DiscoGAN introduces cross-domain relations. DiscoGAN learns the relations between different unpaired domains, with a different label value for each domain.

Figure 2.3: General Overview of a DiscoGAN consisting of Discriminator (D) and Generator (G)
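As a minimal PyTorch sketch of the cycle consistency losses mentioned in the CycleGAN description above (the generator names are illustrative):

import torch

def cycle_consistency_loss(G_IJ, G_JI, i, j):
    # L1 cycle terms ||i - G_JI(G_IJ(i))||_1 and ||j - G_IJ(G_JI(j))||_1:
    # translating to the other domain and back should reconstruct the input.
    loss_i = torch.mean(torch.abs(i - G_JI(G_IJ(i))))
    loss_j = torch.mean(torch.abs(j - G_IJ(G_JI(j))))
    return loss_i + loss_j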


Chapter 3

Collecting and Preprocessing the Dataset

In this chapter we elaborate on how the dataset was collected manually from internet sources, as in [20]. Face detection was then performed on the videos, and the detected faces were aligned to make StarGAN training easier, resulting in individual frames/images. Each frame/image was then annotated with its valence-arousal score [21, 22]. Overall, around 484 videos were collected, spread more or less equally among the 7 different expressions, which resulted in ∼250K frames/images.

3.1 Collection of Dataset

We used the website https://www.shutterstock.com/ to manually collect the videos required for this ISO. Each of the following expressions, together with its synonyms [23, 24, 25], was searched exhaustively on the website, and the resulting videos were divided into 7 folders:

happy: glad, pleased, delighted, cheerful, ecstatic, joyful, thrilled, upbeat, overjoyed, excited, amused, astonished.
neutral: serene, calm, at ease, relaxed, inactive, indifferent, cool, uninvolved.
angry: enraged, annoyed, frustrated, furious, heated, irritated, outraged, resentful, fierce, hateful, ill-tempered, mad, infuriated, wrathful.
disgust: antipathy, dislike, loathing, sickness.
fear: horror, terror, scare, panic, nightmare, phobia, tremor, fright, creeps, dread, jitters, cold feet, cold sweat.
sad: depressed, miserable, sorry, pessimistic, bitter, heartbroken, mournful, melancholy, sorrowful, down, blue, gloomy.
surprise: awe, amazement, curious, revelation, precipitation, suddenness, unforeseen, unexpected.

Using these searches we built the dataset folders; the number of videos/identities per emotion is:

• Happy - 76 videos.

• Neutral - 75 videos.

• Sad - 71 videos.


• Surprised - 70 videos.

• Angry - 72 videos.

• Disgust - 60 videos.

• Fear - 60 videos.

Total number of videos/identities: 484.
Total number of frames: approximately 250K.

3.2 Preprocessing the dataset

In this section we preprocess the dataset using face detection and alignment, and then annotate the frames/images with their valence-arousal scores.

3.2.1 Face Detection and Alignment

MTCNN [26], or Multi-Task Cascaded Convolutional Neural Networks [27], was used for face detection, and the five facial landmarks it outputs were used for face alignment. The entire detection and alignment pipeline is summarized in the algorithm below:

Algorithm 1: Facial Detection and Alignment

1: We use a fully convolutional network [28], here called the Proposal Network (P-Net), to obtain candidate windows and their bounding-box regression vectors, in a similar manner to [29]. We then use the estimated bounding-box regression vectors to calibrate the candidates and apply non-maximum suppression (NMS) [30] to merge highly overlapping candidates.

2: All surviving candidates are fed to another CNN, called the Refine Network (R-Net), which rejects a large number of false candidates, performs calibration with bounding-box regression, and again merges candidates with NMS.

3: This step is similar to the second, but here we describe the face in more detail. In particular, the network outputs the positions of five facial landmarks, which are used to align the faces.
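A minimal sketch of running this cascade, assuming the facenet-pytorch implementation of MTCNN (the frame path is illustrative):

from PIL import Image
from facenet_pytorch import MTCNN  # pip install facenet-pytorch

# Build the three-stage cascaded detector; detect() returns bounding
# boxes, confidences, and the five facial landmarks per detected face.
mtcnn = MTCNN(keep_all=True)
frame = Image.open("frames/happy/vid004_frame0110.jpg")  # illustrative path
boxes, probs, landmarks = mtcnn.detect(frame, landmarks=True)
# boxes: (n_faces, 4); landmarks: (n_faces, 5, 2) -- eyes, nose, mouth
# corners, which can then be used to warp each face into an aligned crop.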

The three steps are also illustrated as a figure flow below:


Figure 3.1: The three step pipeline for Face Detection and Alignment


3.2.2 Annotating the dataset

In this subsection, each frame/image was assigned a valence-arousal score based on the positive or negative character of its emotion, as in [31, 32, 33]. We define the valence-arousal score in a two-dimensional space [34, 35, 36], with values ranging from −1 to +1 for each of valence and arousal. The following figure illustrates the metrics we used to measure the extent of a particular emotion and whether the emotion is positive or negative.

Figure 3.2: Valence and Arousal metrics in 2D space

The valence score measures how positive or negative the emotion of a particular person is.

The arousal score measures how calm or excited the person is, i.e. the intensity or extent of the particular emotion.

Consider an example: for an angry frame, the valence score is ∼ −0.57 and the arousal score is ∼ 0.63. The negative sign on the valence score indicates a negative emotion, and the high arousal score means the subject is very angry, with a lot of excitement or energy in his or her face.

This concludes the preprocessing of the dataset: we collected a brand-new dataset and preprocessed it using detection, alignment, and annotation.
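For concreteness, each annotated frame can be recorded as a (frame, valence, arousal) row; the following is a hypothetical sketch of such records (the file names and values are illustrative):

import csv

# Hypothetical per-frame annotation records: one (valence, arousal) pair per
# frame, each value in [-1, 1], matching the 2D space of Figure 3.2.
rows = [
    ("angry/vid012_frame0031.jpg", -0.57, 0.63),  # negative, high-energy
    ("happy/vid004_frame0110.jpg", 0.80, 0.45),   # positive, fairly excited
]
with open("annotations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["frame", "valence", "arousal"])
    writer.writerows(rows)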


Chapter 4

Method Implementation and Results

In this chapter we apply the existing StarGAN model to the dataset we created: we train the discriminator and generator on the images, and finally test the trained model on images from our dataset.

4.1 Model Architecture

We use the model architecture specified in the StarGAN paper [2] and train the generator and the discriminator accordingly. For the generator network, we use instance normalization [37] in all layers except the last output layer. For the discriminator network, we use Leaky ReLU [38] with a negative slope of 0.01. The generator network architecture is given below:

Figure 4.1: Generator Network Architecture


The discriminator network architecture is given below:

Figure 4.2: Discriminator Network Architecture
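A minimal PyTorch sketch of these layer patterns (the filter counts and kernel sizes are illustrative, not the exact StarGAN configuration):

import torch.nn as nn

# Generator blocks use instance normalization; the output layer does not.
gen_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
    nn.InstanceNorm2d(128, affine=True),
    nn.ReLU(inplace=True),
)
gen_output = nn.Sequential(  # last layer: no normalization, tanh output
    nn.Conv2d(64, 3, kernel_size=7, stride=1, padding=3),
    nn.Tanh(),
)
# Discriminator blocks use Leaky ReLU with a negative slope of 0.01.
disc_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.01),
)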

4.2 Training the dataset

The generator has 8,436,800 parameters in total; the discriminator has 44,735,424. The models are trained using Adam [39] with β1 = 0.5 and β2 = 0.999. We perform one generator update after every five discriminator updates, as in Wasserstein GANs [40]. The batch size is set to 16 during training. To produce higher-quality images and stabilize the training process, we define the adversarial loss using the Wasserstein GAN objective [41] with a gradient penalty [40], which can be written as:

Ladv = Ex[Dsrc(x)] − Ex,c[Dsrc(G(x, c))] − λgp Ex̂[(‖∇x̂Dsrc(x̂)‖₂ − 1)²]   (4.1)

where x̂ is sampled uniformly along straight lines between pairs of real and generated images.
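A minimal PyTorch sketch of this gradient-penalty term (function and variable names are illustrative; D is assumed to return the (Dsrc, Dcls) pair described earlier, and λgp = 10 is a common choice):

import torch

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    # Interpolate x_hat uniformly along straight lines between paired real
    # and generated images, then penalize the deviation of
    # ||grad_{x_hat} D_src(x_hat)||_2 from 1, as in the last term of Eq. (4.1).
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    out_src, _ = D(x_hat)
    grad = torch.autograd.grad(outputs=out_src.sum(), inputs=x_hat,
                               create_graph=True)[0]
    grad_norm = grad.view(grad.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

After training on the entire dataset, we obtain the final training samples illustrated below: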


Figure 4.3: Training Samples depicting the 7 different expressions (columns: Input, Angry, Disgust, Fear, Happy, Neutral, Sad, Surprised)


4.3 Testing the dataset

After training on the entire dataset and generating the 7 different expressions for the images, we finally test the trained model on a few subjects to check its accuracy. We split the dataset into 90 percent training and 10 percent test data.

For subject 1:

Figure 4.4: Test Results for Subject 1 (columns: Input, Angry, Disgust, Fear, Happy, Neutral, Sad, Surprised)


For subject 2:

Figure 4.5: Test Results for Subject 2 (columns: Input, Angry, Disgust, Fear, Happy, Neutral, Sad, Surprised)

As can be clearly seen, the trained model generates the 7 basic expressions for a given input image.

The discriminator losses after 20,000 iterations are:
Iteration [20000/20000], D/loss_real: -69.1152, D/loss_fake: 22.4376, D/loss_cls: 0.0002, D/loss_gp: 1.2612

The generator losses after 20,000 iterations are:
Iteration [20000/20000], G/loss_fake: -25.4440, G/loss_rec: 0.1537, G/loss_cls: 1.2868


Chapter 5

Conclusion

In conclusion, it can be clearly observed that the dataset is classified into the 7 basic expressions based on its valence-arousal scores. We first preprocess the data using face detection, alignment, and annotation, and then divide it into training and test sets. The dataset is then fed into the StarGAN model, which trains for 20,000 iterations on around 250K frames/images. Finally, we test the trained model on some sample subjects and observe how accurately the images are generated according to the valence-arousal scores of each emotion [42].

The existing model could be further improved by increasing the number of iterations, which may yield even better accuracy on the test set. The number of subjects in the training dataset could also be increased: we used around 500 different identities for training, and this number can be raised during dataset collection and preprocessing.


Bibliography

[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[2] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.

[3] Athanasios Tagaris, Dimitrios Kollias, and Andreas Stafylopatis. Assessment of Parkinson's disease based on deep neural networks. In International Conference on Engineering Applications of Neural Networks, pages 391–403. Springer, 2017.

[4] Athanasios Tagaris, Dimitrios Kollias, Andreas Stafylopatis, Georgios Tagaris, and Stefanos Kollias. Machine learning for neurodegenerative disorder diagnosis—survey of practices and launch of benchmark dataset. International Journal on Artificial Intelligence Tools, 27(03):1850011, 2018.

[5] Dimitrios Kollias, Athanasios Tagaris, Andreas Stafylopatis, Stefanos Kollias, and Georgios Tagaris. Deep neural architectures for prediction in healthcare. Complex & Intelligent Systems, 4(2):119–131, 2018.

[6] Konstantinos A Raftopoulos, Stefanos D Kollias, Dionysios D Sourlas, and Marin Ferecatu. On the beneficial effect of noise in vertex localization. International Journal of Computer Vision, 126(1):111–139, 2018.

[7] Dimitrios Kollias, Athanasios Tagaris, and Andreas Stafylopatis. On line emotion detection using retrainable deep neural networks. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1–8. IEEE, 2016.

[8] Dimitrios Kollias, Miao Yu, Athanasios Tagaris, Georgios Leontidis, Andreas Stafylopatis, and Stefanos Kollias. Adaptation and contextualization of deep neural network models. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1–8. IEEE, 2017.

[9] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[10] Augustus Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.

[11] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1857–1865. JMLR.org, 2017.


[12] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.

[13] Georgios Goudelis, Konstantinos Karpouzis, and Stefanos Kollias. Exploring trace transform for robust human action recognition. Pattern Recognition, 46(12):3238–3248, 2013.

[14] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2642–2651. JMLR.org, 2017.

[15] Dimitrios Kollias, Shiyang Cheng, Maja Pantic, and Stefanos Zafeiriou. Photorealistic facial synthesis in the dimensional affect space. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.

[16] Dimitrios Kollias, Shiyang Cheng, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Generating faces for affect analysis. arXiv preprint arXiv:1811.05027, 2018.

[17] Mu Li, Wangmeng Zuo, and David Zhang. Deep identity-aware transfer of facial attributes. arXiv preprint arXiv:1610.05586, 2016.

[18] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.

[19] Guim Perarnau, Joost Van De Weijer, Bogdan Raducanu, and Jose M Álvarez. Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355, 2016.

[20] Dimitrios Kollias and Stefanos Zafeiriou. Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface.

[21] Dimitrios Kollias and Stefanos Zafeiriou. Aff-wild2: Extending the aff-wild database for affect recognition. arXiv preprint arXiv:1811.07770, 2018.

[22] Dimitrios Kollias and Stefanos Zafeiriou. A multi-task learning & generation framework: Valence-arousal, action units & primary expressions. arXiv preprint arXiv:1811.07771, 2018.

[23] Anastasios D Doulamis, Yannis S Avrithis, Nikolaos D Doulamis, and Stefanos D Kollias. Interactive content-based retrieval in video databases using fuzzy classification and relevance feedback. In Proceedings IEEE International Conference on Multimedia Computing and Systems, volume 2, pages 954–958. IEEE, 1999.

[24] Nikos Simou, Th Athanasiadis, Giorgos Stoilos, and Stefanos Kollias. Image indexing and retrieval using expressive fuzzy description logics. Signal, Image and Video Processing, 2(4):321–335, 2008.

[25] Nikolaos Simou and Stefanos Kollias. Fire: A fuzzy reasoning engine for imprecise knowledge. In K-Space PhD Students Workshop, Berlin, Germany, volume 14. Citeseer, 2007.

[26] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.


[27] Dimitrios Kollias and Stefanos Zafeiriou. Training deep neural networks with different datasets in-the-wild: The emotion recognition paradigm. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2018.

[28] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.

[29] Sachin Sudhakar Farfade, Mohammad J Saberian, and Li-Jia Li. Multi-view face detection using deep convolutional neural networks. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, pages 643–650. ACM, 2015.

[30] Alexander Neubeck and Luc Van Gool. Efficient non-maximum suppression. In 18th International Conference on Pattern Recognition (ICPR'06), volume 3, pages 850–855. IEEE, 2006.

[31] Stefanos Zafeiriou, Dimitrios Kollias, Mihalis A Nicolaou, Athanasios Papaioannou, Guoying Zhao, and Irene Kotsia. Aff-wild: Valence and arousal 'in-the-wild' challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 34–41, 2017.

[32] Dimitrios Kollias, Mihalis A Nicolaou, Irene Kotsia, Guoying Zhao, and Stefanos Zafeiriou. Recognition of affect in the wild using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 26–33, 2017.

[33] Dimitrios Kollias, Panagiotis Tzirakis, Mihalis A Nicolaou, Athanasios Papaioannou, Guoying Zhao, Björn Schuller, Irene Kotsia, and Stefanos Zafeiriou. Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision, 127(6-7):907–929, 2019.

[34] Emery Schubert. Measuring emotion continuously: Validity and reliability of the two-dimensional emotion-space. Australian Journal of Psychology, 51(3):154–165, 1999.

[35] Dimitrios Kollias and Stefanos Zafeiriou. A multi-component cnn-rnn approach for dimensional emotion recognition in-the-wild. arXiv preprint arXiv:1805.01452, 2018.

[36] Dimitrios Kollias and Stefanos Zafeiriou. Exploiting multi-cnn features in cnn-rnn based dimensional emotion recognition on the omg in-the-wild dataset. arXiv preprint arXiv:1910.01417, 2019.

[37] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[38] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.

[39] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[40] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.

[41] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.


[42] Dimitris Kollias, George Marandianos, Amaryllis Raouzaiou, and Andreas-Georgios Stafylopatis. Interweaving deep learning and semantic techniques for emotion analysis in human-machine interaction. In 2015 10th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP), pages 1–6. IEEE, 2015.
