
Breaking the Cycle - Colleagues Are All You Need

Ori Nizan, Technion, Israel (snizori@campus.technion.ac.il)
Ayellet Tal, Technion, Israel (ayellet@ee.technion.ac.il)

Figure 1: Translation results for three applications: (a) glasses removal, (b) selfie to anime, (c) male to female. For each application, the columns show the input, our result(s), and the competing method ([42] in (a), [25] in (b), [7] in (c)). Our method is unidirectional (cycles are unnecessary) and multi-modal (multiple results are generated for a given input, e.g. (b)), and it is shown to outperform state-of-the-art results. In (a) our method completely removes the glasses; in (b) the shape of the face is well maintained; and in (c) the women look more "feminine", e.g., no beard leftovers. More results and comparisons can be found later in the paper.

Abstract

This paper proposes a novel approach to performing image-to-image translation between unpaired domains. Rather than relying on a cycle constraint, our method takes advantage of collaboration between various GANs. This results in a multi-modal method, in which multiple alternative and diverse images are produced for a given input image. Our model addresses some of the shortcomings of classical GANs: (1) it is able to remove large objects, such as glasses; (2) since it does not need to support the cycle constraint, no irrelevant traces of the input are left on the generated image; (3) it manages to translate between domains that require large shape modifications. Our results are shown to outperform those generated by state-of-the-art methods for several challenging applications.

1. Introduction

Mapping between different domains is in line with the human ability to find similarities between features in distinctive, yet associated, classes. It is therefore not surprising that image-to-image translation has gained a lot of attention in recent years. Many applications have been demonstrated to benefit from it, yielding beautiful results.

In unsupervised settings, where no paired data is available, shared latent space and cycle-consistency assumptions have been utilized [2, 7, 8, 18, 21, 26, 30, 39, 45, 47]. Despite their successes and benefits, previous methods may suffer from several drawbacks. In particular, the cycle constraint oftentimes causes source-domain features to be preserved, as can be seen, for example, in Figure 1(c), where facial hair remains on the faces of the women. This is due to the need to go back and forth through the cycle. Second, as discussed in [25], these methods are sometimes unsuccessful for image translation tasks that require a large shape change, such as in the case of the anime in Figure 1(b). Finally, as explained in [42], it is still a challenge to completely remove large objects, like glasses, from images, and therefore this task is left for their future work (Figure 1(a)).

We propose a novel approach, termed Council-GAN, which handles these challenges. The key idea is to rely on "collegiality" between GANs, rather than utilizing a cycle. Specifically, instead of using a single pair of generator/discriminator "experts", it utilizes the collective opinion of a group of pairs (the council) and leverages the variation between the results of the generators. This leads to a more stable and diverse domain transfer.

To realize this idea, we propose to train a council of multiple council members, requiring them to learn from each other. Each generator in the council gets the same input from the source domain and produces its own output. However, the outputs produced by the various generators should have some common denominator. For this to happen across all images, the generators have to find common features in the input, which are used to generate their outputs.

Each discriminator learns to distinguish between the generated images of its own generator and images produced by the other generators. This forces each generator to converge to a result that is agreeable to the others. Intuitively, this convergence helps maximize the mutual information between the source domain and the target domain, which explains why the generated images maintain the important features of the source images.

We demonstrate the benefits of our approach for several applications, including glasses removal, face-to-anime translation, and male-to-female translation. In all cases we achieve state-of-the-art results.

Hence, this paper makes the following contributions:

1. We introduce a novel model for unsupervised image-to-image translation, whose key idea is collaboration between multiple generators. In contrast to most recent methods, our model avoids cycle-consistency constraints altogether.

2. Our council manages to achieve state-of-the-art results in a variety of challenging applications.

2. Related work

Generative adversarial networks (GANs). Since the introduction of the GAN framework [15], it has been demonstrated to achieve eye-pleasing results in numerous applications. In this framework, a generator is trained to fool a discriminator, whereas the latter attempts to distinguish between generated samples and real samples. A variety of modifications have been proposed in recent years in an attempt to improve GAN results; see [3, 10, 11, 20, 24, 33, 36, 38, 40, 43, 46] for a few of them.

We are not the first to propose the use of multiple GANs [12, 14, 17, 23]. However, previous approaches differ from ours both in their architectures and in their goals. For instance, some of the previous architectures consist of multiple discriminators and a single generator; conversely, some propose a key discriminator that evaluates the generators' results and improves them. We propose a novel architecture to realize the concept of a council, as described in Section 3. Furthermore, the goal of other approaches is either to push the generators apart from each other, to create diverse solutions, or to improve the results. Our council, in contrast, attempts to find the commonalities between the source and target domains. By requiring the council members to "agree" on each other's results, they in fact learn to focus on the common traits of the domains.

Image-to-image translation. The aim is to learn a mapping from a source domain to a target domain. Early approaches adopt a supervised framework, in which the model learns from paired examples, for instance using a conditional GAN to model the mapping function [22, 44, 48].

Recently, numerous methods have been proposed that use unpaired examples for the learning task and produce highly impressive results; see for example [9, 13, 21, 26, 28, 30, 42, 47], out of a truly extensive literature. This approach is vital for applications for which paired data is unavailable or difficult to obtain. Our model belongs to the class of GAN models that do not require paired training data.

A major concern in the unsupervised approach is which properties of the source domain should be preserved. Examples include pixel values [41], pixel gradients [6], pairwise sample distances [4], and, most recently, cycle consistency [26, 45, 47]. The latter enforces the constraint that translating an image to the target domain and back should recover the original image. Our method avoids using cycles altogether. This has the benefit of bypassing unnecessary constraints on the generated output, and thus avoids preserving hidden information [8].

Most existing methods lack diversity in their results. To address this problem, some methods propose to produce multiple outputs for the same given image [21, 28]. Our method enables image translation with diverse outputs; however, it does so in a manner in which all GANs in the council "acknowledge", to some degree, each other's output.

Ensemble methods. These methods use multiple learning algorithms, trained individually [34, 35, 37], whose predictions are combined. They seek to promote diversity among the models they combine. Conversely, we require the council to learn together and "converge" to agreeable solutions.

3. Model

This section describes our proposed model, which addresses the drawbacks described in Section 1. Our model consists of a set, termed a council, whose members influence each other's results. Each member of the council has one generator and a couple of discriminators, as described below. The generators need not converge to a specific output; instead, each produces its own results, jointly generating a diverse set of results. During training, they take into account the images produced by the other generators. Intuitively, this mutual influence forces the generators to focus on joint traits of the images in the source domain, which can be matched to those in the target domain. For instance, in Figure 1, to transform a male into a female, the generators focus on the structure of the face, on which they can all agree. Therefore, this feature is preserved, which explains the good results.

Figure 2: General approach. The council consists of triplets, each of which contains a generator and two discriminators: Di distinguishes between the generator's output and real examples, whereas D̂i distinguishes between images produced by Gi and images produced by other generators in the council. D̂i is the reason that each of the generators converges to a result that is agreed upon by all other members of the council.

Furthermore, our model avoids cycle constraints. This means that there is no need to go in both directions between the source domain and the target domain. As a result, there is no need to leave traces on the generated image (e.g. glasses) or to limit the amount of change (e.g. anime).

To realize this idea, we define a council of N members as follows (Figure 2). Each member i of the council is a triplet, whose components are a single generator Gi and two discriminators Di and D̂i, 1 ≤ i ≤ N. The task of discriminator Di is to distinguish between the generator's output and real examples from the target domain, as done in any classical GAN. The goal of discriminator D̂i is to distinguish between images produced by Gi and images produced by the other generators in the council. This discriminator is the core of the model and is what differentiates our model from the classical GAN model. It forces the generator to converge to images that could be acknowledged by all council members, i.e., images that share similar features.
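For concreteness, the following PyTorch-style sketch shows one way the council structure could be organized. It is our illustration, not the authors' released implementation; the tiny networks, layer sizes, and class names (TinyGenerator, TinyDiscriminator, CouncilMember) are placeholders.

    import torch
    import torch.nn as nn

    class TinyGenerator(nn.Module):
        def __init__(self, ch=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(ch, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, ch, 3, padding=1), nn.Tanh())
        def forward(self, x):
            return self.net(x)

    class TinyDiscriminator(nn.Module):
        def __init__(self, in_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(32, 1, 4, stride=2, padding=1))
        def forward(self, *imgs):
            # D_i is called with the generated image only; D_hat_i with the
            # (generated image, source image) pair, concatenated along channels.
            return self.net(torch.cat(imgs, dim=1)).mean(dim=[1, 2, 3])

    class CouncilMember(nn.Module):
        def __init__(self):
            super().__init__()
            self.G = TinyGenerator()
            self.D = TinyDiscriminator(in_ch=3)      # generated vs. real (target domain)
            self.D_hat = TinyDiscriminator(in_ch=6)  # my output vs. other members' outputs

    council = nn.ModuleList(CouncilMember() for _ in range(4))   # a council of N = 4
    x = torch.randn(2, 3, 64, 64)                                # a source-domain batch
    outputs = [m.G(x) for m in council]                          # one proposal per member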

The loss function of Di is the classical adversarial loss of [33]. Hereafter, we focus on the loss function of D̂i, which makes the outputs of the various generators share common traits, while still maintaining diversity. At every iteration, D̂i gets as input pairs of (input, output) from all the generators in the council. Rather than distinguishing between real and fake, D̂i distinguishes between the result of "my generator" and the result of "another generator". Hence, during training, Gi attempts to minimize the distance between the outputs of the generators. Note that getting the input, and not only the output, is important in order to make the connection, for each pair, between the features of the source image and those of the generated image.

Figure 3: Zoom into the generator Gi. Our generator is an auto-encoder architecture, similar to that of [21]. The encoder consists of several strided convolutional layers followed by residual blocks. The decoder gets the encoded image (termed the mutual information vector), as well as a random entropy vector. The latter may be interpreted as encoding the leftover information of the target domain. The decoder uses an MLP to produce a set of AdaIN parameters for the random entropy vector [19].

Let Xs be the source domain and Xt be the target domain. In our model we have N mappings Gi : Xs → Xt. Given an image x ∈ Xs, a straightforward adaptation of the classical adversarial loss to our case would be:

    Naive_council_loss_i(Gi, D̂i, {Gj}_{j≠i}, Xs) =
        E_{x∼p(Xs)} Σ_{j≠i} [ log(1 − D̂i(Gi(x), x)) + log(D̂i(Gj(x), x)) ],    (1)

where Gi tries to generate images Gi(x) that look similar to the images Gj(x) produced by the other generators, j ≠ i. In analogy to the classical adversarial loss, both terms in Equation (1) should be minimized: the left term learns to "identify" the corresponding generator Gi as "fake", and the right term learns to "identify" the other generators as "real".
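A possible implementation of the naive council objective in Equation (1) is sketched below. We use the binary cross-entropy (logistic) form of the adversarial loss for readability; the paper's own GAN loss follows [33], so this is an illustrative variant rather than the exact objective, and the function names are ours.

    import torch
    import torch.nn.functional as F

    def naive_council_d_loss(d_hat_i, x, own_fake, other_fakes):
        # Loss for the council discriminator D_hat_i of Equation (1).
        # d_hat_i:     callable scoring a (generated image, source image) pair -> logits (B,)
        # x:           batch of source-domain images
        # own_fake:    G_i(x); detached so that only D_hat_i is updated here
        # other_fakes: list of G_j(x) for j != i, also detached
        loss = 0.0
        own_score = d_hat_i(own_fake.detach(), x)
        for other in other_fakes:
            other_score = d_hat_i(other.detach(), x)
            # "my generator" is labeled fake (0), the other members are labeled real (1)
            loss = loss + F.binary_cross_entropy_with_logits(own_score, torch.zeros_like(own_score)) \
                        + F.binary_cross_entropy_with_logits(other_score, torch.ones_like(other_score))
        return loss / max(len(other_fakes), 1)

    def naive_council_g_loss(d_hat_i, x, own_fake):
        # G_i tries to make its own output be classified as "another generator" (label 1).
        score = d_hat_i(own_fake, x)
        return F.binary_cross_entropy_with_logits(score, torch.ones_like(score))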

To allow multimodal translation, we encode the input image, as illustrated in Figure 3, which zooms into the structure of the generator [21]. The encoded image should carry useful (mutual) information between domains Xs and Xt. Let Ei be the i-th encoder for the source image and let zi be the i-th random entropy vector, associated with the i-th member of the council, 1 ≤ i ≤ N. zi enables each generator to generate multiple diverse results. Equation (1) is modified so as to get an encoded image (instead of the original input image) and the random entropy vector. The loss function of D̂i is then defined as:

    Council_loss_i(Gi, D̂i, {Gj}_{j≠i}, Xs, zi, {Ej}_{1≤j≤N}) =
        E_{x∼p(Xs)} Σ_{j≠i} [ log(1 − D̂i(Gi(Ei(x), zi), x)) + log(D̂i(Gj(Ej(x), αzj), x)) ].    (2)

Here, the loss function gets, as additional inputs, all the encoders and the vector zi. α controls the size of the sub-domain of the other generators, which is important in order to converge to "acceptable" images.
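To make the generator of Figure 3 concrete, here is a heavily simplified sketch of an encoder/decoder in which an MLP maps the random entropy vector to AdaIN scale and bias parameters. The layer counts and dimensions are illustrative assumptions and do not reproduce the released architecture.

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        # Strided convolutions followed by a residual block (Figure 3); sizes are illustrative.
        def __init__(self, ch=3, dim=64):
            super().__init__()
            self.down = nn.Sequential(
                nn.Conv2d(ch, dim, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(dim, dim, 4, stride=2, padding=1), nn.ReLU())
            self.res = nn.Sequential(
                nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                nn.Conv2d(dim, dim, 3, padding=1))
        def forward(self, x):
            h = self.down(x)
            return h + self.res(h)                           # the "mutual information" code

    class AdaINDecoder(nn.Module):
        # An MLP maps the random entropy vector z_i to per-channel AdaIN scale/bias [19].
        def __init__(self, ch=3, dim=64, z_dim=16):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, 2 * dim))
            self.up = nn.Sequential(
                nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(dim, ch, 4, stride=2, padding=1), nn.Tanh())
        def forward(self, code, z):
            gamma, beta = self.mlp(z).chunk(2, dim=1)
            mu = code.mean(dim=(2, 3), keepdim=True)
            std = code.std(dim=(2, 3), keepdim=True) + 1e-6
            normed = (code - mu) / std                       # instance normalization
            styled = gamma[..., None, None] * normed + beta[..., None, None]
            return self.up(styled)

    E_i, Dec_i = Encoder(), AdaINDecoder()
    x, z_i = torch.randn(2, 3, 64, 64), torch.randn(2, 16)
    y = Dec_i(E_i(x), z_i)                                   # one translated sample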

Figure 4: Differences and similarities between D̂i and Di ((a) council discriminator D̂i, (b) GAN discriminator Di). While the GAN discriminator distinguishes between "real" and "fake" images, the council discriminator distinguishes between outputs of its own generator and those produced by other generators. Furthermore, while the GAN's discriminator gets as input only the generator's output, the council's discriminator also gets the generator's input. This is because we wish the generator to produce a result that bears similarity to the input image, and not only one that looks real in the target domain.

Figure 4 illustrates the differences and the similarities between discriminators Di and D̂i. Both should distinguish between the generator's results and other images; in the case of Di the other images are real images from the target domain, whereas in the case of D̂i they are images generated by the other generators in the council. Another fundamental difference is their input: D̂i gets not only the generator's output, but also its input. This aims at producing a resulting image that has common features with the input image.

Final loss. For each member of the council, we jointly train the generator (assuming the encoder is included) and the discriminators to optimize the final objective. In essence, Gi, Di, and D̂i play a three-way min-max-max game with a value function V(Gi, Di, D̂i):

    min_{Gi} max_{Di} max_{D̂i} V(Gi, Di, D̂i) = GAN_loss_i + λ Council_loss_i.    (3)

This equation is a weighted sum of the adversarial loss GAN_loss_i (of Di), as defined in [33], and the Council_loss_i (of D̂i) from Equation (2). λ controls the importance of looking more "real" versus being more in line with the other generators. High values result in more similar images, whereas low values require less agreement and result in higher diversity between the generated images.
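One possible realization of a single training step for member i under Equation (3) is sketched below, using least-squares (LSGAN-style) targets in the spirit of [33]. The update order, the detach points, and the helper names are our assumptions; G_i, D_i, and D_hat_i stand for the member's networks.

    import torch
    import torch.nn.functional as F

    def member_step(G_i, D_i, D_hat_i, opt_g, opt_d, opt_dhat,
                    x, real_t, other_fakes, lam=0.5):
        # x: source-domain batch; real_t: target-domain batch
        # other_fakes: outputs G_j(x) of the other council members (treated as constants here)

        # (1) update the real/fake discriminator D_i with least-squares targets
        fake = G_i(x)
        real_score, fake_score = D_i(real_t), D_i(fake.detach())
        d_loss = F.mse_loss(real_score, torch.ones_like(real_score)) + \
                 F.mse_loss(fake_score, torch.zeros_like(fake_score))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # (2) update the council discriminator D_hat_i: my output = 0, others' outputs = 1
        mine = D_hat_i(fake.detach(), x)
        dhat_loss = F.mse_loss(mine, torch.zeros_like(mine))
        for other in other_fakes:
            score = D_hat_i(other.detach(), x)
            dhat_loss = dhat_loss + F.mse_loss(score, torch.ones_like(score))
        opt_dhat.zero_grad(); dhat_loss.backward(); opt_dhat.step()

        # (3) update the generator: look real to D_i and "agreeable" to D_hat_i (Eq. 3)
        fake = G_i(x)
        gan_score, council_score = D_i(fake), D_hat_i(fake, x)
        g_loss = F.mse_loss(gan_score, torch.ones_like(gan_score)) + \
                 lam * F.mse_loss(council_score, torch.ones_like(council_score))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), dhat_loss.item(), g_loss.item()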

Focus map. For some applications, it is preferable to focus on specific areas of the image and modify only them, leaving the rest of the image untouched. This can easily be accommodated into our general scheme, without changing the architecture.

The idea is to let the generator produce not only an image, but also an associated focus map, which essentially segments the learned objects in the domain from the background. All that is needed is to add a fourth channel, maski, to the generator, which generates values in the range [0, 1]. These values can be interpreted as the likelihood of a pixel belonging to the background (or to an object). To realize this, Equation (3) becomes

    min_{Gi} max_{Di} max_{D̂i} V(Gi, Di, D̂i) = GAN_loss_i + λ1 Council_loss_i + λ2 Focus_loss_i,    (4)

where

    Focus_loss_i = δ (Σ_k maski[k])^2 + Σ_k 1 / (|maski[k] − 0.5| + ε).    (5)

In Equation (5), maski[k] is the value of the fourth channel for pixel k. The first term attempts to minimize the size of the focus mask, i.e. to make it focus solely on the object. The second term is in charge of segmenting the image into an object and a background (1 or 0); this is done in order to avoid generating semi-transparent pixels. In our implementation ε = 0.01. The result is normalized by the image size. The values of λ1 and λ2 are application-dependent and are defined for each application in Section 5.
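The focus-map mechanism can be prototyped as follows. Interpreting the fourth channel as an alpha mask that composites the generated object over the input background is our reading of the text, while δ = 0.001 and ε = 0.01 are the values reported in the paper; the function names are ours.

    import torch

    def split_output(gen_out):
        # gen_out: (B, 4, H, W) -> an RGB image in [-1, 1] and a mask in [0, 1]
        rgb, mask_logit = gen_out[:, :3], gen_out[:, 3:4]
        return torch.tanh(rgb), torch.sigmoid(mask_logit)

    def composite(rgb, mask, x):
        # keep the background of the input x; modify only the masked object region
        return mask * rgb + (1.0 - mask) * x

    def focus_loss(mask, delta=0.001, eps=0.01):
        # Equation (5): a small mask, with values pushed towards 0 or 1
        b = mask.size(0)
        m = mask.view(b, -1)
        n = m.size(1)                                               # number of pixels
        size_term = delta * m.sum(dim=1).pow(2)                     # keep the focus mask small
        binary_term = (1.0 / (m.sub(0.5).abs() + eps)).sum(dim=1)   # avoid semi-transparent pixels
        return ((size_term + binary_term) / n).mean()               # normalized by the image size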

Figure 5 illustrates the importance of the various losses. If only the Focus_loss is used (jointly with the GAN loss), the faces of the input and the output are completely unrelated, though the quality of the images is good and the background does not change in most cases. If only the Council_loss is used, the faces of the input and the output are nicely related, but the background might change. Our loss, which combines the above losses, produces the best results.

Figure 5: Importance of the loss function components. This figure shows the results generated by the four council members for the male-to-female application, after 100K iterations (columns: input, member 1 to member 4). Top: using the Focus_loss (jointly with the classical GAN loss) generates nice images from the target domain, which are not necessarily related to the given image. Middle: using the Council_loss instead relates the input and output faces, but might change the environment (background). Bottom: our loss, which combines the above losses, both relates the input and output faces and focuses only on facial modifications.

We note that this idea of adding a fourth channel, which makes the generator focus on the proper areas of the image, can be used in other GAN architectures; it is not limited to our proposed council architecture.

4. Experiments

4.1. Experiment setup

We applied our Council-GAN to several challenging image-to-image translation tasks (Section 4.2).

Baseline models. Depending on the application, we compare our results to those of several state-of-the-art models, including CycleGAN [47], MUNIT [21], DRIT++ [28, 29], U-GAT-IT [25], StarGAN [7], and Fixed-Point GAN [42]. These methods are unsupervised and use cycle constraints. Out of these methods, MUNIT [21] and DRIT++ [28, 29] are multi-modal and generate several results for a given image; the others produce a single result. Furthermore, StarGAN [7] performs translation between multiple domains.

Datasets. We evaluated the performance of our system on the following datasets.

CelebA [31]. This dataset contains 202,599 face images of celebrities, each annotated with 40 binary attributes. We focus on two attributes: (1) the gender attribute and (2) the with/without glasses attribute. The training dataset contains 68,261 (/10,521) images of males (/with glasses) and 94,509 (/152,249) images of females (/without glasses). The test dataset consists of 16,173 (/2,672) males (/with glasses) and 23,656 (/37,157) females (/without glasses).

selfie2anime [25]. The training dataset consists of 3,400 selfie images and 3,400 anime images. The test dataset consists of 100 selfie images and 100 anime images.

Training. All models were trained using Adam [27] with β1 = 0.5 and β2 = 0.999. For data augmentation we flipped the images horizontally with a probability of 0.5. For the selfie/anime dataset, where the number of images is small, we also augmented the data with color jittering (hue up to 0.15), random grayscale with a probability of 0.25, random rotation of up to 35°, random translation of up to 0.1 of the image, and random perspective with a distortion scale of 0.35, applied with a probability of 0.5. For the last 100K iterations we trained only on the original data, without augmentation. We performed one generator update after a number of discriminator updates equal to the size of the council. The batch size was set to 3 for all experiments. We trained all models with a learning rate of 0.0001, where the learning rate drops by a factor of 0.5 after every 100,000 iterations. The focus and council losses were added after 10,000 iterations.
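As a concrete illustration of this training setup, the sketch below configures the optimizer and the selfie/anime augmentations with torchvision; it reflects the stated hyper-parameters, but the transform ordering and any unstated details are assumptions.

    import torch
    from torchvision import transforms

    def make_optimizer(params):
        # Adam with beta1 = 0.5, beta2 = 0.999, lr = 1e-4, halved every 100k iterations.
        opt = torch.optim.Adam(params, lr=1e-4, betas=(0.5, 0.999))
        sched = torch.optim.lr_scheduler.StepLR(opt, step_size=100_000, gamma=0.5)
        return opt, sched

    # Extra augmentations used for the small selfie/anime dataset; horizontal
    # flipping with probability 0.5 is applied to all datasets.
    selfie_anime_aug = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ColorJitter(hue=0.15),
        transforms.RandomGrayscale(p=0.25),
        transforms.RandomRotation(35),
        transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
        transforms.RandomPerspective(distortion_scale=0.35, p=0.5),
        transforms.ToTensor(),
    ])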

Computational cost. Training takes about twice as long as CycleGAN when the council members run sequentially on the same GPU. The longer time is due to (1) having four members, (2) a longer iteration of the council discriminator, and (3) twice as many iterations being needed to reach an agreement. It takes only twice as long because we avoid learning the reverse direction (e.g. from anime to selfie). To accelerate the computation, the members could be run in parallel, or a smaller council could be used.

Evaluation. We verify our results both qualitatively and quantitatively. For the latter, we use two common measures: (1) the Fréchet Inception Distance (FID) [16], which calculates the distance between the feature vectors of the real and the generated images; and (2) the Kernel Inception Distance (KID) [5], which improves on FID and measures GAN convergence.
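For reference, KID is the squared maximum mean discrepancy between Inception features of real and generated images under a cubic polynomial kernel [5]. A minimal self-contained sketch, which assumes the Inception features have already been extracted, is given below.

    import torch

    def polynomial_kernel(a, b):
        d = a.size(1)
        return (a @ b.t() / d + 1.0) ** 3

    def kid(real_feats, fake_feats):
        # real_feats: (n, d), fake_feats: (m, d) Inception feature matrices
        n, m = real_feats.size(0), fake_feats.size(0)
        k_rr = polynomial_kernel(real_feats, real_feats)
        k_ff = polynomial_kernel(fake_feats, fake_feats)
        k_rf = polynomial_kernel(real_feats, fake_feats)
        # unbiased MMD^2 estimate: drop the diagonals of the within-set terms
        return (k_rr.sum() - k_rr.diag().sum()) / (n * (n - 1)) \
             + (k_ff.sum() - k_ff.diag().sum()) / (m * (m - 1)) \
             - 2.0 * k_rf.mean()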

4.2. Experimental results

Experimental results for male-to-female translation. Given an image of a male face, the goal is to generate a female face which resembles the male face [1, 32]. As explained in [1], three features make this translation task challenging: (i) there is no predefined correspondence in the real data of each domain; (ii) the relationship between the domains is many-to-many, as many male-to-female mappings are possible; (iii) capturing realistic variations in the generated faces requires transformations that go beyond simple color and texture changes.

Figure 6: Male-to-female translation (columns: input, ours-1 to ours-4, CycleGAN [47], MUNIT [21], StarGAN [7], DRIT++ [29, 28]). Our results are more "feminine" than those generated by other state-of-the-art methods, while still preserving the main facial features of the input images.

Figure 6 compares our results, generated by a council of four members, to those of [7, 21, 29, 47]. Note that each council member may generate multiple results, depending on the random entropy vector. We observe that our generated females are more "feminine" (e.g., the beards completely disappear and the haircuts are longer), while still preserving the main features of the source male face and resembling it. This can be attributed to the fact that we do not use a cycle to go from a male to a female and back, and thus we do not need to preserve any masculine features. More examples are given in the supplementary material.

Table 1 summarizes our quantitative results, where our results are randomly chosen from those generated by the different members of the council. Our results outperform those of the other methods in both evaluation metrics.

                       FID      KID
    CycleGAN [47]      20.91    0.0012
    MUNIT [21]         19.88    0.0013
    StarGAN [7]        35.50    0.0027
    DRIT++ [29, 28]    26.24    0.0016
    Council            18.85    0.0010

Table 1: Quantitative results for male-to-female translation. Our council generates results that outperform other SOTA results. For both measures, the lower the better.

Experimental results for selfie-to-anime translation. Given an image of a human face, the goal is to generate an appealing anime, which resembles the human. This is a challenging task, as not only does the style differ, but the geometric structure of the input and the output also varies greatly (e.g. the size of the eyes). This might lead to mismatches between the structures, which would lead to distortions and visual artifacts. This difficulty comes in addition to the three challenges mentioned in the previous application: the lack of predefined correspondence between the domains, the many-to-many relationship, and going beyond color and texture.

Figure 7 shows our results using a council of four. Our generated anime images quite often resemble the input better, in terms of expression and face structure (e.g., the shape of the chin), than those of [21, 25, 28, 29, 47]. This can be explained by the fact that it is easier for the council members to "agree" on features that exist in the input. Table 2 shows the quantitative results. It can be seen that our results outperform, or are competitive with, those of the other methods in both evaluation metrics.

Figure 7: Selfie-to-anime translation (columns: input, ours-1 to ours-4, CycleGAN [47], MUNIT [21], U-GAT-IT [25], DRIT++ [29, 28]). Our results preserve the structure of the face in the input image, while generating the characteristic features of anime, such as the large eyes.

                       FID      KID
    CycleGAN [47]      149.38   0.0056
    MUNIT [21]         131.69   0.0057
    U-GAT-IT [25]      115.11   0.0043
    DRIT++ [29, 28]    109.22   0.0020
    Council            101.39   0.0020

Table 2: Quantitative results for selfie-to-anime translation. Our results outperform those of other methods when FID is considered and are competitive for KID.

Experimental results for glasses removal. Given an image of a person with glasses, the goal is to generate an image of the same person, but with the glasses removed. While in the previous applications the whole image changes, here the challenge is to modify only a certain part of the face and leave the rest of the image untouched.

                           FID      KID
    CycleGAN [47]          50.72    0.0038
    Fixed-Point GAN [42]   55.26    0.0041
    Council                36.38    0.0026

Table 3: Quantitative results for glasses removal. The results of our council outperform state-of-the-art results.

Figure 8 compares our results (using a council of four) to those of [42], which shows results for this application, as well as to [47]. Our generated images leave considerably fewer traces of the removed glasses. Again, this can be attributed to the lack of the cycle constraint. Table 3 provides quantitative results. For this application as well, our council manages to outperform the other methods and to address the challenge of removing large objects.

Figure 8: Glasses removal (columns: input, ours, Fixed-Point GAN [42], CycleGAN [47]). We show a single result per input, since multi-modality is irrelevant for this application. Our generated images remove the glasses almost completely, whereas traces are left in the results of [42] and [47].

5. Implementation

Our code is based on PyTorch; it is available at https://github.com/Onr/Council-GAN. We set the major parameters as follows: α, which controls diversity (Equation (2)), is set to 0.8; δ, which controls the size of the mask (Equation (5)), is set to 0.001. λ1 and λ2 from Equation (4) are set according to the application: in male-to-female, λ1 = 0.2 and λ2 = 0.025; in selfie-to-anime, λ1 = 0.5 and λ2 = 0; in glasses removal, λ1 = 0.2 and λ2 = 0.2.
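For convenience, these hyper-parameters can be gathered into one configuration block; the dictionary below restates the values above, with key names of our own choosing.

    # Hyper-parameters reported in Section 5 (key names are ours).
    COMMON = {"alpha": 0.8, "delta": 0.001}   # diversity (Eq. 2) and mask-size weight (Eq. 5)

    APPLICATION_WEIGHTS = {                   # lambda_1 and lambda_2 of Equation (4)
        "male_to_female":  {"lambda_council": 0.2, "lambda_focus": 0.025},
        "selfie_to_anime": {"lambda_council": 0.5, "lambda_focus": 0.0},
        "glasses_removal": {"lambda_council": 0.2, "lambda_focus": 0.2},
    }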

Figure 9 studies the influence of the number of members and the number of iterations on the quality of the results. We focus on the male-to-female application, which is representative. The fewer the members in the council, the faster the convergence; however, this comes at a price: the accuracy is worse. Furthermore, it can be seen that the KID improves with iterations, as expected.

Figure 9: KID as a function of the number of iterations. The more iterations, the better the KID. Moreover, with more council members, the model converges more slowly, yet the results improve.

Limitations. Figure 10 demonstrates a limitation of our method. When removing the glasses, the face might also become more feminine. This is attributed to an imbalance inherent to the dataset. Specifically, the ratio of men to women with glasses is 0.8, whereas the ratio of men to women without glasses is only 0.4. The result of this imbalance in the target domain is that removing glasses also means becoming more feminine. This problem could be solved by providing a dataset with an equal number of males and females, with and without glasses. Handling feature imbalance without changing the number of images in the dataset is an interesting direction for future research.

Figure 10: Limitation ((a) glasses removal, (b) male to female; each panel shows an input and our result). (a) When removing the glasses, the face also becomes more feminine. (b) Conversely, when transforming a male to a female, the glasses may also be removed. This is attributed to the high imbalance of the relevant features in the dataset.

6. Conclusion

This paper introduces the concept of a council of GANs, a novel approach to performing image-to-image translation between unpaired domains. The key idea is to replace the widely-used cycle-consistency constraint by leveraging collaboration between GANs. Council members assist each other to improve, each its own result.

Furthermore, the paper proposes an implementation of this concept and demonstrates its benefits for three challenging applications. The members of the council generate several alternative results for a given input. They manage to remove large objects from the images, to leave no redundant traces of the input, and to handle large shape modifications. The results outperform those of SOTA algorithms both quantitatively and qualitatively.

Acknowledgements: We gratefully acknowledge the support of the Israel Science Foundation (ISF) 1083/18, PMRI - Peter Munk Research Institute - Technion, and the NVIDIA Corporation for the donation of a GPU.


References

[1] A. Almahairi, S. Rajeshwar, A. Sordoni, P. Bachman, and A. Courville. Augmented CycleGAN: Learning many-to-many mappings from unpaired data. In 35th International Conference on Machine Learning (ICML), 2018.
[2] A. Anoosheh, E. Agustsson, and R. Timofte. ComboGAN: Unrestrained scalability for image domain translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 896–8967, 2018.
[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In 34th International Conference on Machine Learning (ICML), 2017.
[4] S. Benaim and L. Wolf. One-sided unsupervised domain mapping. In Advances in Neural Information Processing Systems, 2017.
[5] M. Binkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.
[6] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[7] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8789–8797, June 2018.
[8] C. Chu, A. Zhmoginov, and M. B. Sandler. CycleGAN, a master of steganography. CoRR, 2017.
[9] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. CoRR, 2017.
[10] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494, 2015.
[11] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pages 658–666, 2016.
[12] I. P. Durugkar, I. Gemp, and S. Mahadevan. Generative multi-adversarial networks. In International Conference on Learning Representations (ICLR), 2018.
[13] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2414–2423, 2016.
[14] A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. Torr, and P. K. Dokania. Multi-agent diverse generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8513–8521, 2018.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[16] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
[17] Q. Hoang, T. D. Nguyen, T. Le, and D. Phung. MGAN: Training generative adversarial nets with multiple generators. In International Conference on Learning Representations, 2018.
[18] X. Hua, D. Rempe, and H. Zhang. Unsupervised cross-domain image generation. In International Conference on Learning Representations (ICLR), 2017.
[19] X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In IEEE International Conference on Computer Vision (ICCV), pages 1510–1519, 2017.
[20] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5077–5086, 2017.
[21] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In The European Conference on Computer Vision (ECCV), 2018.
[22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[23] F. Juefei-Xu, V. N. Boddeti, and M. Savvides. Gang of GANs: Generative adversarial networks with maximum margin ranking. CoRR, 2017.
[24] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.
[25] J. Kim, M. Kim, H. Kang, and K. Lee. U-GAT-IT: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. CoRR, abs/1907.10830, 2019.
[26] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In 34th International Conference on Machine Learning - Volume 70 (ICML), 2017.
[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2014.
[28] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In The European Conference on Computer Vision (ECCV), 2018.
[29] H.-Y. Lee, H.-Y. Tseng, Q. Mao, J.-B. Huang, Y.-D. Lu, M. K. Singh, and M.-H. Yang. DRIT++: Diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1905.01270, 2019.
[30] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, 2017.
[31] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
[32] Y. Lu, Y.-W. Tai, and C.-K. Tang. Attribute-guided face generation using conditional CycleGAN. In European Conference on Computer Vision (ECCV), pages 282–297, 2018.
[33] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley. Least squares generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
[34] D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11:169–198, 1999.
[35] R. Polikar. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3):21–45, 2006.
[36] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In 4th International Conference on Learning Representations (ICLR), 2016.
[37] L. Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39, 2010.
[38] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed. Variational approaches for auto-encoding generative adversarial networks. CoRR, 2017.
[39] A. Royer, K. Bousmalis, S. Gouws, F. Bertsch, I. Mosseri, F. Cole, and K. Murphy. XGAN: Unsupervised image-to-image translation for many-to-many mappings. arXiv preprint arXiv:1711.05139, 2017.
[40] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.
[41] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2107–2116, 2017.
[42] M. M. R. Siddiquee, Z. Zhou, N. Tajbakhsh, R. Feng, M. B. Gotway, Y. Bengio, and J. Liang. Learning fixed points in generative adversarial networks: From image-to-image translation to disease detection and localization. In The IEEE International Conference on Computer Vision (ICCV), 2019.
[43] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. In International Conference on Learning Representations (ICLR), 2018.
[44] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[45] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In IEEE International Conference on Computer Vision (ICCV), pages 2849–2857, 2017.
[46] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), pages 5907–5915, 2017.
[47] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), 2017.
[48] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.