Dual Adversarial Network for Unsupervised Ground/Satellite-to-Aerial Scene Adaptation

Jianzhe Lin (jianzhelin@ece.ubc.ca), University of British Columbia
Lichao Mou, Technical University of Munich, German Aerospace Center
Tianze Yu∗, University of British Columbia
Xiaoxiang Zhu, Technical University of Munich, German Aerospace Center
Z. Jane Wang, University of British Columbia

Figure 1: Examples of scenes from the classes (a) Harbor, (b) Forest, (c) Residence, (d) Beach, and (e) Parking Lot. From top to bottom, the rows show the satellite view (collected from satellites), the aerial view (collected from UAVs), and the ground view (collected on land). Scenes from the satellite view have much lower resolution and clarity than those from the aerial view. Scenes from the ground view and the aerial view exhibit a huge domain gap even though their semantic labels are consistent.

ABSTRACT
Recent domain adaptation work tends to obtain a unified representation in an adversarial manner through joint learning of the domain discriminator and the feature generator. However, this domain adversarial approach could render sub-optimal performance for two potential reasons: First, it might fail to consider the task at hand when matching the distributions between the domains. Second, it generally treats the source and target domain data in the same way. In our opinion, the source domain data, which serves the feature adaptation purpose, should be supplementary, whereas the target domain data mainly needs to serve the task-specific classifier¹. Motivated by this, we propose a dual adversarial network for domain adaptation, where two adversarial learning processes are conducted iteratively, in correspondence with the feature adaptation and the classification task, respectively. The efficacy of the proposed method is first demonstrated on the Visual Domain Adaptation Challenge (VisDA) 2017, and then on two newly proposed Ground/Satellite-to-Aerial Scene adaptation tasks. For the proposed tasks, data for the same scene are collected not only by traditional cameras on the ground, but also by satellites from outer space and by unmanned aerial vehicles (UAVs) at high altitude. Since the semantic gap between the ground/satellite scene and the aerial scene is much larger than that between ground scenes, the newly proposed tasks are more challenging than traditional domain adaptation tasks. The datasets/codes can be found at https://github.com/jianzhelin/DuAN.

∗Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MM ’20, October 12–16, 2020, Seattle, WA, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7988-5/20/10. $15.00
https://doi.org/10.1145/3394171.3413893

CCS CONCEPTS
• Computing methodologies → Computer vision tasks.

¹ A task-specific classifier means a classifier trained for a specific task such as object classification or semantic segmentation. In this paper, our task is image classification.


KEYWORDS
Domain Adaptation, Ground/Satellite-to-Aerial Scene, Task-specific

ACM Reference Format:
Jianzhe Lin, Lichao Mou, Tianze Yu, Xiaoxiang Zhu, and Z. Jane Wang. 2020. Dual Adversarial Network for Unsupervised Ground/Satellite-to-Aerial Scene Adaptation. In Proceedings of the 28th ACM International Conference on Multimedia (MM ’20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3394171.3413893

1 INTRODUCTION
Recent advances in deep learning not only bring impressive performance for data processing, but also aggravate the burden of data annotation. To train a reliable deep neural network, a large amount of annotated data is required. This annotation concern is especially severe for remote sensing data. Nowadays, with much easier access to this type of data, annotating newly collected remote sensing data has become a big problem, as human labor for annotation is expensive and limited prior knowledge exists for remote sensing data.

Domain adaptation can address this problem in a straightforward manner. Through domain adaptation, the label-scarce remote sensing data (the target domain) can borrow information directly from the label-rich regular RGB image data (the source domain). As data from these two domains are hard to align, effective adaptation is challenging. The task is even more challenging when the target remote sensing samples are totally unlabeled. In this work, we propose a novel unsupervised domain adaptation (UDA) method to tackle this challenge.

A popular research direction in UDA is based on adversarial learning, which aligns data with different distributions in an adversarial manner: A feature generator is trained to generate domain-invariant features for both source and target domain samples, in order to fool a domain discriminator which is trained to discriminate the domain labels of the features produced by the generator [1][22].
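For reference, the single-generator objective behind this line of work can be written, in generic DANN-style form (our paraphrase rather than the exact formulation of any one cited paper), as

$$\min_{G}\max_{D}\; \mathbb{E}_{x_s \sim X_s}\big[\log D(G(x_s))\big] + \mathbb{E}_{x_t \sim X_t}\big[\log\big(1 - D(G(x_t))\big)\big],$$

where a single generator $G$ serves both domains and $D$ predicts the domain label of each generated feature.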

However, there are two potential limitations of the above adversarial-learning-based UDA. First, the method might not be task-specific. The adapted target domain data could lose its discriminative data distribution, which is essential for its classification [14][21][11]. The generated aligned feature vectors of the target data might not perform well with task-specific classifiers. Second, the source and target domain data are treated in the same way during the adaptation process. To be more specific, raw data from the two different domains pass through a shared feature generator and then a task-specific classifier. Such a process may not be preferred, as data from the two domains should serve different purposes: The target domain data needs to serve the task-specific classifiers, whereas the source domain data should be supplementary. The objective for the source domain data is mainly related to the feature adaptation rather than to the classification task. To make the two domains each function well for their own objective, we propose the dual adversarial network.

In this work, we assign the two domains domain-specific tasks. The source domain mainly serves the feature adaptation, whereas the target domain is task-specific. To achieve the task-specific goal with unlabeled target domain data, we introduce two individual classifiers, which classify source samples correctly, to simultaneously provide inconsistent classification results for the target domain data. The model loss is generated from this inconsistency to optimize the target domain feature generator. The dual adversarial learning is proposed to complete the domain-specific tasks.

The proposed dual adversarial network (DuAN) includes four players: two task-specific classifiers, the source feature generator, the target feature generator, and the domain discriminator. A comparison between the proposed DuAN and the classical domain adversarial network can be found in Fig. 2. In the first adversarial learning phase, the source domain feature generator generates features by mimicking the target domain features, which are fixed in this phase, to fool the domain discriminator. In the second adversarial learning phase, the task-specific classifiers, whose weights are initialized on the source domain features generated in the first phase, yield inconsistent classification results to fool the target domain feature generator into mistakenly believing that the two classifiers are for different tasks. Such a feature generator acts more like a “task discriminator": It only recognizes that the two classifiers are for the same task when the two task-specific classifiers provide the same classification results. These two phases iterate until the domain discriminator is fooled and, meanwhile, the target feature generator no longer gets fooled. Compared with traditional adversarial domain adaptation, our source domain feature generator only needs to generate features for the feature adaptation, and thus the generated features are better aligned and adapted; the target domain feature generator, which does not participate in adaptation directly but only plays the adversarial game with the classifiers, generates much more discriminative features. The major contributions of this paper can be summarized as follows:
• We propose separate feature generators that serve domain-specific purposes (i.e., the feature adaptation and the classification task). The generated target domain features can better preserve the discriminative target domain data distribution.
• We propose the Dual Adversarial Network (DuAN). The network is trained in a stepwise manner. Four “players" play two adversarial games in DuAN, one for the feature adaptation and the other for the classification task.
• We investigate a novel, challenging Ground/Satellite-to-Aerial Scene Adaptation task (GSSA). This task not only explores the effectiveness of domain adaptation for remote sensing data (satellite-to-aerial), but also aims to solve the label-scarce problem for the aerial scene (ground-to-aerial). Examples of data for GSSA are shown in Fig. 1.

2 RELATED WORK

2.1 Adversarial Domain Adaptation
Recent years have witnessed the exploitation of adversarial domain adaptation, which stems from the technique proposed in [9]. This type of adversarial domain adaptation has one feature generator as well as one domain discriminator [1][26][31]. The generated features from the two domains are aligned together to fool the domain discriminator until it cannot recognize which domain the features come from. Early on, such alignment was realized by simple batch normalization statistics [17][4][16][5], which aligned the source and target domains to a canonical one. By further introducing losses to mix up data from the two domains, it became more difficult for the domain discriminator to classify the domains [10][25][30][39]. However, such methods were not task-specific, which meant the generated features might not work well with the classifier [24][2][19]. Recently, the Maximum Classifier Discrepancy (MCD) method was proposed to make the adversarial mechanism task-specific by constructing adversarial learning between task-specific classifiers and the feature generator [8][13][12]. To be more specific, two task-specific classifiers simultaneously take features from the generator. The feature generator tries to fool the two classifiers by generating ambiguous features for input samples [20], while the two task-specific classifiers try their best to produce consistent results to avoid being fooled by the feature generator.

Figure 2: (Best viewed in color.) Illustration of the mechanism comparison between the classical adaptation approach and the proposed DuAN. (a) The classifier cannot classify the target domain data well although the two domains are aligned well, because task-specific classifiers might not be considered during adaptation. (b) Two individual task-specific classifiers, first trained on the source domain data, provide inconsistent classification results for the target domain data. Such discrepancy is minimized in an iterative way: 1. the source data feature mimics the target data feature; 2. the classifiers are updated based on the new source data distribution and provide a new discrepancy; 3. the target data feature is updated to minimize this discrepancy. In the end, the target data is suitable for various task-specific classifiers.

However, we have to point out that the MCD framework ignores the effectiveness of the feature generator. The data from the two domains, which serve different tasks (the source domain data is mainly for the feature transfer task and the target domain data is mainly for the classification task), should not generate features in the same way. The same feature generator for the source and target domain data might not provide powerful unified features if the data from the two domains have a large semantic/feature gap. To address this concern in challenging and more practical domain adaptation scenarios, we propose the DuAN method.

2.2 Ground/Satellite-to-Aerial Scene Adaptation (GSSA)

In this work, we mainly want to apply domain adaptation to remote sensing data. Remote sensing data can be generally divided into satellite data and aerial data. Nowadays, with much easier access to such remote sensing data, its annotation is a highly practical concern. We first explore the relationship between different types of remote sensing data through domain adaptation between the satellite scene and the aerial scene. We then explore domain adaptation to help with the annotation of remote sensing data by taking advantage of ground scene data. We name these two tasks GSSA tasks. Examples for such tasks are shown in Fig. 1. We assume that image data captured from different views of the same scene class have consistent underlying intrinsic semantic characteristics, despite a large feature gap. With rich information transferred from the ground view data that can be easily obtained from ImageNet [6] or SUN [35], the understanding and annotation of label-scarce aerial data can be better served.

Previously, works addressing this cross-view (ground-to-aerial) domain adaptation problem were mainly based on image geolocalization [33]. There were also works [28][29][27][7] that treated the scene transfer from ground to aerial as a particular case of cross-domain adaptation, in which the divergences across domains were caused by viewpoint changes. However, all existing methods were based on relatively simple models and were tested on small datasets, and there is no existing benchmark for this challenging task. In this paper, we propose, for the first time, a unified GSSA benchmark for the domain adaptation task.

3 MODEL
In this section, an overview of the proposed Dual Adversarial Network (DuAN) is first given to present a comprehensive picture. Afterward, the model initialization and training are described respectively.

3.1 Overview
As illustrated in Fig. 3, five components exist in our framework: the domain discriminator $D_1$, the source feature generator $G_1$, the target feature generator $G_2$, the classifier $C_1$, and the classifier $C_2$. The general process is separated into two parts, model initialization and parameter learning. The feature generators $G_1$ and $G_2$ and the domain discriminator $D_1$ are initialized by adversarial learning, while the classifiers $C_1$ and $C_2$ are initialized by classification on the source domain features. The parameters of every component are then learned in a stepwise manner. First, $G_2$, acting as a “task discriminator", is optimized based on the classification discrepancy between $C_1$ and $C_2$, and the output feature of $G_2$ is updated. Second, the parameters of $G_1$ and $D_1$ are optimized using the feature discrepancy between the newly generated $G_2$ feature and the former $G_1$ feature, and the new $G_1$ feature is generated. Third, $C_1$ and $C_2$ are optimized by the cross-entropy loss based on the $G_1$ feature. The updated $C_1$ and $C_2$ then return to step one to update $G_2$. The three steps iterate until convergence. In this process, $G_2$ is fully task-specific, whereas the major task of $G_1$ is to generate features of the source domain to mimic the target domain features. These three steps are illustrated in Fig. 3.


Figure 3: The flowchart of the proposed DuAN. Two adversarial processes exist: the one for the feature adaptation is realized by the source flow (orange color), and the other, for the classification task, is realized by the target flow (purple color). Flow here means the forward and backward propagation in the neural network. Steps 1-3 refer to the three iterative training steps; the components in the corresponding step are updated iteratively. “Ini" is the abbreviation for model initialization.

The inputs of the general framework are formulated as follows. The labeled source domain data is represented by $X_s = \{x_s^i, y_s^i\}_{i=0}^{N_s}$, and the unlabeled target domain data is represented by $X_t = \{x_t^i\}_{i=0}^{N_t}$. $N_s$ and $N_t$ denote the numbers of samples in the two domains, respectively. The source domain feature set $F_s = \{f_s^i, y_s^i\}_{i=0}^{N_s}$ with known labels $y_s$ is first generated by $f_s = G_1(x_s; \theta_{G_1})$, where $\theta_{G_1}$ denotes the parameters of $G_1$. The target domain feature set is generated by $f_t = G_2(x_t; \theta_{G_2})$, where $G_2$ is the target feature generator and $\theta_{G_2}$ denotes its parameters.
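For concreteness, a minimal data-loading sketch in PyTorch is given below; the folder layout, file paths, and transform choices are our own assumptions for illustration and are not specified in the paper.

```python
# Hypothetical loading of the labeled source set X_s and the unlabeled target set X_t.
import torch
from torchvision import datasets, transforms

# Images are re-scaled to 256 x 256 (Sec. 4.1); the lack of normalization is an assumption.
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

# Assumed directory layout: one sub-folder per scene class; the paths are placeholders.
source_set = datasets.ImageFolder("data/source_view", transform=transform)  # X_s = {(x_s, y_s)}
target_set = datasets.ImageFolder("data/target_view", transform=transform)  # X_t = {x_t}; labels unused

source_loader = torch.utils.data.DataLoader(source_set, batch_size=64, shuffle=True)
target_loader = torch.utils.data.DataLoader(target_set, batch_size=64, shuffle=True)

x_s, y_s = next(iter(source_loader))  # labeled source batch
x_t, _ = next(iter(target_loader))    # target batch; the label is discarded (unsupervised setting)
```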

3.2 Model Initialization
The model is first initialized conventionally. The source and target domain features are the inputs to the domain discriminator, which is represented as $D_1(f_s, f_t; \theta_{D_1})$. The two generators try to fool $D_1$, while $D_1$ is maximized to classify the features' domain labels. At the same time, the two classifiers assign labels to the source domain features based on the regular cross-entropy loss. These two classifiers are formulated as $C_1(f_s; \theta_{C_1})$ and $C_2(f_s; \theta_{C_2})$. Our first min-max objective is

$$\min_{\theta_{C_1}, \theta_{C_2}} \max_{\theta_{G_1}, \theta_{G_2}, \theta_{D_1}} \alpha_1 \mathcal{L}_{d_1}(D_1, G_1, G_2) + \beta_1 \mathcal{L}_{t_1}(G_1, C_1, C_2), \quad (1)$$

where $\alpha_1$ and $\beta_1$ are weights for the two losses, and we define $\mathcal{L}_{d_1}$ and $\mathcal{L}_{t_1}$ as

$$\mathcal{L}_{d_1}(D_1, G_1, G_2) = \mathbb{E}_{x_t}\left[\log D_1(G_2(x_t; \theta_{G_2}); \theta_{D_1})\right] + \mathbb{E}_{f_s}\left[\log\left(1 - D_1(f_s; \theta_{D_1})\right)\right], \quad (2)$$

$$\mathcal{L}_{t_1}(C_1, C_2, G_1) = \mathbb{E}_{f_s, y_s, z}\left[-y_s^{\mathrm{T}} \log C_1(f_s; \theta_{C_1})\right] + \mathbb{E}_{f_s, y_s, z}\left[-y_s^{\mathrm{T}} \log C_2(f_s; \theta_{C_2})\right], \quad (3)$$

where $y_s$ denotes the one-hot encoding of the labels of the source domain data. In both equations, $f_s = G_1(x_s, z; \theta_{G_1})$ as defined earlier. In our implementation, we use ResNet to extract the features for both $G_1$ and $G_2$, and $D_1$, $C_1$, and $C_2$ are regular ResNet-style classifiers. For the above min-max objective, we solve the problem by alternately updating $\theta_{G_1}, \theta_{G_2}$ (freezing $\theta_{D_1}, \theta_{C_1}, \theta_{C_2}$) and $\theta_{D_1}, \theta_{C_1}, \theta_{C_2}$ (freezing $\theta_{G_1}, \theta_{G_2}$). All parameters of the proposed model can be initialized in this way.
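To make the five components concrete, here is a rough PyTorch sketch that builds $G_1$ and $G_2$ from a torchvision ResNet-101 backbone and uses small fully connected heads for $D_1$, $C_1$, and $C_2$. The head architectures and layer sizes are assumptions on our part; the paper only states that ResNet is used for feature extraction and that the discriminator and classifiers are regular ResNet-style classifiers.

```python
# Sketch of the five components; layer sizes are assumptions, not the authors' exact architecture.
import torch.nn as nn
from torchvision import models

def make_feature_generator():
    # ResNet-101 backbone with the final classification layer removed (outputs a 2048-d feature).
    backbone = models.resnet101(pretrained=True)
    backbone.fc = nn.Identity()
    return backbone

num_classes = 15  # e.g., the ground-to-aerial task in Sec. 4.1

G1 = make_feature_generator()  # source feature generator
G2 = make_feature_generator()  # target feature generator (separate weights)

# Domain discriminator D1: feature -> probability that the feature comes from the target domain.
D1 = nn.Sequential(nn.Linear(2048, 1024), nn.ReLU(), nn.Linear(1024, 1), nn.Sigmoid())

# Two task-specific classifiers C1 and C2 with independently initialized weights.
C1 = nn.Sequential(nn.Linear(2048, 1024), nn.ReLU(), nn.Linear(1024, num_classes))
C2 = nn.Sequential(nn.Linear(2048, 1024), nn.ReLU(), nn.Linear(1024, num_classes))
```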

3.3 Model Training
After the initialization of the model parameters, we can obtain differing classification results from $C_1$ and $C_2$. The subsequent model training is divided into three steps.

Step 1 and classifier discrepancy loss: In this step, we use the discrepancy loss to train the target feature generator $G_2$, while the other components are frozen. The two classifiers try to fool $G_2$ with inconsistent classification results, whereas $G_2$ tries to generate features that make the two outputs look the same, to avoid being fooled. Here we introduce $D_2$ to identify the difference between the results of the two classifiers. $D_2$ is only an identifier with no parameters. The objective of this step is to minimize the discrepancy loss defined in Eq. 4 as

$$\mathcal{L}_{d_2}(D_2, C_1, C_2) = D_2\left(C_1(f_t; \theta_{C_1}), C_2(f_t; \theta_{C_2})\right). \quad (4)$$

Here $\mathcal{L}_{d_2}$ is the discrepancy loss between the two classifiers. The only variable in this step is $\theta_{G_2}$. Different from $D_1$, which is defined by a neural network, $D_2$ is just an identifier defined as

$$D_2(x, y) = \frac{1}{N} \sum_{n=1}^{N} |x_n - y_n|, \quad (5)$$

where $N$ is the total number of elements of $x$ and $y$ ($x$ and $y$ must have the same number of elements). We use the L1 norm to calculate the difference between the two inputs.
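A minimal sketch of Step 1, reusing the components above: $D_2$ is just the element-wise L1 distance of Eq. 5 between the two classifiers' softmax outputs, and only the optimizer of $G_2$ is stepped, so the other components stay frozen in effect. Applying softmax before $D_2$ is our assumption.

```python
import torch
import torch.nn.functional as F

def D2(p1, p2):
    # Eq. 5: mean absolute difference over all elements of the two output tensors.
    return (p1 - p2).abs().mean()

def step1_update_G2(x_t, G2, C1, C2, opt_G2):
    # Step 1: minimize the classifier discrepancy L_d2 (Eq. 4) with respect to G2 only.
    f_t = G2(x_t)
    p1 = F.softmax(C1(f_t), dim=1)
    p2 = F.softmax(C2(f_t), dim=1)
    loss_d2 = D2(p1, p2)
    opt_G2.zero_grad()
    loss_d2.backward()   # gradients also reach C1/C2, but their optimizer is never stepped here
    opt_G2.step()
    return loss_d2.item()
```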


Step 2 and feature adversarial loss: In this step, we train the feature generator $G_1$ and the domain discriminator $D_1$ in an adversarial manner, with all other components frozen. Different from traditional UDA, only the features from the feature generator $G_1$ are updated to appear as if they were generated by $G_2$, in order to fool $D_1$, which tries its best to discriminate the features of the two domains. The objective of this step is to minimize the discrepancy between the source and target domain features measured by $D_1$, which is formulated in Eq. 6 as

$$\min_{\theta_{G_1}, \theta_{D_1}} \mathcal{L}_{d_1}(D_1, G_1, G_2), \quad (6)$$

in which $\mathcal{L}_{d_1}$ is the feature adversarial loss defined in Eq. 2. This loss optimizes the network parameters in a Gradient Reverse Learning (GRL) [9] manner, as a higher loss means worse adaptation performance. The variables to be optimized in this step are $\theta_{G_1}$ and $\theta_{D_1}$. After this step, the feature output of $G_1$ is updated and will be used to optimize the classifiers. However, as $G_2$ is not involved in this step, the target domain features it generates remain related only to the classification task.
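Below is a hedged sketch of Step 2 with a standard gradient reversal layer in the style of [9]; the binary cross-entropy form of the domain loss, the domain-label convention, and the reversal coefficient are our assumptions about one way to realize Eq. 6.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Gradient Reversal Layer: identity in the forward pass, negated (scaled) gradient backward.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

def step2_update_G1_D1(x_s, x_t, G1, G2, D1, opt_G1_D1):
    # Step 2: source features from G1 are pushed to look like the (frozen) target features from G2.
    f_s = G1(x_s)
    with torch.no_grad():
        f_t = G2(x_t)                      # G2 is frozen in this step
    d_s = D1(grad_reverse(f_s))            # the GRL flips gradients so that G1 fools D1
    d_t = D1(f_t)
    # Assumed domain-label convention matching Eq. 2: target = 1, source = 0.
    loss_d1 = F.binary_cross_entropy(d_t, torch.ones_like(d_t)) + \
              F.binary_cross_entropy(d_s, torch.zeros_like(d_s))
    opt_G1_D1.zero_grad()
    loss_d1.backward()
    opt_G1_D1.step()
    return loss_d1.item()
```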

Step 3 and cross-entropy loss: In this step, we train $C_1$ and $C_2$ with the other components frozen. This step has two objectives: the first is to make the two classifiers as dissimilar as possible, for the adversarial purpose as in Step 1; the second is to maximize the classification accuracy of both classifiers on features from $G_1$ by minimizing the cross-entropy losses, which is a task-specific objective. To jointly consider these two objectives, the objective function is defined as

$$\max_{\theta_{C_1}, \theta_{C_2}} \alpha_2 \mathcal{L}_{d_2}(D_2, C_1, C_2) + \beta_2 \mathcal{L}_{t_2}(G_1, C_1, C_2), \quad (7)$$

where $\alpha_2$ and $\beta_2$ are weights for the two losses, and $\alpha_2/\beta_2 = \alpha_1/\beta_1$. We define $\mathcal{L}_{t_2}$ the same as $\mathcal{L}_{t_1}$ in Eq. 3, and $\mathcal{L}_{d_2}$ the same as in Eq. 4.

For both $C_1$ and $C_2$, the inputs are the features from $G_1$ and $G_2$.

Dual Adversarial Network Training: The detailed training process for the Dual Adversarial Network can also be found in Alg. 1. Two adversarial parts exist in the algorithm: the first is between Step 1 and Step 3, and the other is within Step 2 between $G_1$ and $D_1$. The three steps iterate not only until the classification results on the $G_1$ features have converged, but also until: 1. $D_1$ gets fooled by $G_1$ and cannot discriminate which domain the data come from; 2. $G_2$ no longer gets fooled by $C_1$ and $C_2$ and recognizes that the two classifiers are for the same task. We want to point out that these three steps cannot be integrated into one step, as Step 1 and Step 3 have adversarial objectives and the inputs of the three steps are different. However, the order of the three steps is not important. The three steps iterate until convergence.

4 EXPERIMENTS
In the experimental part, we conduct experiments on three tasks. The first is the Visual Domain Adaptation Challenge (VisDA) 2017 for image classification. The second is domain adaptation between two types of remote sensing scenes (namely the satellite scene and the aerial scene), in order to explore the relationship between them. The third is the Ground-to-Aerial Scene Adaptation task, which is the most challenging. Below, we first describe the datasets.

Algorithm 1: Training for DuAN.
Input: image normalization for both the source and the target domain data;
Output: the optimized weights for $G_1$, $G_2$, $D_1$, $C_1$, $C_2$;
while epoch ≤ max epoch do
    for batch ← 1 to N do
        Step 1: Input the normalized target domain data with index $(N+1)/2 \to N$, and optimize $G_2$ by minimizing Eq. 4;
        Step 2: Input the normalized source domain data with index $0 \to N/2$, and optimize $G_1$ and $D_1$ by minimizing Eq. 6;
        Step 3: Input the normalized source domain data with index $0 \to N/2$, and optimize $C_1$ and $C_2$ by maximizing Eq. 7.
    end
end
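Putting the three steps together, one epoch of Algorithm 1 might look like the sketch below. It reuses the components, helper functions, and data loaders sketched above, and assumes one Adam optimizer per step (learning rate from Sec. 4.1) and a particular sign for the discrepancy term in Eq. 7; both are our reading of the stepwise training rather than details the paper spells out.

```python
import itertools
import torch
import torch.nn.functional as F

# Assumed per-step optimizers; the learning rate of 2e-4 follows Sec. 4.1.
opt_G2 = torch.optim.Adam(G2.parameters(), lr=2e-4)
opt_G1_D1 = torch.optim.Adam(itertools.chain(G1.parameters(), D1.parameters()), lr=2e-4)
opt_C = torch.optim.Adam(itertools.chain(C1.parameters(), C2.parameters()), lr=2e-4)

alpha, beta = 0.1, 1.0  # alpha / beta = 0.1 as in Sec. 4.1 (the absolute values are assumed)

for (x_s, y_s), (x_t, _) in zip(source_loader, target_loader):
    # Step 1: update G2 by minimizing the classifier discrepancy (Eq. 4).
    step1_update_G2(x_t, G2, C1, C2, opt_G2)

    # Step 2: update G1 and D1 with the feature adversarial loss (Eq. 6).
    step2_update_G1_D1(x_s, x_t, G1, G2, D1, opt_G1_D1)

    # Step 3: update C1 and C2 (Eq. 7), keeping them accurate on source features
    # while encouraging disagreement on target features.
    f_s = G1(x_s).detach()
    f_t = G2(x_t).detach()
    ce = F.cross_entropy(C1(f_s), y_s) + F.cross_entropy(C2(f_s), y_s)      # L_t2
    disc = D2(F.softmax(C1(f_t), dim=1), F.softmax(C2(f_t), dim=1))         # L_d2
    loss_c = beta * ce - alpha * disc  # one plausible reading of "maximize Eq. 7" for the discrepancy term
    opt_C.zero_grad()
    loss_c.backward()
    opt_C.step()
```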

4.1 Datasets and Setup
VisDA 2017 challenge: We first evaluate our proposed DuAN model on the VisDA 2017 challenge. A detailed introduction can be found in the supplementary material.

Satellite-to-aerial scene adaptation: For this task, we collect nine classes for domain adaptation: River, Parking Lot, Overpass, Harbor, Forest, Building, Beach, Residential, and Agricultural. The data are mainly collected from the WHU-RS dataset [34] and the UCMerced dataset [38], as well as data we collected online and through our collaborators. The data from the satellite view have much lower resolution and clarity compared with the data from the aerial view. All images are re-scaled to a resolution of 256 × 256. There are 53 images per class for the source domain and 100 images per class for the target domain, 1,377 images in total. A visual comparison of these two types of remote sensing data is shown on the left of Fig. 4.

Ground-to-aerial scene adaptation: For this task, we include 15 classes, as shown in Fig. 4. Each image is re-scaled to a resolution of 256 × 256. Each class has 5,800 images (5,000 from the source domain and 800 from the target domain), and the datasets contain 87,000 images in total. We randomly choose 25,000 images from the source domain for training, and use the trained model on the validation data, which is a randomly chosen 5% of the target domain data. After validation, we use the remaining target domain data for testing. For this task, the data from the ground view have a huge distribution gap compared with the data from the aerial view, as can be seen in the examples, which makes the task highly challenging. Moreover, the similarity between classes within the same view also makes this task difficult. For example, the features of the parking lot are similar to those of the harbor, and the runway looks similar to the bridge from the aerial view. We also point out that “water park" corresponds to “water plant": we pair these similar classes because “water park" is lacking in the aerial scene. Data examples for this task are shown on the right of Fig. 4.
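As a small worked example, the split sizes implied by these numbers are computed below; applying the 5% validation fraction to the full target set is our reading of the protocol.

```python
# Worked split sizes for the ground-to-aerial task (numbers from the paragraph above).
num_classes = 15
source_per_class, target_per_class = 5000, 800

total_images = num_classes * (source_per_class + target_per_class)  # 15 * 5800 = 87,000
source_total = num_classes * source_per_class                       # 75,000 source images
target_total = num_classes * target_per_class                       # 12,000 target images

train_source = 25000                                                # randomly chosen source images
val_target = int(0.05 * target_total)                               # 5% of the target data -> 600
test_target = target_total - val_target                             # remaining 11,400 for testing
```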

For the network setup, we use the Adam optimizer with a learning rate of $2 \times 10^{-4}$ and no decay. The batch size is set to 64. For Eq. 1 and Eq. 7, $\alpha/\beta = 0.1$. A comparison of experimental results for different $\alpha/\beta$ can be found in Sec. 4.5. The above parameter settings are used in all scenarios.

4.2 VisDA Challenge
4.2.1 Setup. In this experiment, we first evaluate the performance of the proposed DuAN model on the VisDA 2017 challenge. Due to the large number of training images, we run the experiments on our server. For the hardware, the CPU is an AMD Ryzen Threadripper 2990WX and the GPUs are NVIDIA RTX TITAN × 2, with 128 GB of memory.


Table 1: Accuracy (%) results for the VisDA 2017 Challenge task with ResNet-101 as the base network

Method    | Plane | Bcycl | Bus  | Car  | Horse | Knife | Mcycl | Person | Plant | Sktbrd | Train | Truck | Average
Source    | 55.1  | 53.3  | 61.9 | 59.1 | 80.6  | 17.9  | 79.7  | 31.2   | 81.0  | 26.5   | 73.5  | 8.5   | 52.4
DAN [18]  | 87.1  | 63.0  | 76.5 | 42.0 | 90.3  | 42.9  | 85.9  | 53.1   | 49.7  | 36.3   | 85.8  | 20.7  | 61.1
DANN [9]  | 81.9  | 77.7  | 82.8 | 44.3 | 81.2  | 29.5  | 65.1  | 28.6   | 51.9  | 54.6   | 82.8  | 7.8   | 57.4
MCD [8]   | 87.0  | 60.9  | 83.7 | 64.0 | 88.9  | 79.6  | 84.7  | 76.9   | 88.6  | 40.3   | 83.0  | 25.8  | 71.9
HAFN [36] | 92.7  | 55.4  | 82.4 | 70.9 | 93.2  | 71.2  | 90.8  | 78.2   | 89.1  | 50.2   | 88.9  | 24.5  | 73.9
SAFN [36] | 93.6  | 61.3  | 84.1 | 70.6 | 94.1  | 79.0  | 91.8  | 79.7   | 9.9   | 55.6   | 89.0  | 24.4  | 76.1
DuAN      | 96.4  | 84.3  | 80.9 | 82.4 | 97.3  | 86.9  | 92.1  | 77.4   | 92.5  | 78.4   | 74.1  | 29.2  | 80.2

Table 2: Accuracy (%) results for the Satellite-to-aerial Scene Adaptation task with ResNet-101 as the base network

Method    | River | Parking lot | Overpass | Harbor | Forest | Building | Beach | Residential | Agricultural | Average
Source    | 53.0  | 0.0         | 0.0      | 4.0    | 44.0   | 14.0     | 0.0   | 22.0        | 52.0         | 21.0
DANN [9]  | 53.0  | 0.0         | 0.0      | 92.0   | 98.0   | 0.0      | 0.0   | 0.0         | 28.0         | 24.2
PADA [3]  | 75.0  | 94.0        | 83.0     | 84.0   | 50.0   | 21.0     | 83.0  | 80.0        | 69.0         | 71.0
MEDA [32] | 93.0  | 96.0        | 64.0     | 96.0   | 78.0   | 51.0     | 93.0  | 88.0        | 83.0         | 82.4
JADA [15] | 91.0  | 93.0        | 63.0     | 96.0   | 54.0   | 67.0     | 95.0  | 83.0        | 76.0         | 79.7
HAFN [37] | 75.0  | 91.0        | 58.0     | 90.0   | 79.0   | 33.0     | 86.0  | 70.0        | 73.0         | 72.7
SAFN [36] | 64.0  | 100.0       | 67.0     | 95.0   | 100.0  | 70.0     | 99.0  | 60.0        | 94.0         | 83.2
MCD [8]   | 84.0  | 100.0       | 65.0     | 100.0  | 100.0  | 51.0     | 100.0 | 79.0        | 89.0         | 85.3
SWD [13]  | 90.0  | 100.0       | 53.0     | 92.0   | 59.0   | 23.0     | 96.0  | 80.0        | 74.0         | 74.1
DTA [14]  | 87.0  | 76.0        | 89.0     | 91.0   | 91.0   | 62.0     | 96.0  | 76.0        | 78.0         | 82.9
DuAN      | 49.0  | 98.0        | 91.0     | 100.0  | 99.0   | 100.0    | 99.0  | 90.0        | 94.0         | 91.1

Figure 4: Left: Examples from the proposed Satellite-to-aerial domain adaptation datasets with 9 categories (e.g., River, Overpass, Building, Agricultural), shown in the satellite view and the aerial view. Right: Examples from the proposed Ground-to-aerial domain adaptation datasets with 15 categories (the classes in Fig. 1 are not repeated): Airplane, Baseball Field, Basketball Court, Bridge, Cross Walk, Golf Field, Parking Space, Runway, Swimming Pool, and Water Park, shown in the ground view and the aerial view.

Figure 5: (a)-(b): t-SNE [23] visualization results of domain adaptation methods for the Satellite-to-aerial scene adaptation ((a) Source Only, (b) Adapted (Ours)). (c)-(d): t-SNE [23] visualization results for the Ground-to-aerial scene adaptation ((c) Source Only, (d) Adapted (Ours)). After applying our adaptation method, the target samples are more discriminative.

This hardware also works for the ground-to-aerial scene adaptation task. We select ResNet-101 as the base network for this task. All comparison methods are trained until convergence.


4.2.2 Results. Table 1 reports our results together with the results obtained in previous studies. To make the comparison fair, we directly compare against the results reported in previous papers. As this part is only for method verification, we only compare with the representative methods DAN [18] and DANN [9], our baseline method MCD [8], and the most recently proposed methods HAFN [36] and SAFN [36]. We can see from the table that DuAN achieves the best performance in terms of average accuracy, followed by SAFN, HAFN, MCD, and the others. DANN and DAN also obtain excellent performance for specific classes such as Bus and Train.

4.3 Satellite-to-aerial Scene Adaptation
4.3.1 Setup. We run the experiments on a local machine, as the dataset for satellite-to-aerial scene adaptation is not large. For the hardware, the CPU is an Intel® Core™ i7-8700K, and the GPU is an NVIDIA GEFORCE GTX 1080 Ti. For this task, we adopt ResNet-101 as our base network. We implement all comparison methods ourselves, including the first adversarial domain adaptation work DANN [9]; the recent SOTA methods PADA [3], MEDA [32], JADA [15], HAFN [37], and SAFN [37] based on DANN; as well as three SOTA task-specific methods MCD [8], DTA [14], and SWD [13]. SWD [13] and DTA [14] are generally modifications of MCD [8]; therefore, we choose MCD as our major comparison method. We provide not only a detailed accuracy comparison for each method, but also a visualized t-SNE comparison of the target data before (source only) and after adaptation by our method, as shown in Fig. 5. All methods are trained for 100 epochs, as the testing accuracy of every method has converged by then. The model trained at every ten epochs is tested directly on the target domain data without validation, as the dataset is not large, and the best performance is reported for comparison.

4.3.2 Results. As shown in Table 2, the proposed DuAN achieves the best overall accuracy, followed by MCD and SAFN. The accuracies of DANN and the source-only baseline are both below 25%. We also find that the Building class is the most difficult to classify, as it is easily confused with the Residential class; for our method, however, the accuracy on this class is 100%. By comparing accuracy with the source-only baseline, we can see that the two types of remote sensing scenes can be aligned by domain adaptation, which shows that information can be shared and exchanged between different types of remote sensing images. The same conclusion can be drawn from the t-SNE comparison in Fig. 5: although the target samples do not separate well in the non-adapted case, they separate clearly after adaptation. This demonstrates the significance of the proposed satellite-to-aerial adaptation task, as transferring information between these two types of images can help with their classification.
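The t-SNE plots of Fig. 5 can be reproduced in spirit with a short scikit-learn snippet such as the one below; the perplexity, the feature source, and the plotting details are our assumptions, since the paper does not report its t-SNE settings.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    # features: (N, D) array of target-domain features (e.g., from G2, or from a source-only model)
    # labels:   (N,) integer class labels, used only for coloring the points
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.figure(figsize=(5, 5))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="tab20")
    plt.title(title)
    plt.axis("off")
    plt.show()

# e.g., plot_tsne(adapted_features, target_labels, "Adapted (Ours)")
```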

4.4 Ground-to-aerial Scene Adaptation
4.4.1 Setup. For this task, we use ResNet-101 as the base network. We show the detailed accuracy comparison in Table 3. The number of training epochs is always set to 30, as all settings converge by this epoch number. For all comparison methods and the proposed method, as the target domain dataset is large, we use 5% randomly selected target domain data for validation and the rest for testing.

Figure 6: The classification accuracies on validation data.

The trained model parameters with the lowest loss in the validation phase are used for testing. We also provide a detailed visualized t-SNE comparison of the results in Fig. 5.

4.4.2 Results. As noted in Table 3, the overall accuracy (OA) of the proposed method is 64.40%, much higher than that of the other methods, which are mostly below 50%. In the table, basketball court, baseball field, water park, parking lot, and parking space are abbreviated as Basketball., Baseball., Water., Parking.L, and Parking.S, respectively. We want to provide two observations for this result. First, the baseball field class is indoor for the ground scene but outdoor for the aerial scene; therefore, this class has a larger domain gap than the other classes, and all methods fail and obtain their lowest classification accuracy on it. Second, the source domain data may be more discriminative than the target domain data. Two representative classes are the swimming pool and the basketball court, for which aerial-view data are easily mistaken for the water park and the golf field, respectively. For such classes, losing the discriminative distribution of the target domain data during the adaptation process might even result in better performance, which can explain our failure on these classes compared with other methods. This observation also indicates that the low accuracies come from misclassification rather than random data noise. The t-SNE comparison between the adapted result and the source-only result demonstrates the effectiveness of domain adaptation.

4.4.3 Model Training Observations. We take the ground-to-aerial adaptation task as an example to demonstrate the advantage of our proposed method in terms of model training. Fig. 6 shows the changes in classification accuracy on the validation data at different epochs. For this task, the training time with the hardware settings mentioned in Sec. 4.2.1 is 10.5 mins/epoch. We also use our baseline method MCD [8] for comparison, whose training time is 10.1 mins/epoch. For both DuAN and MCD, we use the model parameters trained at the epochs with the highest validation accuracies for testing. We would like to mention two observations. First, from the perspective of convergence of classification, due to our stepwise model training, the classification result of the proposed DuAN stops changing at the 5th epoch, while the result of MCD takes much longer to converge. Also, at the first epoch, DuAN already yields a fairly high accuracy. We need to point out that, for almost all UDA methods, the accuracy is highest at the first or second epoch and then decreases slightly.


Table 3: Accuracy (%) results for the Ground-to-aerial Scene Adaptation task with ResNet-101 as the base network

Method    | Airplane | Baseball. | Basketball. | Beach | Bridge | Crosswalk | Forest | Golf  | Harbor | Parking.L | Parking.S | Residential | Runway | Swimming | Water. | Average
Source    | 0.25     | 0.38      | 6.62        | 0.00  | 27.25  | 49.88     | 70.88  | 0.00  | 1.50   | 0.25      | 0.12      | 0.00        | 0.00   | 17.38    | 2.00   | 11.77
DANN [9]  | 35.38    | 1.00      | 17.50       | 0.00  | 0.25   | 49.25     | 0.00   | 5.62  | 0.00   | 1.00      | 0.25      | 0.12        | 0.50   | 41.38    | 0.12   | 10.16
PADA [3]  | 39.43    | 0.26      | 25.03       | 66.89 | 52.46  | 43.24     | 21.58  | 46.37 | 3.43   | 28.34     | 21.57     | 13.44       | 2.94   | 4.66     | 1.25   | 24.73
MEDA [32] | 39.63    | 0.13      | 5.87        | 82.72 | 70.82  | 3.85      | 96.13  | 72.24 | 43.31  | 28.51     | 36.46     | 64.04       | 32.53  | 10.30    | 1.21   | 39.18
JADA [15] | 90.15    | 0.33      | 8.32        | 91.63 | 96.82  | 26.53     | 96.37  | 92.47 | 41.39  | 65.44     | 33.56     | 61.04       | 31.53  | 10.30    | 2.57   | 49.90
HAFN [37] | 89.36    | 0.47      | 2.03        | 63.45 | 87.82  | 5.68      | 94.67  | 89.46 | 61.48  | 62.73     | 40.51     | 54.01       | 20.33  | 4.34     | 0.62   | 45.13
SAFN [37] | 92.14    | 0.85      | 6.83        | 68.70 | 45.41  | 56.56     | 81.33  | 44.44 | 82.60  | 79.40     | 42.56     | 91.34       | 35.43  | 75.50    | 2.44   | 54.70
MCD [8]   | 71.38    | 0.38      | 0.38        | 100.0 | 91.38  | 0.00      | 100.0  | 99.62 | 0.75   | 45.12     | 44.50     | 83.50       | 40.62  | 71.27    | 1.75   | 45.73
SWD [13]  | 50.04    | 0.27      | 5.03        | 80.72 | 77.82  | 0.00      | 94.67  | 82.12 | 3.31   | 29.53     | 46.46     | 61.04       | 40.53  | 15.30    | 1.57   | 39.09
DTA [14]  | 87.11    | 0.34      | 3.63        | 81.42 | 69.49  | 51.08     | 96.39  | 72.44 | 35.43  | 79.44     | 49.56     | 86.04       | 41.42  | 8.41     | 0.91   | 50.87
DuAN      | 99.38    | 0.25      | 4.62        | 100.0 | 2.50   | 100.0     | 100.0  | 97.88 | 99.88  | 100.0     | 75.00     | 96.50       | 89.38  | 0.12     | 0.62   | 64.40

Second, from the perspective of convergence of adaptation, the discrepancy between $C_1$ and $C_2$ decreases much faster in DuAN than in MCD. As each domain in DuAN is assigned a specific task, the classifiers reach consistent results much faster than in MCD. This suggests that the adaptation process converges much faster with the stepwise training, and we can obtain consistent task-specific classification results in a much shorter time.

4.5 Ablation Study
In the ablation study, we not only verify the effectiveness of each component of our network, but also compare different parameter settings of $\alpha/\beta$ ($\alpha_1/\beta_1$ in Eq. 1 and $\alpha_2/\beta_2$ in Eq. 7). We choose both the Satellite-to-Aerial scene adaptation task (StA-DA) and the Ground-to-Aerial scene adaptation task (GtA-DA) for this experiment, and use the average classification accuracy as the evaluation criterion.

The experimental results of the ablation study on each network component can be found in Table 4. At the model level, we separate our model into three parts: the domain-specific feature generators (D-SFG; for the ablation of this component we compare against a single common feature generator), the domain discriminator (DD), and the two classifiers (TC; for the ablation of this component we compare against a single classifier). At the loss level, our network mainly includes the feature adversarial loss $\mathcal{L}_{d_1}$, the classifier discrepancy loss $\mathcal{L}_{d_2}$, and the cross-entropy loss $\mathcal{L}_t$ ($\mathcal{L}_t = \mathcal{L}_{t_1} = \mathcal{L}_{t_2}$, as described in Sec. 3.3). Note that DD and $\mathcal{L}_{d_1}$ cannot be separated; the same holds for TC and $\mathcal{L}_{d_2}$.

As shown in the table, the first setting can be regarded as regular domain adversarial adaptation with a common feature generator, a domain discriminator, and a classifier. The second setting is the same as the MCD method, and in the third we add the domain-specific feature generators. The last row is the setting of the proposed method.

For the selection of the parameter $\alpha/\beta$, a comparison can be found in Table 5, where representative values of $\alpha/\beta$ are compared.

Table 4: Ablation study of the model components and losses (average accuracy, %).

D-SFG | DD | TC | $L_t$ | $L_{d_1}$ | $L_{d_2}$ | StA-DA | GtA-DA
×     | ✓  | ×  | ✓     | ✓         | ×         | 24.22  | 10.16
×     | ×  | ✓  | ✓     | ×         | ✓         | 85.33  | 45.73
✓     | ×  | ✓  | ✓     | ×         | ✓         | 82.00  | 53.16
✓     | ✓  | ✓  | ✓     | ✓         | ✓         | 91.11  | 64.40

We note that other values of $\alpha/\beta$ (e.g., 0.5) are possible, but the performance does not change much. Therefore, based on the results in Table 5, we set $\alpha/\beta$ to 0.1.

Table 5: Classification accuracy (%) comparison for different $\alpha/\beta$.

$\alpha/\beta$ | 0.001 | 0.1   | 1     | 10    | 100
StA-DA         | 46.00 | 91.11 | 86.44 | 82.70 | 77.22
GtA-DA         | 23.54 | 64.40 | 55.63 | 54.21 | 46.37

5 CONCLUSION
In this paper, we propose a novel adversarial domain adaptation model, named the Dual Adversarial Network (DuAN), motivated by the idea that the source and target domain data should not be treated in the same way in domain adaptation. Different from previous methods, we propose a domain-specific strategy for the feature adaptation and the classification task, in order to relieve the loss of discriminative characteristics of the target domain data during the adaptation process. The model is optimized in a stepwise manner. We also propose a novel “Ground/Satellite-to-Aerial Scene Adaptation" task. This adaptation task represents a highly challenging and practical scenario with a larger domain gap than traditional domain adaptation tasks, and such adaptation can help to tackle the automatic annotation problem for remote sensing data. The superior experimental results on both the VisDA 2017 challenge and the GSSA tasks demonstrate the effectiveness of our proposed method.


REFERENCES
[1] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. 2016. Domain Separation Networks. In NIPS. 343–351.
[2] Z. Cao, M. Long, J. Wang, and M. I. Jordan. 2018. Partial Transfer Learning with Selective Adversarial Networks. In CVPR.
[3] Zhangjie Cao, Lijia Ma, Mingsheng Long, and Jianmin Wang. 2018. Partial adversarial domain adaptation. In ECCV. 135–150.
[4] F. Maria Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. Rota Bulo. 2017. Autodial: Automatic domain alignment layers. In CVPR. 5067–5075.
[5] Woong-Gi Chang, Tackgeun You, Seonguk Seo, Suha Kwak, and Bohyung Han. 2019. Domain-Specific Batch Normalization for Unsupervised Domain Adaptation. In CVPR. 7354–7362.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR. 248–255.
[7] Zhipeng Deng, Hao Sun, and Shilin Zhou. 2018. Semi-Supervised Ground-to-Aerial Adaptation with Heterogeneous Features Learning for Scene Classification. ISPRS International Journal of Geo-Information 7, 5 (2018), 182.
[8] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. 2018. Maximum Classifier Discrepancy for Unsupervised Domain Adaptation. In CVPR. 3723–3732.
[9] Y. Ganin and V. Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In ICML. 1180–1189.
[10] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. 2018. CyCADA: Cycle-Consistent Adversarial Domain Adaptation.
[11] Vinod Kumar Kurmi, Shanu Kumar, and Vinay P. Namboodiri. 2019. Attending to Discriminative Certainty for Domain Adaptation. In CVPR. 491–500.
[12] Seiichi Kuroki, Nontawat Charoenphakdee, Han Bao, Junya Honda, Issei Sato, and Masashi Sugiyama. 2019. Unsupervised domain adaptation based on source-guided discrepancy. In AAAI. 4122–4129.
[13] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. 2019. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In CVPR. 10285–10295.
[14] Seungmin Lee, Dongwan Kim, Namil Kim, and Seong-Gyun Jeong. 2019. Drop to Adapt: Learning Discriminative Features for Unsupervised Domain Adaptation. In ICCV. 91–100.
[15] Shuang Li, Chi Harold Liu, Binhui Xie, Limin Su, Zhengming Ding, and Gao Huang. 2018. Joint Adversarial Domain Adaptation. In ACM MM. 729–737.
[16] Yanghao Li, Naiyan Wang, Jianping Shi, Xiaodi Hou, and Jiaying Liu. 2018. Adaptive batch normalization for practical domain adaptation. PR 80 (2018), 109–117.
[17] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou. 2016. Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779 (2016).
[18] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. 2015. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791 (2015).
[19] M. Long, Z. Cao, J. Wang, and M. I. Jordan. 2018. Conditional Adversarial Domain Adaptation. In NIPS. 1640–1650.
[20] Ping Luo, Fuzhen Zhuang, Hui Xiong, Yuhong Xiong, and Qing He. 2008. Transfer learning from multiple source domains via consensus regularization. In CIKM. 103–112.
[21] Yawei Luo, Liang Zheng, Tao Guan, Junqing Yu, and Yi Yang. 2019. Taking a Closer Look at Domain Shift: Category-Level Adversaries for Semantics Consistent Domain Adaptation. In CVPR. 2507–2516.
[22] Xinhong Ma, Tianzhu Zhang, and Changsheng Xu. 2019. GCAN: Graph Convolutional Adversarial Network for Unsupervised Domain Adaptation. In CVPR. 8266–8276.
[23] L. van der Maaten and G. Hinton. 2008. Visualizing data using t-SNE. JMLR 9, 6 (2008), 2579–2605.
[24] Z. Pei, Z. Cao, M. Long, J. Wang, and J. Wang. 2018. Multi-Adversarial Domain Adaptation. In AAAI.
[25] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. 2018. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR. 8503–8512.
[26] B. Sun, J. Feng, and K. Saenko. 2016. Return of Frustratingly Easy Domain Adaptation.
[27] Hao Sun, Zhipeng Deng, Shuai Liu, and Shilin Zhou. 2016. Transferring ground level image annotations to aerial and satellite scenes by discriminative subspace alignment. In IGARSS. 2292–2295.
[28] Hao Sun, Shuai Liu, Shilin Zhou, and Huanxin Zou. 2015. Transfer sparse subspace analysis for unsupervised cross-view scene model adaptation. IEEE JSTARS 9, 7 (2015), 2901–2909.
[29] Hao Sun, Shuai Liu, Shilin Zhou, and Huanxin Zou. 2015. Unsupervised cross-view semantic transfer for remote sensing image classification. IEEE GRSL 13, 1 (2015), 13–17.
[30] Youssef Tamaazousti, Hervé Le Borgne, Céline Hudelot, Mohamed El Amine Seddik, and Mohamed Tamaazousti. 2019. Learning more universal representations for transfer-learning. IEEE T-PAMI (2019).
[31] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. 2017. Adversarial discriminative domain adaptation. In CVPR. 7167–7176.
[32] Jindong Wang, Wenjie Feng, Yiqiang Chen, Han Yu, Meiyu Huang, and Philip S. Yu. 2018. Visual domain adaptation with manifold embedded distribution alignment. In ACM MM. 402–410.
[33] Scott Workman, Richard Souvenir, and Nathan Jacobs. 2015. Wide-area image geolocalization with aerial reference imagery. In ICCV. 3961–3969.
[34] Gui-Song Xia, Wen Yang, Julie Delon, Yann Gousseau, Hong Sun, and Henri Maître. 2010. Structural high-resolution satellite image indexing. In ISPRS.
[35] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. 2010. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR. 3485–3492.
[36] R. Xu, G. Li, J. Yang, and L. Lin. 2019. Larger Norm More Transferable: An Adaptive Feature Norm Approach for Unsupervised Domain Adaptation. In ICCV.
[37] Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. 2019. Larger Norm More Transferable: An Adaptive Feature Norm Approach for Unsupervised Domain Adaptation. In ICCV. 1426–1435.
[38] Yi Yang and Shawn Newsam. 2010. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems. 270–279.
[39] Yongchun Zhu, Fuzhen Zhuang, and Deqing Wang. 2019. Aligning Domain-Specific Distribution and Classifier for Cross-Domain Classification from Multiple Sources. In AAAI. 5989–5996.
