CNN-based Semantic Segmentation using Level Set Loss

Youngeun Kim ∗ (KAIST), Seunghyeon Kim ∗ (KAIST), Taekyung Kim (KAIST), Changick Kim (KAIST)

{youngeunkim, seunghyeonkim, tkkim93, changick}@kaist.ac.kr

∗ These two authors contributed equally.

arXiv:1910.00950v1 [cs.CV] 2 Oct 2019

Abstract

These days, Convolutional Neural Networks (CNNs) are widely used in semantic segmentation. However, since CNN-based segmentation networks produce low-resolution outputs with rich semantic information, it is inevitable that spatial details (e.g., small objects and fine boundary information) of segmentation results will be lost. To address this problem, motivated by a variational approach to image segmentation (i.e., level set theory), we propose a novel loss function called the level set loss, which is designed to refine spatial details of segmentation results. To deal with multiple classes in an image, we first decompose the ground truth into binary images. Note that each binary image consists of background and regions belonging to a class. Then we convert the class probability maps into level set functions and calculate the energy for each class. The network is trained to minimize the weighted sum of the level set loss and the cross-entropy loss. The proposed level set loss improves the spatial details of segmentation results in a time and memory efficient way. Furthermore, our experimental results show that the proposed loss function achieves better performance than previous approaches.

1. Introduction

Semantic segmentation, which allocates a semantic label to each pixel in an image, is one of the most challenging tasks in computer vision. Traditional image segmentation methods [3, 25, 6] struggle with this task since they segment objects without semantic information. Convolutional Neural Networks (CNNs) [20, 18, 33] provided a breakthrough for the semantic segmentation task. Fully Convolutional Networks (FCNs) [24, 7] based on the CNN architecture are widely used thanks to their outstanding performance on semantic segmentation. However, as mentioned in [8], there are two challenges in CNN-based semantic segmentation networks: (1) consecutive pooling or striding reduces the feature resolution; (2) the networks are not aware of small objects. Using dense CRFs [19] as a post-processing step or modifying the network architecture with additional modules [24, 35, 9] are common solutions to these problems, but these approaches can be time-consuming and memory intensive [34].

Figure 1: (a) Image. (b) Ground truth. (c) DeepLab [7]. (d) Level set loss (ours). Our loss alleviates the problems of semantic segmentation networks. First, the proposed loss refines the boundaries of objects and fills in missing parts of segmentation results (top and middle rows). Second, the level set loss encourages the network to segment small objects (bottom row).

To overcome the aforementioned problems, in this paper, we introduce a loss function that utilizes spatial correlation in the ground truth. Until recently, most semantic segmentation frameworks have used the cross-entropy loss. However, the cross-entropy function calculates the loss at each pixel independently. This is not desirable since segmentation network outputs are dense probability maps that contain semantic relations among pixels.

We adopt the level set theory [6] to take the spatial correlation information of the ground truth into account. However, since the conventional level set function only separates the foreground and background of an image (i.e., single-class level set), it is hard to apply the level set to a multi-class image. To address this limitation, we separate the ground truth into binary images, one for each class. We also note that the network outputs consist of class probability maps. For each class, by defining the probability map as a level set function for the binary image (i.e., class ground truth image), we can divide the multi-class level set problem into a number of single-class level set problems. We exploit the level set energy function as the loss function in the training process. As shown in Fig. 1, segmentation results with more precise boundaries can be achieved by minimizing our level set loss.

Our contributions can be summarized as follows: i) We integrate the traditional level set method with a deep learning architecture. To the best of our knowledge, ours is the first method that applies the level set theory to a CNN-based segmentation network. ii) Compared to conventional approaches, the proposed end-to-end training method removes the additional effort (e.g., post-processing or network development) needed to prevent the loss of spatial details. iii) The proposed level set loss achieves better performance than previous approaches that propose loss functions for semantic segmentation networks. iv) To verify the generality and efficiency of our level set loss, we apply the loss to two typical CNN-based architectures, FCN [24] and DeepLab [7]. Experimental results show that the level set loss helps the segmentation network to learn spatial information of the objects.

The paper is organized as follows. Section 2 presents the related work. Section 3 describes the details of the level set loss. Section 4 presents the experimental results on widely used semantic segmentation networks. At the end of this paper, we conclude with our results and point out future work.

2. Related Work

2.1. Level Set Methods for Image Segmentation

Active contour models, also known as snakes [17], evolve a contour to detect objects in a given image using partial differential equations. However, parametric snake models are sensitive to noise and initial contour location. Among various approaches to solve this problem, level set methods are popular and widely used. The main idea of level set methods is to find the level set function that minimizes the energy function. As a result, the object boundary can be obtained as the zero level set. Previous methods [27, 2] calculate the energy function based on edge information, which is usually sensitive to noise. To address this problem, a region-based level set method [6] proposes an energy function that sums the pixel-intensity variance inside and outside the contour. The level set function is updated by iterative minimization of the energy function. Finally, the zero level set represents the object boundary. Since the region-based level set method is robust to image conditions such as noise or the initial contour, this approach yields better segmentation results than the prior methods.

2.2. Semantic Segmentation

Thanks to advances in convolutional neural networks (CNNs), recent works achieve highly accurate results even in complex images [12, 22]. However, there are two major problems with deep learning based segmentation approaches. One is the reduction of feature resolution due to consecutive pooling layers. The other is the object size variability in realistic and complex images [8].

Recent works propose ways to deal with these problems. FCN [24] aims to address the problems by using an encoder-decoder structure. The decoder part recovers the object details and spatial information. U-Net [30] attaches skip connections between the features of the encoder and decoder. SegNet [4] stores pooling indices and reuses them in the decoder part. DeepLab-v2 [7] proposes atrous spatial pyramid pooling, which adds multi-scale information from a parallel structure. PSPNet [35] improves the performance via a pyramid pooling module, which extracts global context information by aggregating different region-based contexts. DeepLab-v3+ [10] attaches a decoder part after the atrous spatial pyramid pooling module. The encoder part with atrous convolution extracts rich semantic features and the decoder part recovers the object boundary.

Along with the development of networks, several techniques have also been proposed to handle the challenges in semantic segmentation. Pre-trained weight initialization is important for achieving satisfactory performance. This technique gives the network prior knowledge of object classification. Most semantic segmentation networks use weights pre-trained on ImageNet [31], and some of them also use COCO [22] and JFT [15] for a higher score. Data augmentation is also widely used to increase the amount of data and to avoid local minima. These techniques are more critical for semantic segmentation due to the lack of data. Conditional Random Fields (CRFs) [7, 19] are the common post-processing method for boundary refinement. This method improves performance especially along the boundaries of the objects, but is sensitive to its parameters. Also, multi-scale inputs (MSC) [9] provide a performance increase. For multi-scale processing, inputs are given to the network at scales 0.5, 0.75, and 1, and the output is determined by selecting the maximum output value across scales for each pixel.
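The max-fusion step of multi-scale processing can be illustrated with a small sketch. It assumes that the per-scale score maps have already been resized to a common resolution; the function name `fuse_max` and the score values are hypothetical and only serve the illustration.

```python
def fuse_max(score_maps):
    """Element-wise maximum over a list of equally sized 2-D score maps."""
    rows, cols = len(score_maps[0]), len(score_maps[0][0])
    return [[max(m[r][c] for m in score_maps) for c in range(cols)]
            for r in range(rows)]


# Hypothetical scores of one class at scales 0.5, 0.75 and 1,
# assumed to be resized to the same 2x2 grid beforehand.
s050 = [[0.2, 0.7], [0.1, 0.4]]
s075 = [[0.3, 0.6], [0.5, 0.2]]
s100 = [[0.1, 0.9], [0.3, 0.3]]

fused = fuse_max([s050, s075, s100])
print(fused)  # [[0.3, 0.9], [0.5, 0.4]]
```

Each pixel simply keeps the strongest response seen at any scale, which is why this fusion needs no extra parameters.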

2.3. Level Set with Deep Learning Framework

Few works adopt the idea of the level set in a deep learning framework. In other fields, [16] proposes a deep level set network for salient object detection. They use the level set theory to refine the saliency maps. Furthermore, they apply super-pixel filtering to help the refinement. Since their method works only with saliency maps, the approach is hard to apply to multi-class and multi-object images.

Figure 2: The training scheme of our level set loss. Our loss can be applied to an arbitrary CNN-based segmentation network. The output probability maps are shifted into [-0.5, 0.5] so that they work as the level set function φ. We also decompose the ground truth into binary maps of each class (white is 1 and black is 0). For each class, we calculate the level set energy. We treat the sum of the energies as the loss function for training the segmentation network.

For semantic segmentation, [32] proposes Contextual Recurrent Level Set (CRLS). In this work, the curve evolution is presented as a time series, and the level set method is reformulated as a Recurrent Neural Network (RNN). Since the level set method is hard to apply to multi-class images, they utilize an object detection network to obtain single-object images. While this method needs an auxiliary network, our approach can improve the performance without any additional networks or architectural changes.

2.4. Loss Functions for CNNs

Deep learning frameworks usually use the cross-entropy loss as a cost function. The cross-entropy loss gives satisfactory results on various tasks (e.g., classification, object detection, semantic segmentation). However, the cross-entropy loss has some shortcomings when it is used for object detection or semantic segmentation. Recently, there have been several attempts to increase the performance of scene understanding by adding loss terms or changing the structure of loss layers.

In object detection, [23] suggests the Focal Loss to solve the extreme foreground-background class imbalance problem in one-stage object detection. The Focal Loss modifies the cross-entropy loss to focus on hard negative examples. With RetinaNet, proposed in [23], this loss shows an improvement in object detection. For the semantic segmentation task, [5] also tries to deal with the class imbalance problem. By applying a max-pooling concept to the loss, they re-weight the individual pixels based on the values of their losses. [13] addresses the problem of the pixel-wise loss, as we do in this paper. By using a locally adaptive learning estimator, they force the network to learn inter-class and intra-class discrimination.

3. Our Method

In this section, we introduce our novel loss for CNN-based semantic segmentation networks. We first review the classic level set method [6], which is important for understanding our work. Then, our proposed loss is described in detail.


3.1. Region-based Level Set Methods

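The authors of [6] propose the region-based level set method for image segmentation. The method minimizes the energy function defined by

$$
\begin{aligned}
F(c_1, c_2, \phi) ={}& \mu \cdot \mathrm{Length}(\phi) + \nu \cdot \mathrm{Area}(\phi) \\
&+ \lambda_1 \int_{\Omega} |u_0(x, y) - c_1|^2 \, H(\phi(x, y)) \, dx \, dy \\
&+ \lambda_2 \int_{\Omega} |u_0(x, y) - c_2|^2 \, \bigl(1 - H(\phi(x, y))\bigr) \, dx \, dy,
\end{aligned}
\tag{1}
$$

where $\mu \ge 0$, $\nu \ge 0$, $\lambda_1, \lambda_2 > 0$ are fixed parameters, $\Omega$ is the entire domain of the given image, $u_0(x, y)$ is the pixel value at location $(x, y) \in \Omega$, and $\phi$ is the level set function. $\mathrm{Length}(\phi)$ and $\mathrm{Area}(\phi)$ are regularization terms with respect to the length and the inside area of the contour. $H$ is the Heaviside Function (HF),

$$
H(z) =
\begin{cases}
1, & z \ge 0 \\
0, & z < 0.
\end{cases}
\tag{2}
$$

$c_1$ and $c_2$ are constants that depend on $\phi$ and indicate the mean pixel value of the interior and exterior of the contour, respectively:

$$
c_1(\phi) = \frac{\int_{\Omega} u_0(x, y) \, H(\phi(x, y)) \, dx \, dy}{\int_{\Omega} H(\phi(x, y)) \, dx \, dy},
\qquad
c_2(\phi) = \frac{\int_{\Omega} u_0(x, y) \, \bigl(1 - H(\phi(x, y))\bigr) \, dx \, dy}{\int_{\Omega} \bigl(1 - H(\phi(x, y))\bigr) \, dx \, dy}.
\tag{3}
$$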
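To make eqs. (1)-(3) concrete, the following minimal sketch evaluates the two region terms on a tiny discrete image, with the integrals replaced by sums over pixels. It assumes µ = ν = 0 (so the Length and Area terms vanish) and λ1 = λ2 = 1; the function names and the toy image are hypothetical, and this is not the implementation of [6].

```python
def heaviside(z):
    """Hard Heaviside function of eq. (2)."""
    return 1.0 if z >= 0 else 0.0


def region_means(u0, phi):
    """Mean pixel value inside (c1) and outside (c2) the contour, eq. (3)."""
    pairs = [(v, heaviside(p)) for row_u, row_p in zip(u0, phi)
             for v, p in zip(row_u, row_p)]
    n_in = sum(h for _, h in pairs)
    n_out = len(pairs) - n_in
    # Assumption: an empty region gets mean 0.0 to avoid division by zero.
    c1 = sum(v * h for v, h in pairs) / n_in if n_in else 0.0
    c2 = sum(v * (1.0 - h) for v, h in pairs) / n_out if n_out else 0.0
    return c1, c2


def region_energy(u0, phi, lam1=1.0, lam2=1.0):
    """Region terms of eq. (1); Length and Area are omitted (mu = nu = 0)."""
    c1, c2 = region_means(u0, phi)
    energy = 0.0
    for row_u, row_p in zip(u0, phi):
        for v, p in zip(row_u, row_p):
            h = heaviside(p)
            energy += lam1 * (v - c1) ** 2 * h + lam2 * (v - c2) ** 2 * (1.0 - h)
    return energy


image = [[10.0, 10.0, 0.0],
         [10.0, 10.0, 0.0]]
phi_aligned = [[1.0, 1.0, -1.0],   # contour matches the bright region
               [1.0, 1.0, -1.0]]
phi_shifted = [[1.0, -1.0, -1.0],  # contour cuts through the bright region
               [1.0, -1.0, -1.0]]

print(region_energy(image, phi_aligned))  # 0.0
print(region_energy(image, phi_shifted))  # 100.0
```

The energy is zero only when the contour separates the two homogeneous regions, which is the property the level set loss later reuses with the ground truth in place of the image.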

3.2. Deep Level Set Loss

We utilize the energy function of [6] for the semantic segmentation task within a deep learning framework. Since the classic level set method is limited in representing semantic information, we decompose multi-class semantic segmentation into several single-class segmentation problems by reconstructing a dense binary ground truth for each class. For a given input image, let $G$ be the given segmentation ground truth and $L$ be the set of classes that exist in the image. We generate the reconstructed ground truth $G_l$ for class $l \in L$ by keeping the region of class $l$ in $G$ as foreground and replacing everything else with background. Note that we also generate a binary dense ground truth $G_0$ for the background class in $L$.
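The decomposition of the ground truth into per-class binary maps can be sketched in a few lines. The function name `decompose_ground_truth` and the toy label map are hypothetical; label 0 is assumed to denote the background class.

```python
def decompose_ground_truth(label_map):
    """Return {class l: binary map G_l} for every class present in label_map."""
    classes = sorted({label for row in label_map for label in row})
    return {l: [[1 if label == l else 0 for label in row] for row in label_map]
            for l in classes}


# Toy label map with background (0) and two object classes (8 and 15).
G = [[0, 0, 15],
     [0, 8, 15]]

binary_maps = decompose_ground_truth(G)
print(binary_maps[0])   # [[1, 1, 0], [1, 0, 0]]
print(binary_maps[8])   # [[0, 0, 0], [0, 1, 0]]
print(binary_maps[15])  # [[0, 0, 1], [0, 0, 1]]
```

Only classes that actually occur in the image produce a map, which matches the definition of the set L above.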

Figure 3: Visualization of the level set function (i.e., probability map) of class Bird (blue is 0 and red is 1). (a) Image. (b) Ground truth. (c) CE loss only. (d) With LS loss. Compared to (c), the network trained with the level set loss (d) provides a clearer boundary. Best viewed in color.

To apply the classic level set method to deep learning, we first set the parameters in eq. (1) as done in [6]: $\nu = 0$, $\lambda_1 = \lambda_2 = 1$. Also, as mentioned in [6], the $\mathrm{Length}(\phi)$ term is sensitive to the size of the object. Since the input images contain objects of multiple sizes, we set $\mu$ to zero. Our proposed level set loss is formulated as follows:

$$
E_{LS}(\phi, G) = \sum_{l \in L} \left(
\int_{\Omega_l} |G_l(x, y) - c_{l,1}|^2 \, H^*_\epsilon(\phi_l(x, y)) \, dx \, dy
+ \int_{\Omega_l} |G_l(x, y) - c_{l,2}|^2 \, \bigl(1 - H^*_\epsilon(\phi_l(x, y))\bigr) \, dx \, dy
\right),
\tag{4}
$$

where $\Omega_l$ is the entire domain of $G_l$, and the level set function $\phi_l$ is a shifted dense probability map estimated by the segmentation network, e.g., $\phi_l(x, y) = P_l(x, y) - 0.5 \in [-0.5, 0.5]$ with the output probability map $P_l$ for class $l$. Note that we apply the energy function to the reconstructed dense ground truth for each existing class instead of the input image. Since the objects in an image may have high color variance, it is not desirable to apply the energy function directly to the image. So, we replace $u_0$ with $G_l$ for a reliable training process. $c_{l,1}$ and $c_{l,2}$ represent the average intensity of the binary ground truth map $G_l$ inside and outside the contour, respectively.

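$$
c_{l,1}(\phi) = \frac{\int_{\Omega} G_l(x, y) \, H^*_\epsilon(\phi_l(x, y)) \, dx \, dy}{\int_{\Omega} H^*_\epsilon(\phi_l(x, y)) \, dx \, dy},
\qquad
c_{l,2}(\phi) = \frac{\int_{\Omega} G_l(x, y) \, \bigl(1 - H^*_\epsilon(\phi_l(x, y))\bigr) \, dx \, dy}{\int_{\Omega} \bigl(1 - H^*_\epsilon(\phi_l(x, y))\bigr) \, dx \, dy}.
\tag{5}
$$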
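Eqs. (4) and (5) can be sketched on toy data as follows, with sums over pixels in place of the integrals. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the small constant `tiny` in the denominators is an added assumption to avoid division by zero, and the smoothed Heaviside function is the tanh form given in eq. (6) with ε = 1/20 as used in Section 4.6.

```python
import math


def mahf(z, eps):
    """Smoothed Heaviside of eq. (6): 0.5 * (1 + tanh(z / eps))."""
    return 0.5 * (1.0 + math.tanh(z / eps))


def class_energy(prob_map, gt_map, eps=0.05):
    """Energy of eq. (4) for one class, using the means of eq. (5)."""
    h = [mahf(p - 0.5, eps) for row in prob_map for p in row]  # phi = P - 0.5
    g = [float(v) for row in gt_map for v in row]
    tiny = 1e-8  # assumption: guards against an empty inside or outside region
    c1 = sum(gi * hi for gi, hi in zip(g, h)) / (sum(h) + tiny)
    c2 = (sum(gi * (1.0 - hi) for gi, hi in zip(g, h))
          / (sum(1.0 - hi for hi in h) + tiny))
    return sum((gi - c1) ** 2 * hi + (gi - c2) ** 2 * (1.0 - hi)
               for gi, hi in zip(g, h))


def level_set_loss(prob_maps, gt_maps, eps=0.05):
    """Sum of the per-class energies over the classes present in the image."""
    return sum(class_energy(prob_maps[l], gt_maps[l], eps) for l in gt_maps)


# One foreground class (id 1) with a sharp and a blurry probability map.
gt = {1: [[1, 1, 0], [1, 0, 0]]}
sharp = {1: [[0.95, 0.90, 0.10], [0.90, 0.05, 0.10]]}
blurry = {1: [[0.60, 0.55, 0.45], [0.50, 0.50, 0.40]]}

print(level_set_loss(sharp, gt))   # close to 0
print(level_set_loss(blurry, gt))  # roughly 1
```

A confident map that agrees with the ground truth yields an energy near zero, while an uncertain map is penalized even where its argmax would be correct.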

Figure 4: Two losses and mIOU over training iterations. (a) CE loss: as expected, the cross-entropy loss decreases. (b) LS loss: the LS loss also decreases as the mIOU score increases. (c) mIOU: a model with higher mIOU has lower LS loss.

In our method, we propose the Modified Approximated Heaviside Function (MAHF) $H^*_\epsilon$. We note that using tanh as an activation function shows high performance for deep learning architectures [21, 29]. Therefore, in contrast to the original Approximated Heaviside Function (AHF) proposed in [6], we adopt tanh to achieve an effect similar to that of the Heaviside Function, eq. (2). As expected, we observe that employing tanh performs better than using HF and the original AHF.

$$
H^*_\epsilon(z) = \frac{1}{2}\left(1 + \tanh\left(\frac{z}{\epsilon}\right)\right).
\tag{6}
$$
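A short sketch compares the hard Heaviside function of eq. (2) with the MAHF of eq. (6) over the range of the shifted probability map, [-0.5, 0.5]. The value ε = 0.05 follows the setting reported in Section 4.6; the function names are hypothetical.

```python
import math


def heaviside(z):
    """Hard Heaviside function of eq. (2)."""
    return 1.0 if z >= 0 else 0.0


def mahf(z, eps=0.05):
    """Modified Approximated Heaviside Function of eq. (6)."""
    return 0.5 * (1.0 + math.tanh(z / eps))


for z in (-0.5, -0.05, 0.0, 0.05, 0.5):
    print(f"z={z:+.2f}  HF={heaviside(z):.0f}  MAHF={mahf(z):.4f}")

print(mahf(0.0))             # 0.5
print(round(mahf(0.05), 4))  # 0.8808
```

Unlike HF, whose derivative is zero almost everywhere, the MAHF changes smoothly around z = 0, so it can pass a useful gradient to pixels with uncertain predictions.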

Figure 2 shows the whole training scheme for calculating the level set loss. The network can be trained end-to-end using backpropagation with our proposed loss. Since our loss function is differentiable with respect to the network outputs, as shown in eq. (7), gradients are available throughout training.

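$$
\begin{aligned}
\frac{\partial E_{LS}}{\partial \phi_l} &= \delta^*_\epsilon(\phi_l) \left[ (G_l - c_{l,1})^2 - (G_l - c_{l,2})^2 \right], \\
\delta^*_\epsilon(z) &= \frac{\partial H^*_\epsilon(z)}{\partial z} = \frac{1}{2\epsilon} \left(1 - \tanh\left(\frac{z}{\epsilon}\right)\right) \left(1 + \tanh\left(\frac{z}{\epsilon}\right)\right).
\end{aligned}
\tag{7}
$$

To help the network learn more discriminative features of objects, we combine our loss function with the cross-entropy loss as follows: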

$$
\mathrm{Loss} = E_{CE}(P, G) + \lambda \cdot E_{LS}(\phi, G),
\tag{8}
$$

where $\lambda$ is a parameter for weighting the level set loss. With eq. (8), the level set loss decreases as mIOU increases, as shown in Fig. 4 (b) and (c). Also, the CE loss with the LS loss reaches a lower value than that of the baseline (without LS loss), as shown in Fig. 4 (a). These statistics imply that the level set loss trains the network in a different way from the CE loss. This is possible due to the property of the energy function, eq. (1), which considers the overall spatial information of an image.
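The gradient in eq. (7) can be checked numerically on a toy example. The sketch below compares the analytic expression with a central finite difference of the single-class energy, where pixels are stored in flat lists. The finite difference also varies the means through φ, and the two still agree because each mean minimizes the energy for a fixed φ, so its contribution to the derivative vanishes. All function names are hypothetical, and `total_loss` merely restates eq. (8) with λ = 4 × 10^-4 from Section 4.6 as the default.

```python
import math


def mahf(z, eps):
    return 0.5 * (1.0 + math.tanh(z / eps))


def energy_and_means(phi, g, eps):
    """Single-class energy of eq. (4) and the means of eq. (5)."""
    h = [mahf(z, eps) for z in phi]
    c1 = sum(gi * hi for gi, hi in zip(g, h)) / sum(h)
    c2 = sum(gi * (1 - hi) for gi, hi in zip(g, h)) / sum(1 - hi for hi in h)
    e = sum((gi - c1) ** 2 * hi + (gi - c2) ** 2 * (1 - hi)
            for gi, hi in zip(g, h))
    return e, c1, c2


def analytic_grad(phi, g, eps):
    """Per-pixel gradient according to eq. (7)."""
    _, c1, c2 = energy_and_means(phi, g, eps)
    grads = []
    for z, gi in zip(phi, g):
        t = math.tanh(z / eps)
        delta = (1 - t) * (1 + t) / (2 * eps)
        grads.append(delta * ((gi - c1) ** 2 - (gi - c2) ** 2))
    return grads


def numeric_grad(phi, g, eps, step=1e-6):
    """Central finite difference of the energy with respect to each pixel."""
    grads = []
    for i in range(len(phi)):
        up = phi[:i] + [phi[i] + step] + phi[i + 1:]
        down = phi[:i] + [phi[i] - step] + phi[i + 1:]
        grads.append((energy_and_means(up, g, eps)[0]
                      - energy_and_means(down, g, eps)[0]) / (2 * step))
    return grads


def total_loss(e_ce, e_ls, lam=4e-4):
    """Weighted sum of eq. (8)."""
    return e_ce + lam * e_ls


phi = [0.10, 0.05, -0.05, -0.10]  # shifted probabilities P - 0.5
g = [1.0, 1.0, 0.0, 0.0]          # binary ground truth
ga = analytic_grad(phi, g, eps=0.05)
gn = numeric_grad(phi, g, eps=0.05)
print(max(abs(a - n) for a, n in zip(ga, gn)))  # small, limited by the finite difference
```

The gradient is negative on foreground pixels and positive on background pixels, so a descent step pushes the probabilities toward the ground truth.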

4. Experimental Results

In this section, we show experimental results of the proposed level set loss. We compared our method with different approaches (e.g., encoder-decoder structure networks and CRFs). We selected FCN [24] as the representative architecture for experiments on the encoder-decoder structure network and DeepLab [7] as the representative architecture for experiments on the network with CRFs as post-processing. The comparison with previous similar approaches (i.e., those proposing loss functions for semantic segmentation) is shown in Section 4.6. Our proposed loss achieves better segmentation results than the others. We present analyses of the level set loss with respect to hyperparameters and MAHF in Section 4.7.

4.1. Evaluation Metric and Dataset

The performance was measured in terms of pixel intersection-over-union (IOU) averaged across all classes. We carry out experiments on three public semantic segmentation datasets (PASCAL VOC 2012, PASCAL-Context, Cityscapes).
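As an illustration of the metric, the sketch below computes the mean IOU for a single pair of label maps. It is a simplified assumption-laden example: benchmark evaluations usually accumulate the intersection and union counts over the whole dataset before averaging, and here classes that appear in neither map are skipped. The function name `mean_iou` is hypothetical.

```python
def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between two equally sized label maps."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            if p == g:
                inter[g] += 1
                union[g] += 1
            else:
                # A mismatched pixel enlarges the union of both classes involved.
                union[p] += 1
                union[g] += 1
    # Assumption: classes absent from both maps are left out of the average.
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious)


gt_labels = [[0, 0, 1],
             [0, 1, 1]]
pred_labels = [[0, 0, 1],
               [0, 0, 1]]

# Class 0: 3 / 4, class 1: 2 / 3, mean = 17 / 24.
print(mean_iou(pred_labels, gt_labels, num_classes=2))
```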

4.1.1 PASCAL VOC 2012

We evaluated our framework on the PASCAL VOC 2012 dataset [12], which comprises 20 foreground object classes and one background class. Like previous works, we used extra annotations [14] so that the dataset contains 10,582 training, 1,449 validation and 1,456 test images. The experimental results are reported in Table 1.

4.1.2 PASCAL-Context

The PASCAL-Context [26] dataset is another popular dataset for semantic segmentation evaluation. The dataset contains 59 object classes and one background class. We use 4,998 images for training and 5,105 images for testing. The results are shown in Table 2.


| Method | Baseline | CRFs | LS loss |
|---|---|---|---|
| FCN-32s-ResNet101 | 65.1 | - | 68.9 |
| FCN-16s-ResNet101 | 68.4 | - | 69.2 |
| FCN-8s-ResNet101 | 68.5 | - | 69.3 |
| DeepLab-LargeFOV | 62.8 | 67.3 | 66.7 |
| DeepLab-MSC-LargeFOV | 64.2 | 68.1 | 67.2 |
| DeepLab-ResNet101 | 75.1 | 76.3 | 76.5 |
| DeepLab-MSC-ResNet101 | 76.0 | 77.7 | 77.3 |

Table 1: Performance comparison (mIOU) of our level set loss on representative architectures. We train the FCN-ResNet101 [24] and DeepLab [7]. The Pascal VOC 2012 validation set is used for evaluation.

4.1.3 Cityscapes

The Cityscapes dataset [11] consists of street scene images from 50 different cities. The dataset contains pixel-level annotations of cars, roads, pedestrians, motorcycles, etc. In total, the dataset considers 19 classes. The dataset provides 2,975 images for training and 500 images for validation. The results are shown in Table 3.

4.2. Implementation Details

Our implementation is based on PyTorch [28] and TensorFlow [1]. We trained the network with the standard SGD algorithm of [18]. Also, the ResNet-based networks were pre-trained on MS-COCO [22]. For FCN, the initial learning rate, batch size and crop size were set to (0.004, 10, 512). The network was trained with a momentum of 0.9 and a weight decay of 0.0001. Random horizontal flipping was used for data augmentation. When training the DeepLab framework, the initial learning rate, batch size and crop size were set to (0.004, 10, 512) and (0.00025, 10, 321) for DeepLab-LargeFOV and DeepLab-ResNet101, respectively. To train the VGG16-based network (i.e., DeepLab-LargeFOV), we set the momentum to 0.9 and the weight decay to 0.0001, and we applied horizontal flipping to the input images. In contrast, DeepLab-ResNet101 was trained with a momentum of 0.9 and a weight decay of 0.0005, and we applied random cropping and horizontal flipping to the input images. In addition, we used the "poly" learning rate policy, where the current learning rate equals the initial learning rate multiplied by $(1 - \frac{iter}{max\_iter})^{0.9}$, for training DeepLab-ResNet101.
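The "poly" policy is a one-line formula, sketched below. The initial learning rate 0.00025 is the DeepLab-ResNet101 value given above, while the total number of iterations (20,000) is a hypothetical value chosen only for the illustration.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """'Poly' schedule: base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power


BASE_LR = 0.00025   # initial learning rate for DeepLab-ResNet101
MAX_ITER = 20000    # hypothetical training length, not taken from the paper

for it in (0, 5000, 10000, 15000, 20000):
    print(it, poly_lr(BASE_LR, it, MAX_ITER))
```

The rate starts at the initial value, decays slightly faster than linearly at first, and reaches zero at the final iteration.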

4.3. Running Time for Level Set Loss

We report execution time on an NVIDIA GTX 1080Ti GPU and an Intel i7-8700 processor with 16 GB of RAM. To measure training time, we adopt DeepLab-LargeFOV as the baseline network. In the training phase, it took 0.133 sec/image with the cross-entropy loss only. With our level set loss, it took 0.157 sec/image.

| Method | Baseline | CRFs | LS loss |
|---|---|---|---|
| FCN-8s-ResNet101 | 41.0 | - | 42.2 |
| DeepLab-LargeFOV | 37.1 | 39.7 | 39.4 |
| DeepLab-ResNet101 | 44.7 | 45.7 | 45.5 |

Table 2: mIOU performance on the Pascal-Context test set.

| Method | Baseline | CRFs | LS loss |
|---|---|---|---|
| FCN-8s-ResNet101 | 64.7 | - | 65.5 |
| DeepLab-LargeFOV | 62.9 | 64.1 | 64.0 |
| DeepLab-ResNet101 | 68.8 | 69.8 | 69.8 |

Table 3: mIOU performance on the Cityscapes validation set.

4.4. Effects of Level Set Loss

4.4.1 Fully Convolutional Networks (FCN)

Table 1 summarizes segmentation results (mIOU) of FCNs based on ResNet101 with and without the LS loss. The baselines were trained only with the cross-entropy loss. Using the proposed LS loss consistently yielded performance gains over different FCNs. In particular, FCN-32s-ResNet101 with the LS loss achieved an mIOU of 68.9%, which outperforms the baseline by 3.8%. Note that despite its inherently lower output resolution, the mIOU of FCN-32s-ResNet101 with the LS loss exceeds that of the FCN-8s-ResNet101 baseline by 0.4%. This shows that the LS loss not only guides the object boundary but also encourages learning spatial correlation information. Qualitative results in Fig. 6 further support the spatial context awareness of the network trained with the LS loss.

4.4.2 DeepLab

For the DeepLab implementation, we used the same training scheme as [7]. To show the generality of our loss function, we evaluated our loss on DeepLab-LargeFOV and DeepLab-ResNet101. As shown in Table 1, using the LS loss on DeepLab-LargeFOV increases the mIOU by 3.9% (from 62.8 to 66.7). For DeepLab-ResNet101, the LS loss achieves a 1.4% mIOU improvement (from 75.1 to 76.5). In our experimental results, employing the LS loss achieves performance comparable to using CRFs in DeepLab frameworks. As shown in Fig. 7, the results of CRFs show more accurate object boundaries than the results of the LS loss. However, the LS loss supervises the network to detect complex or small objects. Furthermore, our loss is applied during end-to-end training, while CRFs are used as a post-processing method that requires extra computation.


| DeepLab | Metric | LS loss | CRFs-iter(5) | CRFs-iter(10) |
|---|---|---|---|---|
| LargeFOV | time (sec/img) | 0.025 | 0.272 | 0.387 |
| LargeFOV | relative time | ×1 | ×10.8 | ×15.4 |
| LargeFOV | mIOU | 66.7 | 66.5 | 67.3 |
| ResNet101 | time (sec/img) | 0.054 | 0.315 | 0.452 |
| ResNet101 | relative time | ×1 | ×5.8 | ×8.4 |
| ResNet101 | mIOU | 76.5 | 76.1 | 76.3 |

Table 4: Runtime comparison between using the level set loss and using CRFs on the Pascal VOC 2012 validation set. iter(k) denotes k iterations of CRF inference.

4.5. Comparison with CRFs

Using CRFs gives performance similar to ours, as shown in Tables 1, 2, and 3. However, using CRFs as post-processing requires extensive computational time. In contrast, our algorithm is an end-to-end approach and adds no computation at inference time. We experimentally compare the time consumption of using CRFs and of using the level set loss. In Table 4, the results show that our method is more than 5× faster than CRFs. Furthermore, our method performs better than CRFs run with fewer iterations (iter(5)).

4.6. Comparison with Other Loss Functions

There are only a few works that propose loss functions for the semantic segmentation task. For a fair comparison with previous loss functions, we present the results on DeepLab-ResNet101. Here, we use neither multi-scale input (MSC) nor CRFs, as done in [5] and [13]. We also use the Pascal VOC 2012 segmentation validation set for evaluation.

As shown in Table 5, the network trained with our level set loss performs better than previous methods. Our loss function (with hyperparameters $\epsilon = 1/20$ and $\lambda = 4 \times 10^{-4}$) achieves an mIOU that is 1.4% higher than the baseline. The closest competing method is LMP [5], which is 1.2% higher than the baseline. By training the network with spatial information of the ground truth, our level set loss boosts the mIOU.

4.7. Analysis of the MAHF

In our work, we propose the Modified Approximated Heaviside Function (MAHF) to apply the level set theory in deep learning. We also compared the performance of MAHF with HF and AHF [6]. For the comparison, ε for AHF is set to 20 so that the network shows the best performance. Figure 5 shows the comparison between the various Heaviside functions. Note that HF shows inferior performance since it is prone to getting stuck in local minima. MAHF and AHF achieve comparable performance, but MAHF gives slightly better results than AHF.

| Method | mIOU |
|---|---|
| DeepLab-ResNet101 (Baseline) | 75.1 |
| LAD [13] | 76.1 |
| LMP [5] | 76.3 |
| Level Set Loss (Ours) | 76.5 |

Table 5: Performance comparison with other loss functions for semantic segmentation networks. We present the mIOU reported in [5] and [13]. The Pascal VOC 2012 validation set is used for evaluation.

Figure 5: Comparison between the Modified Approximated Heaviside Function (MAHF: black), the Approximated Heaviside Function (AHF: red) and the Heaviside Function (HF: blue). DeepLab-LargeFOV and the PASCAL VOC validation set are used for the comparison.

5. Conclusion

In this paper, we have proposed the level set loss for CNN-based semantic segmentation. Compared to the existing cross-entropy loss, our proposed loss considers spatial information of the ground truth. The network trained with the level set loss represents the spatial information better and alleviates the typical problems of semantic segmentation. Experimental results on representative networks, FCN and DeepLab, verify the generality of the loss. Furthermore, unlike previous works, the level set loss does not require additional architectures, additional computation at inference time, or additional data.

For future work, we intend to develop further loss functions that take the spatial information of the ground truth into account. The level set loss is just one of them. Our work highlights the value of designing appropriate losses for building high-performance segmentation networks.

ACKNOWLEDGMENT

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2018025409).


(a) Image (b) Ground Truth (c) FCN 32s (d) FCN 8s (e) 32s + LS loss

Figure 6: Qualitative results of FCN-ResNet101 on the Pascal VOC 2012 val set. Our proposed level set loss regularizes the segmentation network to represent global context information. FCN 32s trained with our level set loss shows better performance than FCN 8s.

(a) Image (b) Ground Truth (c) DeepLab (d) with CRFs (e) with LS loss

Figure 7: Qualitative results of DeepLab-ResNet101 on the Pascal VOC 2012 val set. (d) is the result of using CRFs as post-processing. (e) is the result of end-to-end training with our level set loss (no post-processing).


References

[1] M. Abadi et al. Tensorflow: a system for large-scale machine learning. In OSDI. 2016.

[2] D. Adalsteinsson and J. A. Sethian. A fast level set method for propagating interfaces. Journal of computational physics, 118(2):269–277, 1995.

[3] S. S. Al-Amri and N. V. Kalyankar. Image segmentation by using threshold techniques. arXiv:1005.4020, 2010.

[4] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2015.

[5] S. R. Bulo, G. Neuhold, and P. Kontschieder. Loss max-pooling for semantic image segmentation. In CVPR. 2017.

[6] T. Chan and L. Vese. Active contours without edges. IEEE Transactions on image processing, 10(2):266–277, 2001.

[7] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.

[8] L. C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.

[9] L. C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR. 2016.

[10] L. C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv:1802.02611, 2018.

[11] M. Cordts et al. The cityscapes dataset for semantic urban scene understanding. In CVPR. 2016.

[12] M. Everingham, L. V. Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.

[13] J. Guo, P. Ren, A. Gu, J. Xu, and W. Wu. Locally adaptive learning loss for semantic image segmentation. arXiv:1802.08290, 2018.

[14] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV. 2011.

[15] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.

[16] P. Hu, B. Shuai, J. Liu, and G. Wang. Deep level sets for salient object detection. In CVPR. 2017.

[17] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International journal of computer vision, 1(4):321–331, 1988.

[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 2012.

[19] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS. 2011.

[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[21] Y. A. LeCun, L. Bottou, G. B. Orr, and K. R. Müller. Efficient backprop. Neural networks: Tricks of the trade, pages 9–48, 2012.

[22] T. Y. Lin et al. Microsoft coco: Common objects in context. In ECCV. 2014.

[23] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence, 2018.

[24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR. 2015.

[25] F. C. Monteiro and A. Campilho. Watershed framework to region-based image segmentation. In ICPR. 2008.

[26] R. Mottaghi et al. The role of context for object detection and semantic segmentation in the wild. In CVPR. 2014.

[27] S. Osher and J. A. Sethian. Fronts propagating with curvature-dependent speed: algorithms based on hamilton-jacobi formulations. Journal of computational physics, 79(1):12–49, 1988.

[28] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.

[29] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.

[30] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention. 2015.

[31] O. Russakovsky et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.

[32] T. H. N. Le et al. Reformulating level sets as deep recurrent neural network approach to semantic segmentation. IEEE Transactions on Image Processing, 27(5):2393–2407, 2018.

[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

[34] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. Learning a discriminative feature network for semantic segmentation. arXiv:1804.09337, 2018.

[35] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR. 2017.