

EdgeSegNet: A Compact Network for Semantic Segmentation

Zhong Qiu Lin (1, 2), Brendan Chwyl (2), Alexander Wong (1, 3, 2)

Abstract

In this study, we introduce EdgeSegNet, a compact deep convolutional neural network for the task of semantic segmentation. A human-machine collaborative design strategy is leveraged to create EdgeSegNet, where principled network design prototyping is coupled with machine-driven design exploration to create networks with customized module-level macroarchitecture and microarchitecture designs tailored for the task. Experimental results showed that EdgeSegNet can achieve semantic segmentation accuracy comparable with much larger and computationally complex networks (>20× smaller model size than RefineNet) as well as achieving an inference speed of ∼38.5 FPS on an NVidia Jetson AGX Xavier. As such, the proposed EdgeSegNet is well-suited for low-power edge scenarios.

1. Introduction

A challenging task in the realm of computer vision is semantic segmentation, where the goal is to assign a class label (e.g., road, car, person, etc.) to each pixel of an image. Many recent successes in semantic segmentation have centered around deep learning (LeCun et al., 2015), particularly leveraging deep convolutional neural networks to learn the mapping between input images and output semantic segmentation label maps. Some notable state-of-the-art deep convolutional neural network architectures previously proposed in research literature include RefineNet (Lin et al., 2017), TuSimple (Wang et al., 2017), PSPNet (Zhao et al., 2017), and the DeepLab family of networks (Chen et al., 2018a; 2017; 2018b).

Despite these significant advances in deep convolutional neural networks for the task of semantic segmentation over recent years, the high architectural and computational complexities of such networks pose a big challenge for the widespread deployment in practical, on-device edge scenarios such as on mobile devices, drones, and vehicles where computational, memory, bandwidth, and energy resources are very limited. Therefore, one is motivated to investigate the design of compact deep convolutional neural networks for semantic segmentation tailored for such low-power edge scenarios.

* Equal contribution. 1 University of Waterloo, Waterloo, ON, Canada. 2 DarwinAI Corp., Waterloo, ON, Canada. 3 Waterloo Artificial Intelligence Institute, Waterloo, ON, Canada. Correspondence to: Alexander Wong <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

A number of interesting strategies have been proposed in research literature for producing compact deep neural networks that are better suited to low-power on-device usage. These strategies include precision reduction (Jacob et al., 2017; Meng et al., 2017; Courbariaux et al., 2015), model compression (Han et al., 2015; Hinton et al., 2015; Ravi, 2017), and architectural design principles (Howard et al., 2017; Sandler et al., 2018; Iandola et al., 2016; Shafiee et al., 2017; Wong et al., 2018b; Zhang et al., 2017; Ma et al., 2018; He et al., 2015). More recently, an interesting new strategy explored by researchers is the notion of fully automated network architecture search for algorithmically exploring compact deep neural network architecture designs that are better suited for on-device edge and mobile usage. Exemplary automated network architecture search strategies in this direction include MONAS (Hsu et al., 2018), ParetoNASH (Elsken et al., 2018), and MNAS (Tan et al., 2018), which take computational constraints into account during the search process.

In this study, we introduce EdgeSegNet, a compact deep convolutional neural network for the task of semantic segmentation. This is accomplished via a human-machine collaborative design strategy, where human-driven principled network design prototyping is coupled with machine-driven design exploration. Such an approach leads to customized module-level macroarchitecture and microarchitecture designs tailored specifically for semantic segmentation in low-power edge scenarios.

arXiv:1905.04222v1 [cs.CV] 10 May 2019

2. Methods

Here, we introduce EdgeSegNet, a compact deep convolutional neural network for semantic segmentation that was created via a human-machine collaborative design strategy (Wong et al., 2019). To leverage this human-machine collaborative design strategy for building EdgeSegNet, we first perform principled network design prototyping to construct an initial design prototype to act as the base framework. Next, we conduct machine-driven design exploration based on this initial design prototype along with accompanying data and design requirements. We will now discuss each of these design stages, followed by the EdgeSegNet architecture design.

[Figure 1 (diagram not reproduced). Panels: (a) EdgeSegNet Network Architecture; (b) Residual Bottleneck Module; (c) Bottleneck Reduction Module; (d) Refine Module. Block labels in panel (a) include Input 256x256x3, Conv (k: 7x7, s: 2x2), MaxPool (k: 3x3, s: 2x2), ResidualModule (scaling x1/2), BottleneckModule (scaling x1/8), RefineModule (scaling x1), BilinearResize (scaling x2 and x4), Conv (k: 1x1, s: 1x1), and Output 256x256x32. Panel (c) shows Conv (k: 3x3, s: 8x8), ReLU, Conv (k: 1x1, s: 1x1), ReLU, Conv (k: 3x3, s: 1x1).]

Figure 1. The network architecture of EdgeSegNet for semantic segmentation. The underlying architecture is comprised of a heterogeneous mix of residual bottleneck macroarchitectures and non-residual bottleneck macroarchitectures with unique module-level microarchitecture designs. Also notable are selective use of long-range shortcut connectivity, and aggressive reduction via strided convolutions.

2.1. Principled network design prototyping

At the principled network design prototyping stage in creating EdgeSegNet, we construct an initial semantic segmentation network design prototype (denoted as ϕ) based on human-driven design principles to act as a guide for the machine-driven design exploration phase. Inspired by the design principles for building networks for the task of semantic segmentation proposed by Lin et al. (2017), we construct the initial design prototype with a multi-path refinement network architecture that enables improved high-resolution prediction by leveraging long-range shortcut connections. Such long-range shortcut connections enable the high-level semantic modeling in the deep layers to be refined based on fine-grained modeling in the earlier layers.

More specifically, the initial multi-path refinement design prototype for semantic segmentation used in this study is comprised of a number of feature representation modules, with shortcut connections between the modules. Refine modules are interspersed between these feature representation modules to enable the outputs of the deep layers to be refined based on those of earlier layers. The actual macroarchitecture and microarchitecture designs of the individual network modules in the semantic segmentation network architecture are left flexible, so that the machine-driven design exploration phase can determine them automatically based on the given dataset along with human-specified design requirements catered for on-device edge scenarios where computational and memory resources are highly limited.

2.2. Machine-driven design exploration

Given the initial network design ϕ, the module-level macroarchitecture and microarchitecture designs of the proposed EdgeSegNet network architecture are then determined via a machine-driven design exploration stage in our design process, based on the segmentation data at hand as well as human-specified requirements. This stage ensures that the generated microarchitecture and macroarchitecture designs are well-suited for on-device semantic segmentation in edge scenarios.

Machine-driven design exploration is accomplished via generative synthesis (Wong et al., 2018a), which determines the fine-grained macroarchitecture and microarchitecture designs of the individual network modules of the EdgeSegNet network architecture based on data and human-specified design requirements and constraints. The underlying premise behind generative synthesis is to learn a generator G that, given a set of seeds S, can generate networks {N_s | s ∈ S} that maximize a universal performance function U (e.g., (Wong, 2018)) while satisfying requirements defined via an indicator function 1_r(·). This can be formulated as a constrained optimization problem,

$$
G = \max_{G} \, U(G(s)) \quad \text{subject to} \quad 1_r(G(s)) = 1, \;\; \forall s \in S. \tag{1}
$$

An approximate solution G to the constrained optimization problem posed in Eq. 1 can be obtained via iterative optimization, with the initial solution (i.e., G_0) initialized based on ϕ, U, and 1_r(·), and each successive solution G_k achieving a higher U than its predecessor generators (i.e., G_1, ..., G_{k−1}) while constrained by 1_r(·). The resulting solution G can thus be used to generate the final EdgeSegNet network that satisfies 1_r(·).
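The iterative scheme around Eq. 1 can be illustrated with a toy sketch. This is not the generative synthesis procedure of Wong et al. (2018a): the search space is reduced to a single hypothetical integer `width`, and `toy_accuracy`, `toy_score` (standing in for U), and `satisfies_requirements` (standing in for 1_r) are invented purely for illustration. It only shows the shape of the loop: each accepted solution must score higher than its predecessor while still satisfying the constraint.

```python
from __future__ import annotations


def toy_accuracy(width: int) -> float:
    """Hypothetical accuracy proxy that saturates as the network gets wider."""
    return 1.0 - 1.0 / (1.0 + 0.5 * width)


def toy_score(width: int) -> float:
    """Hypothetical stand-in for U: rewards accuracy, penalizes size."""
    return toy_accuracy(width) / width


def satisfies_requirements(width: int, min_accuracy: float = 0.88) -> bool:
    """Stand-in for the indicator function 1_r: True when the constraint holds."""
    return toy_accuracy(width) >= min_accuracy


def iterative_search(initial_width: int, max_steps: int = 100) -> list[int]:
    """Return the sequence of accepted solutions, starting from the prototype.

    Every accepted solution satisfies the constraint and has a strictly
    higher score than the one before it.
    """
    if not satisfies_requirements(initial_width):
        raise ValueError("the initial design must satisfy the requirements")
    history = [initial_width]
    for _ in range(max_steps):
        current = history[-1]
        # Consider neighbouring designs that are feasible and strictly better.
        candidates = [
            w
            for w in (current - 1, current + 1)
            if w >= 1
            and satisfies_requirements(w)
            and toy_score(w) > toy_score(current)
        ]
        if not candidates:
            break  # no feasible improvement: stop
        history.append(max(candidates, key=toy_score))
    return history


history = iterative_search(initial_width=40)
print(history[0], history[-1], len(history))
```

Starting from a width of 40, the toy search shrinks the design step by step and stops at 15, the smallest width whose accuracy proxy still meets the 0.88 threshold.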

Here, we configure the indicator function 1_r(·) such that the accuracy is ≥ 88% on the Cambridge-driving Labeled Video Database (CamVid) (Brostow et al., 2008), a dataset introduced for evaluating semantic segmentation with 32 different semantic classes, so that it is within 3% of ResNet-101 RefineNet (Lin et al., 2017), a state-of-the-art network.

3. EdgeSegNet Architectural Design

The network architecture of the proposed EdgeSegNet for semantic segmentation is shown in Fig. 1a. A number of interesting observations can be made about the module-level macroarchitecture design of the customized modules of EdgeSegNet that were created via a human-machine collaborative design strategy.

3.1. Macroarchitecture heterogeneity

The most obvious and notable observation about the proposed EdgeSegNet network architecture is that it is comprised of a heterogeneous mix of residual bottleneck macroarchitectures with shortcut connections and non-residual bottleneck macroarchitectures. The use of bottleneck macroarchitectures enables channel dimensionality to be decreased at a compression convolutional layer using 1×1 convolutions before being restored at a later convolutional layer, thus reducing the architectural and computational complexity of the network while preserving modeling performance.
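To make the saving from a bottleneck concrete, the following sketch counts convolution weights for a plain 3×3 layer and for a 1×1 reduce, 3×3, 1×1 restore stack. The channel counts (192 reduced to 48) are hypothetical values chosen for illustration and are not taken from EdgeSegNet; biases and normalization parameters are ignored.

```python
def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Number of weights in a kernel x kernel convolution, ignoring biases."""
    return kernel * kernel * c_in * c_out


def plain_params(channels: int) -> int:
    """A single 3x3 convolution that keeps the channel count unchanged."""
    return conv_params(3, channels, channels)


def bottleneck_params(channels: int, reduced: int) -> int:
    """1x1 compression, 3x3 convolution at reduced width, 1x1 restoration."""
    return (
        conv_params(1, channels, reduced)
        + conv_params(3, reduced, reduced)
        + conv_params(1, reduced, channels)
    )


# Hypothetical channel counts, for illustration only.
channels, reduced = 192, 48
plain = plain_params(channels)
bottleneck = bottleneck_params(channels, reduced)
print(plain, bottleneck, round(plain / bottleneck, 2))
```

With these assumed widths the bottleneck stack uses 39,168 weights against 331,776 for the plain layer, roughly 8.5 times fewer.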

3.2. Selective long-range shortcut connectivity

The second notable observation about the proposed EdgeSegNet network architecture is that long-range shortcut connections only exist for a subset of possible combinations of layers, leading to only some of the high-level semantic modeling at the deep layers being refined based on fine-grained modeling at the earlier layers. Not only does this reduction in long-range shortcut connectivity reduce the architectural complexity of the network, it may also indicate that only certain scales benefit from refinement.

3.3. Aggressive reduction via strided convolutions

The third notable observation about the proposed EdgeSegNet network architecture is that the non-residual bottleneck reduction module macroarchitecture leverages 8×8 strided convolutions, and as such achieves very aggressive reduction of spatial dimensionality into the next layer. This dimensionality reduction property of the non-residual bottleneck reduction module macroarchitecture significantly reduces the architectural and computational complexity of the network.
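The effect of an 8×8 stride on spatial size follows from the standard convolution output-size formula. The sketch below applies it to a 3×3 kernel; the 128×128 input and the padding of 1 are assumptions made for illustration rather than values stated in the text.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed input size and padding, for illustration only.
input_size, kernel, padding = 128, 3, 1

out_stride_8 = conv_output_size(input_size, kernel, stride=8, padding=padding)
out_stride_2 = conv_output_size(input_size, kernel, stride=2, padding=padding)

# Ratio of spatial positions before and after the stride-8 convolution.
position_reduction = (input_size * input_size) // (out_stride_8 * out_stride_8)
print(out_stride_8, out_stride_2, position_reduction)
```

Under these assumptions a stride of 8 maps 128×128 to 16×16, a 64-fold reduction in spatial positions, whereas a stride of 2 would only reach 64×64.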

4. Results and Discussion

The efficacy of the proposed EdgeSegNet for semantic segmentation in on-device edge scenarios was evaluated using the Cambridge-driving Labeled Video Database (CamVid) (Brostow et al., 2008), a dataset introduced for evaluating performance of deep neural networks for semantic segmentation with 32 different semantic classes. Furthermore, we report the model size as well as the inference speed on an NVidia Jetson AGX Xavier module. For comparison purposes, the results for ResNet-101 RefineNet (Lin et al., 2017), a state-of-the-art semantic segmentation network, are also presented.

Figure 2. An example semantic segmentation label map produced using EdgeSegNet on a CamVid video. It can be observed that strong visual segmentation results can be achieved.

Table 1. Performance of tested semantic segmentation networks on CamVid

Model        Acc (%)   Speed [1] (FPS)   Size (Mb)
RefineNet    90.3%     - [2]             343
EdgeSegNet   89.7%     38.5              16.7

[1] Computed on NVidia Jetson AGX Xavier
[2] Too large to run due to insufficient memory

As shown in Table 1, the proposed EdgeSegNet achieved similar accuracy compared to ResNet-101 RefineNet (difference of just 0.6%), but is >20× smaller in terms of model size compared to RefineNet. More interestingly, EdgeSegNet achieved an inference speed of ∼38.5 FPS on an NVidia Jetson AGX Xavier module running at 1.37 GHz with 512 CUDA cores, while RefineNet was too large to run due to insufficient memory (for context, RefineNet runs at just ∼28 FPS on an NVidia GTX 1080Ti running at 1.4 GHz with 3584 CUDA cores). An example semantic segmentation label map produced using EdgeSegNet on a CamVid video is shown in Fig. 2. It can be observed that strong visual segmentation results can be achieved using the proposed EdgeSegNet.
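The headline comparisons can be checked directly from the numbers in Table 1. The sketch below only re-derives ratios from the reported values; the variable names are illustrative, and the per-frame latency is a simple reciprocal of the reported frame rate rather than a separately measured figure.

```python
# Values as reported in Table 1.
refinenet_acc, edgesegnet_acc = 90.3, 89.7      # accuracy in percent
refinenet_size, edgesegnet_size = 343.0, 16.7   # model size as listed
edgesegnet_fps = 38.5                           # on Jetson AGX Xavier

size_ratio = refinenet_size / edgesegnet_size
accuracy_gap = refinenet_acc - edgesegnet_acc
frame_time_ms = 1000.0 / edgesegnet_fps

print(f"size ratio: {size_ratio:.1f}x")
print(f"accuracy gap: {accuracy_gap:.1f} points")
print(f"time per frame: {frame_time_ms:.1f} ms")
```

The size ratio comes out at about 20.5, consistent with the ">20× smaller" claim, and 38.5 FPS corresponds to roughly 26 ms per frame.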

The results of the experiments demonstrate that the proposed EdgeSegNet was able to achieve state-of-the-art performance while being noticeably smaller and requiring significantly fewer computations. As such, EdgeSegNet is well-suited for the purpose of semantic segmentation in on-device edge and mobile scenarios where resources are very limited yet the speed of inference needs to be fast.


References

Brostow, G. et al. Semantic object classes in video: A high-definition ground truth database. In PRL, 2008.

Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018a.

Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018b.

Courbariaux, M., Bengio, Y., and David, J.-P. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.

Elsken, T., Metzen, J. H., and Hutter, F. Multi-objective architecture search for cnns. arXiv preprint arXiv:1804.09081, 2018.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

He, K. et al. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Hsu, C.-H., Chang, S.-H., Juan, D.-C., Pan, J.-Y., Chen, Y.-T., Wei, W., and Chang, S.-C. Monas: Multi-objective neural architecture search using reinforcement learning. arXiv preprint arXiv:1806.10332, 2018.

Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., and Keutzer, K. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.

Jacob, B. et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference. arXiv preprint arXiv:1712.05877, 2017.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.

Lin, G., Milan, A., Shen, C., and Reid, I. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1925–1934, 2017.

Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131, 2018.

Meng, W. et al. Two-bit networks for deep learning on resource-constrained embedded devices. arXiv preprint arXiv:1701.00485, 2017.

Ravi, S. ProjectionNet: Learning efficient on-device deep networks using neural projections. arXiv preprint arXiv:1708.00630, 2017.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.

Shafiee, M. J., Li, F., Chwyl, B., and Wong, A. Squishednets: Squishing squeezenet further for edge device scenarios via deep evolutionary synthesis. NIPS Workshop on Machine Learning on the Phone and other Consumer Devices, 2017.

Tan, M., Chen, B., Pang, R., Vasudevan, V., and Le, Q. V. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.

Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., and Cottrell, G. Understanding convolution for semantic segmentation. In Proceedings of WACV, 2017.

Wong, A. Netscore: Towards universal metrics for large-scale performance analysis of deep neural networks for practical usage. arXiv preprint arXiv:1806.05512, 2018.

Wong, A., Shafiee, M. J., Chwyl, B., and Li, F. Ferminets: Learning generative machines to generate efficient neural networks via generative synthesis. Advances in Neural Information Processing Systems Workshops, 2018a.

Wong, A., Shafiee, M. J., Li, F., and Chwyl, B. Tiny ssd: A tiny single-shot detection deep convolutional neural network for real-time embedded object detection. Proceedings of the Conference on Computer and Robot Vision, 2018b.


Wong, A., Lin, Z. Q., and Chwyl, B. Attonets: Compact and efficient deep neural networks for the edge via human-machine collaborative design. arXiv preprint arXiv:1903.07209, 2019.

Zhang, X., Zhou, X., Lin, M., and Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.

Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. Pyramid scene parsing network. In Proceedings of WACV, 2017.