



Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather

Mario Bijelic Tobias Gruber Fahim Mannan Florian Kraus Werner Ritter Klaus Dietmayer Felix Heide

Abstract

The fusion of multimodal sensor streams, such as camera, lidar, and radar measurements, plays a critical role in object detection for autonomous vehicles, which base their decision making on these inputs. While existing methods exploit redundant information under good conditions, they fail to do this in adverse weather where the sensory streams can be asymmetrically distorted. These rare “edge-case” scenarios are not represented in available datasets, and existing fusion architectures are not designed to handle them. To address this data challenge we present a novel multimodal dataset acquired by over 10,000 km of driving in northern Europe. Although this dataset is the first large multimodal dataset in adverse weather, with 100k labels for lidar, camera, radar and gated NIR sensors, it does not facilitate training as extreme weather is rare. To this end, we present a deep fusion network for robust fusion without a large corpus of labeled training data covering all asymmetric distortions. Departing from proposal-level fusion, we propose a single-shot model that adaptively fuses features, driven by measurement entropy. We validate the proposed method, trained on clean data, on our extensive validation dataset. The dataset and all models will be published.

1. Introduction

Object detection is a fundamental computer vision problem in autonomous robots, including self-driving vehicles and autonomous drones. Such applications require 2D or 3D bounding boxes of scene objects in challenging real-world scenarios, including complex cluttered scenes, highly varying illumination, and adverse weather conditions. The most promising autonomous vehicle systems rely on redundant inputs from multiple sensor modalities [2], including camera, lidar, radar, and emerging sensors such as FIR [29]. A growing body of work on object detection using convolutional neural networks has enabled accurate 2D and 3D box estimation from such multimodal data, relying typically on the camera and lidar data [59, 11, 53, 66, 61, 42, 35].

Figure 1: Existing object detection methods, including efficient single-shot detectors (SSD) [40], are trained on automotive datasets that are biased towards good weather conditions. While these methods work well in good conditions [19, 2], they fail in rare weather events (top: image-only detection). Lidar-only detectors, such as the same SSD model trained on projected lidar depth, might be distorted due to severe backscatter in fog or snow (center: lidar-only detection). These asymmetric distortions are a challenge for fusion methods that rely on redundant information. The proposed method (bottom: proposed fusion architecture) learns to tackle unseen (potentially asymmetric) distortions in multimodal data without seeing training data of these rare scenarios.

While these existing methods, and the autonomous systems that perform decision making on their outputs, perform well under normal imaging conditions, they fail in adverse weather and imaging conditions. This is because training datasets are biased towards normal conditions, and existing detector architectures are designed to rely only on the redundant information in the undistorted sensory streams. However, they are not designed for harsh scenarios which distort the sensor streams asymmetrically, see Figure 1.


Extreme weather conditions are statistically rare. For example, thick fog is observable only during 0.01 % of typical driving in North America, and even in foggy regions, heavy fog with visibility below 50 m occurs only up to 15 times a year [57]. Figure 2 shows the distribution of real driving data acquired over a two-week period in Sweden, covering 10,000 km driven in February. The naturally biased distribution validates that harsh weather scenarios are only rarely, or not at all, represented in available datasets [60, 19, 2]. Unfortunately, domain adaptation methods [44, 28, 41] do not offer an ad-hoc solution either, because they require target samples, and adverse weather-distorted data is underrepresented in general. Moreover, existing methods are limited to image data and do not extend to multisensor data, e.g., lidar point-cloud data.

Existing fusion methods have been proposed mostly for lidar-camera setups [59, 11, 42, 35, 12], as a result of the limited sensor inputs in existing training datasets [60, 19, 2]. These methods do not struggle with sensor distortions in adverse weather only because of the bias of the training data. They either perform late fusion through filtering after independently processing the individual sensor streams [12], or they fuse proposals [35] or high-level feature vectors [59]. The network architecture of these approaches is thus already designed with the assumption that the data streams are consistent and redundant, i.e., an object appearing in one sensory stream also appears in the other. However, in harsh weather conditions, such as fog, rain or snow, or in difficult lighting conditions, including low light or low-reflectance objects, multimodal sensor configurations can fail asymmetrically. For example, conventional RGB cameras provide unreliable, noisy measurements in low-light scene areas, while scanning lidar sensors provide reliable depth using active illumination. In rain and snow, small particles affect the color image and lidar depth estimates equally through backscatter. Conversely, in foggy or snowy conditions, state-of-the-art pulsed lidar systems are restricted to less than 20 m range due to backscatter, see Figure 3. While relying on lidar measurements only might be a solution for night driving, it is not for adverse weather conditions.

In this work, we propose a multimodal fusion method for object detection in adverse weather, including fog, snow and harsh rain, without having large annotated training datasets available for these scenarios. Specifically, we handle asymmetric measurement corruptions in camera, lidar, radar and gated NIR sensor streams by departing from existing proposal-level fusion methods: we propose an adaptive single-shot deep fusion architecture which exchanges features in intertwined feature extractor blocks. This deep early fusion is steered by the measurement entropy, which is unseen during training. The proposed adaptive fusion allows us to learn models that generalize across scenarios. To validate our approach, we address the bias in existing datasets by introducing a novel multimodal dataset acquired during three months of data collection in northern Europe. This dataset is the first large multimodal driving dataset in adverse weather, with 100k labels for lidar, camera, radar, gated NIR sensor, and FIR sensor. Although the weather bias still prohibits training, this data allows us to validate that the proposed method generalizes robustly to unseen weather conditions with asymmetric sensor corruptions, while being trained on clean data.

Specifically, we make the following contributions:

  • We introduce a multimodal adverse weather dataset covering camera, lidar, radar, gated NIR, and FIR sensor data. The dataset contains rare scenarios, such as heavy fog, heavy snow, and severe rain, during more than 10,000 km of driving in northern Europe.

  • We propose a deep multimodal fusion network which departs from proposal-level fusion and instead adaptively fuses features, driven by measurement entropy.

  • We assess the model on the proposed dataset, validating that it generalizes to unseen asymmetric distortions. The approach outperforms state-of-the-art fusion methods by more than 8% AP in hard scenarios independent of weather, including light fog, dense fog, snow and clear conditions, and it runs in real-time.

2. Related Work

Detection in Adverse Weather Conditions Over the last decade, seminal work on automotive datasets [6, 14, 19, 16, 60] has provided a fertile ground for automotive object detection [11, 9, 59, 35, 40, 20], depth estimation [18, 39, 21], lane detection [26], traffic-light detection [32], road scene segmentation [6, 4], and end-to-end driving models [5, 60]. Although existing datasets fuel this research area, they are biased towards good weather conditions due to geographic location [60] and captured season [19], and thus lack the severe distortions introduced by rare fog, severe snow and rain. A number of recent works provide camera-only captures in such adverse conditions [49, 8, 3]. However, these datasets are very small, with less than 100 captured images [49], and limited to camera-only vision tasks. In contrast, existing autonomous driving applications rely on multimodal sensor stacks, including camera, radar, lidar and emerging sensors such as gated NIR imaging [22, 23], and have to be evaluated on thousands of hours of driving. In this work, we fill this gap and introduce a large-scale evaluation set in order to develop a fusion model for such multimodal inputs that is robust to unseen distortions.

Data Preprocessing in Adverse Weather A large body of work explores methods for the removal of sensor distortions before processing. Especially fog and haze removal from conventional intensity image data has been explored extensively [62, 65, 33, 51, 36, 8, 37]. Fog results in a distance-dependent loss in contrast and color.


Fog removal methods have not only been suggested for display applications [25]; they have also been proposed as preprocessing to improve the performance of downstream semantic tasks [49]. Existing fog and haze removal methods rely on scene priors on the latent clear image and depth to solve the ill-posed recovery. These priors are either hand-crafted [25] and used for depth and transmission estimation separately, or they are learned jointly as part of trainable end-to-end models [37, 31, 67]. Existing methods for fog and visibility estimation [54, 55] have been proposed for camera-based driver-assistance systems. Image restoration approaches have also been applied to deraining [10] or deblurring [36].

Domain Adaptation Another line of research tackles the shift of unlabelled data distributions by domain adaptation [56, 28, 48, 27, 64]. Such methods could be applied to adapt clear labeled scenes to difficult adverse weather scenes [28] or through the adaptation of feature representations [56]. Unfortunately, both of these approaches struggle to generalize because, in contrast to existing domain transfer settings, weather-distorted data in general, not only labeled data, is underrepresented. Moreover, existing methods are limited to image data and have not been applied to multisensor data.

Multisensor Fusion Multisensor feeds in autonomous vehicles are typically fused to exploit varying cues in the measurements [43], to simplify path planning [15], to allow for redundancy in the presence of distortions [45], or to solve joint vision tasks, such as 3D object detection [59]. Existing sensing systems for fully autonomous driving include lidar, camera and radar sensors. As large automotive datasets [60, 19, 2] cover limited sensory inputs, existing fusion methods have been proposed mostly for lidar-camera setups [59, 11, 35, 42]. Methods such as AVOD [35] and MV3D [11] incorporate multiple views from camera and lidar to detect objects. They rely on fusion of pooled regions of interest and hence perform late feature fusion following popular region proposal architectures [47]. In a different line of research, Qi et al. [46] and Xu et al. [59] propose a pipeline model that requires a valid detection output for the camera image and a 3D feature vector extracted from the lidar point cloud. Kim et al. [34] propose a gating mechanism for camera-lidar fusion. In all existing methods, the sensor streams are processed separately in the feature extraction stage, and we show that this prohibits learning redundancies and in fact performs worse than a single sensor stream in the presence of asymmetric measurement distortions.

3. Multimodal Adverse Weather Dataset

To assess object detection in adverse weather, we have acquired a large-scale automotive dataset covering rare adverse weather situations. Table 1 compares our dataset to recent large-scale automotive datasets, such as the Waymo [2], NuScenes [7], KITTI [19] and BDD [63] datasets.

DATASET              KITTI [19]   BDD [63]    Waymo [2]    NuScenes [7]   Ours

SENSOR SETUP
RGB CAMERAS          2            1           5            6              2
RGB RESOLUTION       1242×372     1280×720    1920×1080    1600×900       1920×1024
LIDAR SENSORS        1            ✗           5            1              2
LIDAR RESOLUTION     64           0           64           32             64
RADAR SENSOR         ✗            ✗           ✗            4              1
GATED CAMERA         ✗            ✗           ✗            ✗              1
FIR CAMERA           ✗            ✗           ✗            ✗              1
FRAME RATE           10 Hz        30 Hz       10 Hz        1 Hz/10 Hz     10 Hz

DATASET STATISTICS
LABELED FRAMES       15K          100K        198K         40K            13.5K
2D LABELS            80K          1.47M       7.87M        1.4M           100K
3D LABELS            80K          -           7.87M        1.4M           100K
SCENE TAGS           ✗            ✓           ✗            ✓              ✓
NIGHT TIME           ✗            ✓           ✓            ✓              ✓
LIGHT WEATHER        ✗            ✓           ✗            ✓              ✓
HEAVY WEATHER        ✗            ✗           ✗            ✗              ✓
FOG CHAMBER          ✗            ✗           ✗            ✗              ✓

Table 1: Comparison of the proposed multimodal adverse weather dataset to existing automotive detection datasets (✓ = available, ✗ = not available).

The proposed dataset provides 2D and 3D detection bounding boxes for multimodal data with fine classification of weather, illumination and scene type. A detailed description of the annotation procedures and label specifications is given in the supplemental material. With this cross-weather annotation of multimodal sensor data and broad geographical sampling, it is the only existing dataset that allows for the assessment of our multimodal fusion approach. In the future, we envision researchers developing and evaluating multimodal fusion methods in weather conditions not covered in existing datasets.

Figure 2 provides the weather distribution of the proposed dataset. To collect these statistics, synchronized frames from all sensors were exported at a frame rate of 0.1 Hz and manually annotated. Human annotators were guided to distinguish light from dense fog when the visibility fell below 1 km [1] and 100 m, respectively. If fog occurred together with precipitation, the scenes were labeled as snowy or rainy depending on the road conditions. For our experiments, we combined snow and rainy conditions. Note that the statistics validate the rareness of scenes in heavy adverse weather, which is in agreement with [57] and demonstrates the difficulty and critical nature of obtaining such data for the assessment of truly self-driving vehicles, i.e., without interaction of remote operators outside of geo-fenced areas. We have experienced that extreme adverse weather conditions occur only locally and change very quickly.
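To make this annotation rule concrete, the following minimal helper sketches how a frame could be mapped to a weather tag from its meteorological visibility and precipitation state. The function name, arguments and thresholds mirror the guidance above, but the helper itself is our own illustration, not tooling shipped with the dataset.

```python
def weather_tag(visibility_m: float, precipitation: bool, snowy_road: bool) -> str:
    """Map annotator observations to a weather tag (illustrative helper, not dataset tooling)."""
    if precipitation:
        # Fog co-occurring with precipitation is labeled by the road condition instead.
        return "snow" if snowy_road else "rain"
    if visibility_m < 100:       # dense fog threshold from the annotation guideline
        return "dense fog"
    if visibility_m < 1000:      # light fog threshold (1 km)
        return "light fog"
    return "clear"
```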

The individual weather conditions result in asymmetric perturbations of the various sensor technologies. While standard passive cameras perform well in daytime conditions, their performance degrades at night-time or in challenging illumination settings such as low sun. Meanwhile, active scanning sensors such as lidar and radar are less affected by ambient light changes due to their active illumination and a narrow bandpass on the detector side. On the other hand, active lidar sensors are highly degraded by scattering media such as fog, snow or rain, limiting the maximum perceivable distance to 25 m at fog visibilities below 50 m, see Figure 3.



Figure 2: Right: Geographical coverage of the data collection campaign covering two months and 10,000 km in Germany, Sweden, Denmark and Finland. Top left: Test vehicle setup with top-mounted lidar, gated camera with flash illumination, RGB camera, proprietary radar, FIR camera, weather station and road friction sensor. Bottom left: Distribution of weather conditions throughout the data acquisition. The driving data is highly unbalanced with respect to weather conditions and only contains adverse conditions as rare samples.

Figure 3: Multimodal sensor response of RGB camera, scanning lidar, gated camera and radar in a fog chamber with dense fog. Reference recordings under clear conditions are shown in the first row; recordings in fog with a visibility of 23 m are shown in the second row.

Millimeter-wave radar waves do not scatter strongly in fog [24], but radar currently provides only low azimuthal resolution. Recent gated imagers have shown robust perception in adverse weather [23] and provide high spatial resolution, but lack color information compared to standard imagers. Given these sensor-specific weaknesses and strengths, multimodal data can be crucial for robust detection methods and uncertainty estimation. Next, we describe the sensor setup and capture statistics.

3.1. Multimodal Sensor Setup

We have equipped a test vehicle with sensors covering the visible, mm-wave, NIR and FIR bands, see Figure 2. We measure intensity, depth and weather conditions.

Stereo Camera As visible-wavelength RGB cameras, we use a stereo pair of two front-facing high-dynamic-range automotive RCCB cameras, consisting of two OnSemi AR0230 imagers with a resolution of 1920 × 1024, a baseline of 20.3 cm and 12 bit quantization. The cameras run at 30 Hz and are synchronized for stereo imaging. Using Lensagon B5M8018C optics with a focal length of 8 mm, a field-of-view of 39.6° × 21.7° is obtained.

Gated Camera We capture gated images in the NIR band at 808 nm using a BrightwayVision BrightEye camera operating at 120 Hz with a resolution of 1280 × 720 and a bit depth of 10 bit. The camera provides a similar field-of-view as the stereo camera with 31.1° × 17.8°. Gated imagers rely on a time-synchronized camera and flood-lit flash laser sources [30]. The laser source emits a variable narrow pulse, and the camera captures the laser echo after an adjustable delay.


This significantly reduces backscatter from particles in adverse weather. Furthermore, the high imager speed enables capturing multiple overlapping slices with different range-intensity profiles, encoding depth information that can be extracted in between multiple slices [23]. Following [23], we capture 3 broad slices for depth estimation and additionally 3-4 narrow slices together with their passive correspondence at a system sampling rate of 10 Hz.

Radar For radar sensing, we use a proprietary frequency-modulated continuous wave (FMCW) radar at 77 GHz with 1° angular resolution and distances up to 200 m. The radar provides position-velocity detections at 15 Hz.

Lidar On the roof of the car, we mount two laser scanners from Velodyne, namely an HDL64 S3D and a VLP32C. Both operate at 903 nm and can provide dual returns (strongest and last) at 10 Hz. While the Velodyne HDL64 S3D provides 64 equally distributed scanning lines with an angular resolution of 0.4°, the Velodyne VLP32C offers 32 non-linearly distributed scanning lines. The HDL64 S3D and VLP32C scanners achieve ranges of 100 m and 120 m, respectively.

FIR Camera Thermal images are captured with an Axis Q1922 FIR camera at 30 Hz. The camera offers a resolution of 640 × 480 with a pixel pitch of 17 µm and a noise equivalent temperature difference (NETD) < 100 mK.

Environmental Sensors To collect environmental information, the vehicle has been equipped with an Airmar WX150 weather station that provides temperature, wind speed and humidity, and a proprietary road friction sensor. All sensors are time-synchronized and ego-motion corrected using a proprietary inertial measurement unit (IMU). The overall system provides a sampling rate of 10 Hz.

3.2. Recordings

Real-world Recordings All sensory data has been captured during two test drives in February and December 2019 in Germany, Sweden, Denmark and Finland, lasting two weeks each and covering a distance of 10,000 km under different weather and illumination conditions. A total of 1.4 million frames at a frame rate of 10 Hz have been collected. To balance scene-type coverage, every 100th frame was manually labeled. The resulting annotations contain 5,500 completely clear images, 1,000 captures in dense fog, 1,000 captures in light fog and 4,000 captures in snow. Given the extensive capture effort, this demonstrates that training data in harsh conditions is extremely rare. We tackle this by training only on clear data and relying on measurement entropy to generalize to unseen scenarios. This allows us to use a balanced weather distribution for evaluation.

Controlled Condition Recordings To collect image and range data under controlled conditions, we also provide measurements acquired in a fog chamber. Details on the fog chamber setup can be found in [17, 13]. We have captured 35k frames at a frame rate of 10 Hz and labeled a subset of 1,500 images under 2 different illumination conditions (day/night) and 3 fog densities with meteorological visibilities V of 30 m, 40 m and 50 m. Details are given in the supplemental material, where we also provide comparisons to a simulated dataset using the forward model from [49].

4. Adaptive Deep Fusion

In this section, we describe the proposed adaptive deep fusion architecture that allows for multimodal fusion under unseen asymmetric sensor distortions. We devise our architecture under the real-time processing constraints required for self-driving vehicles and autonomous drones. Specifically, we propose an efficient single-shot fusion architecture.

4.1. Adaptive Multimodal Single-Shot Fusion

The proposed network architecture is shown in Figure 4. It consists of multiple single-shot detection branches, each analyzing one sensor modality.

Data Representation The camera branch uses conventional three-plane RGB inputs, while for the lidar and radar branches, we depart from recent bird's-eye-view (BeV) projection [35] schemes or raw point-cloud representations [59]. BeV projection or point-cloud inputs do not allow for deep early fusion, as the feature representations in the early layers are inherently different from the camera features. Hence, existing BeV fusion methods can only fuse features in a lifted space, subsequent to matching region proposals, but not earlier. Figure 4 visualizes the proposed input data encoding, which aids deep multimodal fusion. Instead of using a naive depth-only input encoding, we provide depth, height and pulse intensity as input to the lidar network. For the radar network, we assume that the radar scans in a 2D plane orthogonal to the image plane and parallel to the horizontal image dimension. Therefore, we consider the radar invariant along the vertical image axis. To aid the multimodal fusion by matching the input projection, we replicate the scan across the horizontal axis. Gated images are transformed to the image plane of the RGB camera using a homography mapping, see the supplemental material. The proposed input encoding allows for position- and intensity-dependent fusion with pixelwise correspondences between the different streams. We encode missing measurement samples with zero intensity.

Feature Extraction As the feature extraction stack in each stream, we use a modified VGG [52] backbone. Similar to [35, 11], we reduce the number of channels by half and cut the network at the conv4 layer. Inspired by [40, 38], we use 6 feature layers from conv4-10 as input to the SSD detection layers. The feature maps decrease in size¹, achieving a feature pyramid for detections at different scales.

¹ We use a feature map pyramid [(24, 78), (24, 78), (12, 39), (12, 39), (6, 20), (3, 10)].
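As an illustration of the image-space lidar encoding described above, the following sketch projects a point cloud into the camera frame and scatters depth, height and intensity into three image planes. It is a minimal sketch under our own assumptions: the array layout and the names `K` (intrinsics) and `T_cam_lidar` (extrinsics) are hypothetical, missing samples remain zero to match the zero-intensity encoding above, and the radar replication and gated-image homography are not shown.

```python
import numpy as np

def lidar_to_image_planes(points, K, T_cam_lidar, h=1024, w=1920):
    """Project lidar points (N, 4: x, y, z, intensity) into the camera image as a
    3-channel (depth, height, intensity) tensor. Illustrative sketch only."""
    out = np.zeros((3, h, w), dtype=np.float32)
    xyz1 = np.concatenate([points[:, :3], np.ones((len(points), 1))], axis=1)
    cam = (T_cam_lidar @ xyz1.T)[:3]          # points in camera coordinates, shape (3, N)
    front = cam[2] > 0                        # keep points in front of the camera
    cam, intensity = cam[:, front], points[front, 3]
    uvw = K @ cam                             # pinhole projection
    u = (uvw[0] / uvw[2]).astype(np.int64)
    v = (uvw[1] / uvw[2]).astype(np.int64)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v = u[inside], v[inside]
    out[0, v, u] = cam[2, inside]             # depth channel (range along the optical axis)
    out[1, v, u] = cam[1, inside]             # height channel (vertical camera coordinate)
    out[2, v, u] = intensity[inside]          # lidar pulse intensity
    return out
```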



Figure 4: Overview of our architecture, consisting of four single-shot detector branches with deep feature exchange and adaptive fusion of lidar, RGB camera, gated camera and radar. All sensory data is projected into the camera coordinate system following Sec. 4.1. To enable a steered fusion in-between sensors, the sensor entropy is provided to each feature exchange block (red). The deep feature exchange blocks (white) interchange information (blue) with the parallel feature extraction blocks. The fused feature maps are analyzed by SSD blocks (orange).

As shown in Figure 4, the activations of the different feature extraction stacks are exchanged. To steer the fusion towards the most reliable information, we provide the sensor entropy to each feature exchange block. We first convolve the entropy, apply a sigmoid, multiply the result with the concatenated input features from all sensors, and finally concatenate the input entropy. The folding of the entropy and the application of the sigmoid generate a multiplication matrix in the interval [0, 1]. This scales the concatenated features for each sensor individually based on the available information. Regions with low entropy can be attenuated, while entropy-rich regions can be amplified in the feature extraction. Doing so allows us to adaptively fuse features in the feature extraction stack itself, which we motivate in depth in the next section.
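A compact PyTorch sketch of one such exchange step is given below. The module name, channel counts and list-based interface are our own assumptions for illustration; the text above only specifies the operation order (convolve the entropy, apply a sigmoid, pointwise multiplication with the concatenated features, concatenate the entropy).

```python
import torch
import torch.nn as nn

class EntropySteeredExchange(nn.Module):
    """One entropy-steered feature exchange step (illustrative sketch, not the released model)."""

    def __init__(self, num_sensors=4, ch=64):
        super().__init__()
        # Fold the stacked entropy maps up to the channel count of the concatenated features.
        self.entropy_conv = nn.Conv2d(num_sensors, num_sensors * ch, kernel_size=3, padding=1)

    def forward(self, feats, entropies):
        # feats: list of [B, ch, H, W], one per sensor; entropies: list of [B, 1, H, W]
        ent = torch.cat(entropies, dim=1)             # [B, S, H, W]
        allf = torch.cat(feats, dim=1)                # [B, S*ch, H, W]
        gate = torch.sigmoid(self.entropy_conv(ent))  # multiplication matrix in (0, 1)
        scaled = allf * gate                          # attenuate low-entropy regions, amplify rich ones
        return torch.cat([scaled, ent], dim=1)        # finally re-append the entropy channels
```

How the fused map is split or convolved by the next feature extraction block of each branch is omitted here; only the gating step itself is shown.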

4.2. Entropy-steered Fusion

To steer the deep fusion towards redundant and reliable information, we introduce an entropy channel in each sensor stream, instead of directly inferring the adverse weather type and strength as in [54, 55]. We estimate the local measurement entropy,

$$\rho = -\sum_{m,n}^{w,h}\sum_{i=0}^{255} p_i^{mn}\,\log\left(p_i^{mn}\right) \quad \text{with} \quad p_i^{mn} = \frac{1}{MN}\sum_{j,k}^{M,N}\delta\!\left(I(m+j,\,n+k)-i\right). \tag{1}$$

The entropy is calculated for each 8-bit binarized stream I with pixel values i ∈ [0, 255] in the proposed image-space data representation. Each stream is split into patches of size M × N = 16 px × 16 px, resulting in a w × h = 1920 px × 1024 px entropy map.

The multimodal entropy maps for two different scenarios are shown in Figure 5: the left scenario shows a scene containing a vehicle, cyclist and pedestrians in a controlled fog chamber. The passive RGB camera and the lidar suffer from backscatter and attenuation with decreasing fog visibilities, while the gated camera suppresses backscatter through gating. Radar measurements are also not substantially degraded in fog. The right scenario in Figure 5 shows a static outdoor scene under varying ambient lighting. In this scenario, active lidar and radar are not affected by changes in ambient illumination. For the gated camera, the ambient illumination disappears, leaving only the actively illuminated areas, while the passive RGB camera degenerates with reduced ambient light.

The steering process is learned purely on clear weather data, which contains the different illumination settings present in day and night-time conditions. No real adverse weather patterns are seen during training. Furthermore, we randomly drop sensor streams with probability 0.5 and set their entropy to a constant zero value.
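During training, this sensor dropout could look like the following sketch; it is our own minimal version, as the text only states the drop probability of 0.5 and the zeroed entropy.

```python
import random

def drop_sensor_streams(feats, entropies, p=0.5):
    """Randomly zero out whole sensor streams and their entropy maps (training-time only)."""
    for i in range(len(feats)):
        if random.random() < p:
            feats[i] = feats[i] * 0.0          # simulate a missing / fully distorted sensor
            entropies[i] = entropies[i] * 0.0  # constant zero entropy signals "no information"
    return feats, entropies
```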

4.3. Loss Functions and Training Details

The number of anchor boxes in different feature layers and their sizes play an important role during training; the chosen configuration is given in the supplemental material. In total, each anchor box with class label $y_i$ and probability $p_i$ is trained using the cross-entropy loss with softmax,

$$H(p) = -\sum_i \left( y_i \log(p_i) + (1-y_i)\log(1-p_i) \right). \tag{2}$$

The loss is split up for positive and negative anchor boxes with a matching threshold of 0.5.



Figure 5: Normalized entropy with respect to the clear reference recording for a gated camera, RGB camera, radar and lidar under varying fog visibilities (left) and changing illumination (right). The entropy has been calculated based on a dynamic scenario within a controlled fog chamber, illustrated in Figure 3 (left), and a static scenario with changing natural illumination settings (right). Quantitative numbers have been calculated following Eq. (1). Note the asymmetric sensor failure for the different sensor technologies. Qualitative results are given below and are connected via arrows to their corresponding fog density/daytime.

For each positive anchor box, the bounding box coordinates x are regressed using a Huber loss H(x), given by

$$H(x) = \begin{cases} x^2/2, & \text{if } |x| < 1,\\ |x| - 0.5, & \text{otherwise}. \end{cases} \tag{3}$$

The total number of negative anchors is restricted to 5× the number of positive examples using hard example mining [40, 50]. All networks are trained from scratch with a constant learning rate and an L2 weight decay of 0.0005.
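The classification part of the training objective, together with the 5:1 hard negative mining, could be sketched as follows. Anchor matching, the box regression branch with the Huber loss, and the exact anchor configuration (given in the supplemental material) are omitted, and all tensor names and shapes are our own illustration.

```python
import torch
import torch.nn.functional as F

def anchor_classification_loss(logits, labels, neg_pos_ratio=5):
    """Softmax cross-entropy over anchors with online hard negative mining (sketch).

    logits: [A, C] class scores per anchor; labels: [A] with 0 = background.
    Only the hardest negatives, up to neg_pos_ratio times the number of positives, contribute.
    """
    per_anchor = F.cross_entropy(logits, labels, reduction="none")
    pos = labels > 0
    num_pos = int(pos.sum())
    neg_losses = per_anchor[~pos]
    num_neg = min(neg_pos_ratio * max(num_pos, 1), neg_losses.numel())
    hard_negatives, _ = neg_losses.topk(num_neg)
    return (per_anchor[pos].sum() + hard_negatives.sum()) / max(num_pos, 1)
```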

5. Assessment

In this section, we validate the proposed fusion model on unseen experimental test data. We compare the method against existing detectors for single sensory inputs and fusion methods, as well as domain adaptation methods. Due to the weather bias of the training data acquisition, we only use the clear weather portion of the proposed dataset for training. We assess the detection performance using our novel multimodal weather dataset as a validation set; see the supplemental material for test and training split details.

In Table 2, we validate the proposed approach, which we dub Deep Entropy Fusion, on real adverse weather data. We report Average Precision (AP) for three difficulty levels (easy, moderate, hard) and evaluate on cars following the KITTI evaluation framework [19] at various fog densities, snow disturbances and clear weather conditions. We compare the proposed model against recent state-of-the-art lidar-camera fusion models, including AVOD-FPN [35] and Frustum PointNets [46], and against variants of the proposed method with alternative fusion or sensory inputs. As baseline variants, we implement two fusion and four single-sensor detectors. In particular, we compare against late fusion with image, lidar, gated and radar features concatenated just before bounding-box regression (Fusion SSD), and early fusion by concatenating all sensory data at the very beginning of one feature extraction stack (Concat SSD). The Fusion SSD network shares the same structure as the proposed model, but without the feature exchange and the adaptive fusion layer. Moreover, we compare the proposed model against an identical SSD branch with single sensory input (Image-only SSD, Gated-only SSD, Lidar-only SSD, Radar-only SSD). All models were trained with identical hyper-parameters and anchors; see the supplemental material for details.

¹ Note that these methods have seen adverse weather data in our validation set.


WEATHER                            clear                  light fog              dense fog              snow/rain
DIFFICULTY                         easy   mod.   hard     easy   mod.   hard     easy   mod.   hard     easy   mod.   hard

DEEP ENTROPY FUSION (THIS WORK)    89.84  85.57  79.46    90.54  87.99  84.90    87.68  81.49  76.69    88.99  83.71  77.85
DEEP FUSION (THIS WORK)            90.07  80.31  77.82    90.60  81.08  79.63    86.77  77.28  73.93    89.25  79.09  70.51
FUSION SSD                         87.73  78.02  69.49    88.33  78.65  76.54    74.07  68.46  63.23    85.49  75.28  67.48
CONCAT. SSD                        86.12  76.62  68.61    87.98  78.24  70.17    77.99  69.16  67.07    83.63  73.65  66.26
ADDA [56]¹                         85.27  70.51  67.86    87.83  78.68  70.38    87.64  78.12  74.37    84.17  74.25  66.86
CYCADA [28]¹                       88.50  77.84  69.56    89.08  79.36  75.58    87.24  77.04  73.38    85.56  74.80  67.22
IMAGE-ONLY SSD                     85.43  75.75  67.79    87.76  78.52  70.43    87.89  78.25  74.96    84.33  74.38  67.01
GATED-ONLY SSD                     77.10  61.95  58.27    80.65  69.64  61.75    75.16  66.76  61.68    77.32  61.31  57.23
LIDAR-ONLY SSD                     73.46  57.32  54.62    68.43  54.82  51.91    28.98  25.24  24.56    67.50  52.26  46.83
RADAR-ONLY SSD                     10.26   8.54   8.23    16.92  13.24  12.66    16.33  13.57  13.00    12.94  10.95  10.40
AVOD-FPN [35]                      66.47  58.71  51.63    60.40  52.51  51.92    33.95  26.29  26.17    59.55  51.91  50.54
FRUSTUM POINTNET [46]              80.06  75.89  67.70    84.06  76.88  75.44    76.69  73.62  68.49    78.34  74.34  66.52

Table 2: Quantitative detection AP on real unseen weather-affected data from our dataset, split across weather conditions and difficulties easy/moderate/hard following [19]. All detection models except the domain adaptation approaches are trained solely on clean data without weather distortions. The best model is highlighted in bold.

Evaluated on adverse weather scenarios, the detection performance decreases for all methods. Note that assessment metrics can increase across weather splits as the scene complexity changes between them. For example, when fewer vehicles participate in road traffic or the distance between vehicles increases in icy conditions, fewer vehicles are occluded. While the performance for image and gated data is almost steady, it decreases massively for lidar data and increases for radar data. The decrease in lidar performance can be explained by the strong backscatter, see the supplemental material. Given the low overall performance of the radar due to its maximum of 100 measurement targets, the reported radar improvements result from the simplified scenes.

Overall, the strong drop in lidar performance in foggy conditions affects the lidar-only detection rate, which drops by 45.38 % AP. Furthermore, it also has a strong impact on the camera-lidar fusion models AVOD, Concat SSD and Fusion SSD. Learned redundancies no longer hold, and these methods even fall below image-only methods.

Two-stage methods, such as Frustum PointNet [46], drop quickly. However, they asymptotically achieve higher results compared to AVOD, because the statistical priors learned for the first stage are based on an image-only SSD, which limits its performance to image-domain priors. AVOD relies on several clear-weather assumptions, such as its importance sampling of boxes filled with lidar data during training, and achieves the lowest fusion performance overall. Moreover, as the fog density increases, the proposed adaptive fusion model outperforms all other methods. Especially under severe distortions, the proposed adaptive fusion layer results in significant margins over the model without it (Deep Fusion). Overall, the proposed method outperforms all baseline approaches. In dense fog, it improves by a margin of 9.69 % compared to the next-best feature-fusion variant.

For completeness, we also compare the proposed model to recent domain adaptation methods. First, we adapt our Image-only SSD features from clear weather to adverse weather following [56]. Second, we investigate style transfer from clear weather to adverse weather utilizing [28] and generate adverse weather training samples from clear-weather input. Note that these methods have an unfair advantage over all other compared approaches, as they have seen adverse weather scenarios sampled from our validation set. Hence, domain adaptation methods cannot in fact be directly applied, as they need target images from a specific domain. Therefore, they do not offer a solution for rare edge cases with limited data. Furthermore, [28] does not model distortions such as fog or snow, see the experiments in the supplemental material. We note that neither synthetic data augmentation following [49] nor image-to-image reconstruction methods that remove adverse weather effects [58] affect the reported margins of the proposed multimodal deep entropy fusion.

6. Conclusion and Future Work

In this paper, we address a critical problem in autonomous driving: multi-sensor fusion in scenarios where annotated data is sparse and difficult to obtain due to the natural weather bias. To assess multimodal fusion in adverse weather, we introduce a novel adverse weather dataset covering camera, lidar, radar, gated NIR, and FIR sensor data. The dataset contains rare scenarios, such as heavy fog, heavy snow, and severe rain, collected during more than 10,000 km of driving in northern Europe. We propose a real-time deep multimodal fusion network which departs from proposal-level fusion and instead fuses adaptively, driven by measurement entropy. The model is trained on clean data only, and we assess the proposed approach using our novel dataset, validating that it outperforms state-of-the-art fusion methods in hard scenarios by more than 8 % AP, independently of the weather conditions. Exciting directions for future research include the development of end-to-end models enabling failure detection and adaptive sensor control, such as noise-level or power-level control in lidar sensors.

References

[1] Federal meteorological handbook no. 1: Surface weather observations and reports. U.S. Department of Commerce / National Oceanic and Atmospheric Administration, 2005.


[2] Waymo open dataset: An autonomous driving dataset, 2019.

[3] C. O. Ancuti, C. Ancuti, R. Timofte, and C. D. Vleeschouwer. O-HAZE: A dehazing benchmark with real hazy and haze-free outdoor images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 867–876, 2018.

[4] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.

[5] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

[6] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In Proceedings of the IEEE European Conference on Computer Vision, pages 44–57. Springer, 2008.

[7] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.

[8] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao. DehazeNet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing, 25(11):5187–5198, 2016.

[9] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of the IEEE European Conference on Computer Vision, pages 354–370. Springer, 2016.

[10] J. Chen, C.-H. Tan, J. Hou, L.-P. Chau, and H. Li. Robust video content alignment and compensation for clear vision through the rain. arXiv preprint arXiv:1804.09555, 2018.

[11] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.

[12] H. Cho, Y.-W. Seo, B. V. Kumar, and R. R. Rajkumar. A multi-sensor fusion system for moving object detection and tracking in urban driving environments. In IEEE International Conference on Robotics and Automation (ICRA), pages 1836–1843. IEEE, 2014.

[13] M. Colomb, J. Dufour, M. Hirech, P. Lacote, P. Morange, and J.-J. Boreux. Innovative artificial fog production device - a technical facility for research activities. 2004.

[14] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[15] D. Dolgov, S. Thrun, M. Montemerlo, and J. Diebel. Path planning for autonomous vehicles in unknown semi-structured environments. The International Journal of Robotics Research, 29(5):485–501, 2010.

[16] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):743–761, 2012.

[17] P. Duthon, F. Bernardin, F. Chausse, and M. Colomb. Methodology used to evaluate computer vision algorithms in adverse weather conditions. Volume 14, pages 2178–2187. Elsevier, 2016.

[18] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.

[19] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, 2012.

[20] R. Girshick. Fast R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1440–1448, 2015.

[21] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279, 2017.

[22] Y. Grauer. Active gated imaging in driver assistance system. Advanced Optical Technologies, 3(2):151–160, 2014.

[23] T. Gruber, F. Julca-Aguilar, M. Bijelic, and F. Heide. Gated2Depth: Real-time dense lidar from gated images. In Proceedings of the IEEE International Conference on Computer Vision, 2019.

[24] S. Hasirlioglu, A. Kamann, I. Doric, and T. Brandmeier. Test methodology for rain influence on automotive surround sensors. In IEEE International Conference on Intelligent Transportation Systems, pages 2242–2247. IEEE, 2016.

[25] K. He, J. Sun, and X. Tang. Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2341–2353, 2011.

[26] A. B. Hillel, R. Lerner, D. Levi, and G. Raz. Recent progress in road and lane detection: a survey. Machine Vision and Applications, 25(3):727–745, 2014.

[27] J. Hoffman, T. Darrell, and K. Saenko. Continuous manifold based adaptation for evolving visual domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 867–874, 2014.

[28] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.

[29] P. Hurney, P. Waldron, F. Morgan, E. Jones, and M. Glavin. Review of pedestrian detection techniques in automotive far-infrared video. IET Intelligent Transport Systems, 9(8):824–832, 2015.

[30] S. Inbar and O. David. Laser gated camera imaging system and method, May 2008.

[31] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks, pages 1125–1134, 2017.

[32] M. B. Jensen, M. P. Philipsen, A. Møgelmose, T. B. Moeslund, and M. M. Trivedi. Vision for looking at traffic lights: Issues, survey, and perspectives. IEEE Transactions on Intelligent Transportation Systems, 17(7):1800–1815, 2016.


[33] S. Ki, H. Sim, J.-S. Choi, S. Kim, and M. Kim. Fully end-to-end learning based conditional boundary equilibrium GAN with receptive field sizes enlarged for single ultra-high resolution image dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 817–824, 2018.

[34] J. Kim, J. Choi, Y. Kim, J. Koh, C. C. Chung, and J. W. Choi. Robust camera lidar sensor fusion via deep gated information fusion network. In IEEE Intelligent Vehicles Symposium, pages 1620–1625. IEEE, 2018.

[35] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander. Joint 3D proposal generation and object detection from view aggregation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1–8. IEEE, 2018.

[36] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas. DeblurGAN: Blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8183–8192, 2018.

[37] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng. An all-in-one network for dehazing and beyond. arXiv preprint arXiv:1707.06543, 2017.

[38] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 936–944, July 2017.

[39] F. Liu, C. Shen, G. Lin, and I. D. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2024–2039, 2016.

[40] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In Proceedings of the IEEE European Conference on Computer Vision, pages 21–37. Springer, 2016.

[41] M. Long, Z. Cao, J. Wang, and M. I. Jordan. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 1640–1650, 2018.

[42] W. Luo, B. Yang, and R. Urtasun. Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3569–3577, 2018.

[43] O. Mees, A. Eitel, and W. Burgard. Choosing smartly: Adaptive multimodal fusion for object detection in changing environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 151–156. IEEE, 2016.

[44] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4500–4509, 2018.

[45] C. Premebida, O. Ludwig, and U. Nunes. Lidar and vision-based pedestrian detection system. Journal of Field Robotics, 26(9):696–711, 2009.

[46] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum PointNets for 3D object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 918–927, 2018.

[47] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks, pages 91–99, 2015.

[48] C. Sakaridis, D. Dai, and L. V. Gool. Guided curriculum model adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 7374–7383, 2019.

[49] C. Sakaridis, D. Dai, and L. Van Gool. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, pages 1–20, 2018.

[50] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2016.

[51] H. Sim, S. Ki, J.-S. Choi, S. Seo, S. Kim, and M. Kim. High-resolution image dehazing with respect to training losses and receptive field sizes, pages 912–919, 2018.

[52] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[53] S. Song and J. Xiao. Deep sliding shapes for amodal 3D object detection in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016.

[54] R. Spinneker, C. Koch, S. Park, and J. J. Yoon. Fast fog detection for camera based advanced driver assistance systems. In IEEE International Conference on Intelligent Transportation Systems, pages 1369–1374, October 2014.

[55] J.-P. Tarel, N. Hautiere, A. Cord, D. Gruyer, and H. Halmaoui. Improved visibility of road scene images under heterogeneous fog. In IEEE Intelligent Vehicles Symposium (IV), pages 478–485, 2010.

[56] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.

[57] G. J. van Oldenborgh, P. Yiou, and R. Vautard. On the roles of circulation and aerosols in the decline of mist and dense fog in Europe over the last 30 years. Atmospheric Chemistry and Physics, 10(10):4597–4609, 2010.

[58] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[59] D. Xu, D. Anguelov, and A. Jain. PointFusion: Deep sensor fusion for 3D bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 244–253, 2018.

[60] H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning of driving models from large-scale video datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2174–2182, 2017.


[61] B. Yang, W. Luo, and R. Urtasun. PIXOR: Real-time 3D object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018.

[62] M.-H. Yang, V. M. Patel, J.-S. Choi, S. Kim, B. Chanda, P. Wang, Y. Chen, A. Alvarez-Gila, A. Galdran, J. Vazquez-Corral, M. Bertalmío, H. S. Demir, and J. Chen. NTIRE 2018 challenge on image dehazing: Methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–12, 2018.

[63] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.

[64] Y. Zhang, P. David, and B. Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 2039–2049, 2017.

[65] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018.

[66] Y. Zhou and O. Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.

[67] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2223–2232, 2017.