
U-shape Transformer for Underwater Image Enhancement

Lintao Peng1, Chunli Zhu2, and Liheng Bian ⋆

Beijing Institute of Technology

Abstract. The light absorption and scattering of underwater impurities lead to poor underwater imaging quality. The existing data-driven underwater image enhancement (UIE) techniques suffer from the lack of a large-scale dataset containing various underwater scenes and high-fidelity reference images. Besides, the inconsistent attenuation in different color channels and space areas is not fully considered for boosted enhancement. In this work, we constructed a large-scale underwater image (LSUI) dataset including 5004 image pairs, and reported a U-shape Transformer network where the transformer model is for the first time introduced to the UIE task. The U-shape Transformer is integrated with a channel-wise multi-scale feature fusion transformer (CMSFFT) module and a spatial-wise global feature modeling transformer (SGFMT) module specially designed for the UIE task, which reinforce the network's attention to the color channels and space areas with more serious attenuation. Meanwhile, in order to further improve the contrast and saturation, a novel loss function combining the RGB, LAB and LCH color spaces is designed following the human vision principle. Extensive experiments on available datasets validate the state-of-the-art performance of the reported technique with more than 2dB superiority. The dataset and demo code are available at https://lintaopeng.github.io/_pages/UIE%20Project%20Page.html.

Keywords: Underwater image enhancement, Transformer, Multi-color space loss function, Underwater image dataset

1 Introduction

Underwater Image Enhancement (UIE) technology[52,44] is essential for obtaining underwater images and investigating the underwater environment, and has wide applications in ocean exploration, biology, archaeology, underwater robots[20] and other fields. However, underwater images frequently have problematic issues such as color casts, color artifacts and blurred details[45]. These issues can be explained by the strong absorption and scattering of light caused by dissolved impurities and suspended matter in the medium (water). Therefore, UIE-related innovations are of great significance in improving the visual quality and merit of images for accurately understanding the underwater world.

⋆ Corresponding author: [email protected]


Fig. 1. Compared with the existing UIE methods, the image produced by our U-shape Transformer has the highest PSNR[23] score and the best visual quality.

In general, the existing UIE methods can be categorized into three types: physical model-based, visual prior-based and data-driven methods. Among them, visual prior-based UIE methods [2,24,14,12,18,19] mainly concentrate on improving the visual quality of underwater images by modifying pixel values from the perspectives of contrast, brightness and saturation. Nevertheless, ignoring the physical degradation process limits the improvement of enhancement quality. In addition, physical model-based UIE methods [16,9,8,50,40,25,5,13,29] mainly focus on the accurate estimation of medium transmission. With the estimated medium transmission and other key underwater imaging parameters such as the homogeneous background light, a clean image can be obtained by reversing a physical underwater imaging model. However, the performance of physical model-based UIE is restricted in complicated and diverse real-world underwater scenes. That is because (1) the model hypothesis is not always plausible in complicated and dynamic underwater environments, and (2) evaluating multiple parameters simultaneously is challenging. More recently, data-driven methods[15,20,31,26,30,51,1,11,28,46,10,47,45], which apply deep learning to the UIE domain, have exhibited impressive performance on the UIE task. However, the existing underwater datasets more or less have disadvantages, such as a small number of images, few underwater scenes, or even not being real-world scenarios, which limits the performance of data-driven UIE methods. Besides, the inconsistent attenuation of underwater images in different color channels and space areas has not been unified in one framework.

In this work, we first built a large-scale underwater image (LSUI) dataset, which covers more abundant underwater scenes and better visual quality reference images than existing underwater datasets[34,28,1,31]. The dataset contains 5004 real-world underwater images, and the corresponding clear images are generated as comparison references. We also provide the semantic segmentation map and medium transmission map for each image. Furthermore, with the prior knowledge that the attenuation of different color channels and space areas in underwater images is inconsistent, we designed a channel-wise multi-scale feature fusion transformer (CMSFFT) and a spatial-wise global feature modeling transformer (SGFMT) based on the attention mechanism, and embedded them in our U-shape Transformer, which is designed based on [21]. Moreover, according to the color space selection experiment shown in the supplementary material and [32,26], we designed a multi-color space loss function including the RGB, LAB and


LCH color spaces. Fig. 1 shows the result of our UIE method and some comparison UIE methods. The main contributions of this paper can be summarized as follows:

– We reported a novel U-shape Transformer dealing with the UIE task, in which the designed channel-wise and spatial-wise attention mechanisms enable the network to effectively remove color artifacts and casts.

– We designed a novel multi-color space loss function combining the RGB, LCH and LAB color-space features, which further improves the contrast and saturation of output images.

– We released a large-scale dataset containing 5004 real underwater images and the corresponding high-quality reference images, semantic segmentation maps, and medium transmission maps, which facilitates further development of UIE techniques.

2 Related work

2.1 Data-driven UIE Methods

Since the pros and cons of physical model-based and visual prior-based UIE methods were discussed in Section 1, this part concerns only data-driven UIE methods.

Current data-driven UIE methods can be divided into two main technical routes: (1) designing an end-to-end module; (2) utilizing deep models directly to estimate physical parameters, and then restoring the clean image based on the degradation model. To alleviate the need for real-world underwater paired training data, Li et al. [31] proposed WaterGAN to generate underwater-like images from in-air images and depth maps in an unsupervised manner, in which the generated dataset is further used to train the WaterGAN. Moreover, [30] exhibited a weakly supervised underwater color transmission model based on CycleGAN[58]. Benefiting from the adversarial network architecture and multiple loss functions, that network can be trained using unpaired underwater images, which refines the adaptability of the network model to underwater scenes. However, the images in the training datasets used by the above methods are not matched real underwater images, which leads to limited enhancement effects in diverse real-world underwater scenes. Recently, Li et al.[28] proposed a gated fusion network named WaterNet, which uses gamma-corrected images, contrast-improved images, and white-balanced images as inputs to enhance underwater images. Yang et al.[53] proposed a conditional generative adversarial network (cGAN) to improve the perceptual quality of underwater images.

The methods mentioned above usually apply existing general-purpose deep neural networks directly to UIE tasks and neglect the unique characteristics of underwater imaging. For example, [30] directly used the CycleGAN [58] network structure, and [28] adopted a simple multi-scale convolutional network. Other models such as UGAN [10], WaterGAN[31] and cGAN[53] still inherited the


disadvantage of GAN-based models, which produce unstable enhancement results. In addition, Ucolor [26] combined the underwater physical imaging model and designed a medium transmission guided model to reinforce the network's response to areas with more severe quality degradation, which could improve the visual quality of the network output to a certain extent. However, physical models sometimes fail in varied underwater environments.

Based on the above, our proposed network aims at generating high visual quality underwater images by properly accounting for the inconsistent attenuation characteristics of underwater images in different color channels and space areas.

2.2 Underwater Image Datasets

The sophisticated and dynamic underwater environment results in extreme difficulties in collecting matched underwater image training data in real-world underwater scenes. Present datasets can be classified into two types. (1) Non-reference datasets. Liu et al. [34] proposed the RUIE dataset, which encompasses varied underwater lighting, depth of field, blurriness and color cast scenes. Akkaynak et al.[1] published a non-reference underwater dataset with a standard color comparison chart. Those datasets, however, cannot be used for end-to-end training for lack of matched clear reference underwater images. (2) Full-reference datasets. Li et al.[31] presented an unsupervised network dubbed WaterGAN to produce underwater-like images using in-air images and depth maps. Similarly, Fabbri et al.[10] used CycleGAN to generate distorted images from clean underwater images based on weakly-supervised distribution transfer. However, these methods rely heavily on training samples and easily produce unrealistic and unnatural artifacts. Li et al.[28] constructed a real UIE benchmark, UIEB, including 890 image pairs, in which reference images were hand-crafted using the existing optimal UIE methods. Although those images are authentic and reliable, the number, content and coverage of underwater scenes are limited. In contrast, our LSUI dataset contains 5004 real underwater image pairs with abundant underwater environments and higher visual quality references.

2.3 Transformers

Although CNN-based UIE methods [28,20,10,47,26] have achieved significant improvement compared with traditional UIE methods, there are still two aspects that limit their further promotion: (1) a uniform convolution kernel is not able to characterize the inconsistent attenuation of underwater images in different color channels and spatial regions; (2) the CNN architecture concentrates more on local features, and is ineffective for long-dependent and global feature modeling.

Recently, the transformer [48] has gained more and more attention. Its content-based interactions between image content and attention weights can be interpreted as spatially varying convolution, and the self-attention mechanism is good at modeling long-distance dependencies and global features. Benefiting from these advantages, transformers have shown outstanding performance in several


vision tasks [7,35,57,56,36]. Compared with previous CNN-based UIE networks, our CMSFFT and SGFMT modules designed based on the transformer can guide the network to pay more attention to the more seriously attenuated color channels and spatial areas. Moreover, by combining CNN with transformer, we achieve better performance with a relatively small number of parameters.

3 Proposed dataset and method

3.1 LSUI Dataset

Data Collection. We collected 8018 underwater images, some of them captured by ourselves and some sourced from existing public datasets [34,1,31,10,28] (all images have been licensed and are used only for academic purposes). Real underwater images with rich water scenes, water types, lighting conditions and target categories were selected to the extent possible, for further generating clear reference images.
Reference Image Generation. The reference images were selected with two rounds of subjective and objective evaluations, to eliminate potential bias to the extent possible. In the first round, inspired by ensemble learning [41], in which multiple weak classifiers can form a strong one, we first use 18 existing optimal UIE methods [2,12,40,9,8,25,5,13,29,20,31,30,51,11,46,47,42,37] to process the collected underwater images successively, and a set of 18 ∗ 8018 images is generated for the next-step optimal reference selection. Unlike [28], to reduce the number of images that need to be selected manually, the non-reference metrics UIQM [39] and UCIQE [54] are adopted to score all generated images with equal weights. Then, the top-three reference images of each original one form a set of size 3 ∗ 8018. Considering individual differences, 20 volunteers with image processing experience were invited to rate images according to the 5 most important judgments of UIE tasks (contrast; saturation; color correction effect; degree of artifacts; degree of over- or under-enhancement) with a score from 0 to 10, where a higher score represents greater satisfaction. The total score of each reference picture is 100 (5 ∗ 20) after normalizing each score to 0-1. The top-one reference image of each raw underwater image was chosen with the highest summation. In addition, images whose highest summation is lower than 70 have been removed from the dataset.
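As a concrete illustration of the first-round objective screening, the sketch below ranks the enhanced candidates of one raw image by an equally weighted combination of UIQM and UCIQE and keeps the top three. The min-max normalization and the callable names uiqm_fn and uciqe_fn are assumptions of this sketch; real UIQM/UCIQE implementations must be supplied.

```python
import numpy as np

def rank_candidates(candidates, uiqm_fn, uciqe_fn, top_k=3):
    """Rank the enhanced candidates of one raw image by equally weighted,
    min-max normalized UIQM and UCIQE scores and keep the top-k.
    `candidates` maps a method name to its output image; `uiqm_fn` and
    `uciqe_fn` are user-supplied (hypothetical) metric callables."""
    names = list(candidates)
    uiqm = np.array([uiqm_fn(candidates[n]) for n in names], dtype=float)
    uciqe = np.array([uciqe_fn(candidates[n]) for n in names], dtype=float)

    def normalize(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    score = 0.5 * normalize(uiqm) + 0.5 * normalize(uciqe)  # equal weights
    order = np.argsort(score)[::-1]                          # best first
    return [names[i] for i in order[:top_k]]
```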

After the first round, some of the generated reference images still had problems such as blur, color cast and noise. So in the second round, we invited volunteers to vote on each reference picture again to identify its remaining problems and determine the corresponding optimization method, and then used appropriate image enhancement methods [56,33,55] to process it. Next, all volunteers were invited to conduct another round of voting to remove image pairs that more than half of the volunteers were dissatisfied with. To improve the utility of the LSUI dataset, we also hand-labeled a segmentation map and generated a medium transmission map for each image. Eventually, our LSUI dataset contains 5004 images and the corresponding high-quality reference images, semantic segmentation maps, and medium transmission maps. Compared with


Fig. 2. The network structure of the U-shape Transformer. The CMSFFT and SGFMT modules specially designed for UIE tasks reinforce the network's attention to the more severely attenuated color channels and spatial regions. The multi-scale connections of the generator and the discriminator make the gradient flow freely between the generator and the discriminator, therefore making the training process more stable.

existing underwater datasets [28,31,34,1], it contains more underwater scenes, biological categories and water types.

3.2 U-shape Transformer

Overall Architecture The overall architecture of the U-shape Transformer is shown in Fig. 2, which includes a CMSFFT & SGFMT based generator and a discriminator.

In the generator, (1) Encoding: In addition to being directly input to the network, the original image is downsampled three times. Then, after a 1*1 convolution, the three downscaled feature maps are input into the convolution blocks of the corresponding scales. The outputs of the four convolutional blocks are the inputs of the CMSFFT and SGFMT. (2) Decoding: After feature remapping, the SGFMT output is directly sent to the first convolutional block. Meanwhile, the four convolutional blocks with varied scales receive the four outputs from the CMSFFT.
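A minimal PyTorch sketch of this multi-scale input path is given below. The channel widths and the bilinear downsampling are assumptions made for illustration; only the three downsampled copies are projected by 1*1 convolutions, matching the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleInput(nn.Module):
    """Builds the three extra encoder inputs: the RGB image is downsampled by
    2x, 4x and 8x, and each copy is lifted by a 1x1 convolution to the
    (assumed) channel width of the matching convolution block."""
    def __init__(self, widths=(128, 256, 512)):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(3, w, kernel_size=1) for w in widths)

    def forward(self, x):
        outs = []
        for i, proj in enumerate(self.proj, start=1):
            xi = F.interpolate(x, scale_factor=1 / 2 ** i, mode='bilinear',
                               align_corners=False)
            outs.append(proj(xi))
        return outs  # feature maps at 1/2, 1/4 and 1/8 resolution
```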

In the discriminator, the input of each of the four convolutional blocks includes: the feature map output by its own upper layer, the feature map of the corresponding size from the decoding part, and the feature map generated by a 1 ∗ 1 convolution after downsampling the reference image to the corresponding size. With the described multi-scale connections, the gradient can flow freely on multiple scales between the generator and the discriminator, so that a stable training process is obtained and the details of the generated images are enriched.
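The sketch below illustrates how one discriminator block could assemble these three inputs. Fusing by element-wise addition, bilinear downsampling of the reference, and matching channel widths between the previous block and the decoder feature map are assumptions of this sketch rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminatorBlockInput(nn.Module):
    """Assembles the input of one discriminator block: the previous block's
    output, the decoder feature map of the same size, and the reference image
    downsampled to that size and projected by a 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.ref_proj = nn.Conv2d(3, channels, kernel_size=1)

    def forward(self, prev_feat, dec_feat, reference):
        h, w = prev_feat.shape[-2:]
        ref = F.interpolate(reference, size=(h, w), mode='bilinear',
                            align_corners=False)
        # element-wise fusion of the three inputs (assumed fusion rule)
        return prev_feat + dec_feat + self.ref_proj(ref)
```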


The detailed structures of the SGFMT and CMSFFT modules are described in the following two subsections.

Fig. 3. Data flow diagram of the SGFMT module.

SGFMT The SGFMT (as shown in Fig. 3) is used to replace the original bottleneck layer of the generator, which can assist the network in modeling the global information and reinforce the network's attention on severely degraded parts. Assume the size of the input feature map is F_in ∈ R^{(H/16)×(W/16)×C}.

To obtain the one-dimensional sequence expected by the transformer, a linear projection is used to stretch the two-dimensional feature map into a feature sequence S_in ∈ R^{(HW/256)×C}. To preserve the valuable position information of each region, a learnable position embedding is merged directly, which can be expressed as

S_in = W ∗ F_in + PE, (1)

where W ∗ F_in represents the linear projection operation and PE represents the position embedding operation.

Then we input the feature sequence S_in to the transformer block, which contains 4 standard transformer layers[48]. Each transformer layer contains a multi-head attention block (MHA) and a feed-forward network (FFN). The FFN includes a normalization layer and a fully connected layer. The output of the l-th layer in the transformer block can be calculated by

S'_l = MHA(LN(S_{l-1})) + S_{l-1},    S_l = FFN(LN(S'_l)) + S'_l, (2)

where LN represents layer normalization and S_l represents the output sequence of the l-th layer in the transformer block. The output feature sequence of the last transformer layer is S_l ∈ R^{(HW/256)×C}, which is restored to a feature map F_out ∈ R^{(H/16)×(W/16)×C} after feature remapping.
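A minimal PyTorch sketch of this module is given below, assuming C = 512 channels, a 256*256 input (so 16*16 = 256 tokens at the bottleneck), 4 transformer layers and 4 attention heads; the head count and feed-forward width are assumptions not specified above.

```python
import torch
import torch.nn as nn

class SGFMT(nn.Module):
    """Sketch of the SGFMT bottleneck: linear projection of the flattened
    feature map plus a learnable position embedding (Eq. 1), four pre-norm
    transformer layers (Eq. 2), and feature remapping back to a 2-D map."""
    def __init__(self, channels=512, tokens=16 * 16, num_layers=4, num_heads=4):
        super().__init__()
        self.proj = nn.Linear(channels, channels)                        # projection W
        self.pos_embed = nn.Parameter(torch.zeros(1, tokens, channels))  # PE
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, dim_feedforward=4 * channels,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, f_in):                              # f_in: (B, C, H/16, W/16)
        b, c, h, w = f_in.shape
        s = self.proj(f_in.flatten(2).transpose(1, 2))    # (B, HW/256, C)
        s = s + self.pos_embed[:, : h * w]                # Eq. (1)
        s = self.encoder(s)                               # 4 transformer layers, Eq. (2)
        return s.transpose(1, 2).reshape(b, c, h, w)      # feature remapping
```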

CMSFFT To reinforce the network's attention on the more seriously attenuated color channels, inspired by [49], we designed the CMSFFT block to replace the skip connections of the original generator's encoding-decoding architecture (Fig. 4). It consists of the following three parts.


Fig. 4. Detailed structure of the CMSFFT module.

Multi-Scale Feature Encoding. The inputs of the CMSFFT are the feature maps F_i ∈ R^{(H/2^i)×(W/2^i)×C_i} (i = 0, 1, 2, 3) with different scales. Different from the linear projection in ViT[6], which is applied directly on the partitioned original image, we use convolution kernels with filter size (P/2^i) × (P/2^i) (i = 0, 1, 2, 3) and stride P/2^i (i = 0, 1, 2, 3) to conduct linear projection on the feature maps of varied scales. In this work, P is set as 32. After that, four feature sequences S_i ∈ R^{d×C_i} (i = 1, 2, 3, 4) are obtained, where d = HW/P^2. These four convolution kernels divide the feature maps into the same number of blocks, while the number of channels C_i (i = 1, 2, 3, 4) remains unchanged. Then, four query vectors Q_i ∈ R^{d×C_i} (i = 1, 2, 3, 4), as well as K ∈ R^{d×C} and V ∈ R^{d×C}, can be obtained by Eq. (3),

Q_i = S_i W_Qi,   K = S W_K,   V = S W_V, (3)

where W_Qi ∈ R^{d×C_i} (i = 1, 2, 3, 4), W_K ∈ R^{d×C} and W_V ∈ R^{d×C} stand for learnable weight matrices; S is generated by concatenating S_i ∈ R^{d×C_i} (i = 1, 2, 3, 4) along the channel dimension, where C = C_1 + C_2 + C_3 + C_4. In this work, C_1, C_2, C_3 and C_4 are set as 64, 128, 256 and 512, respectively.
Channel-Wise Multi-Head Attention (CMHA). The CMHA block has six inputs, which are K ∈ R^{d×C}, V ∈ R^{d×C} and Q_i ∈ R^{d×C_i} (i = 1, 2, 3, 4). The output of the channel-wise attention CA_i ∈ R^{C_i×d} (i = 1, 2, 3, 4) is obtained by

CA_i = SoftMax(IN(Q_i^T K / (2√C))) V^T, (4)

where IN represents the instance normalization operation. This attention operation is performed along the channel axis instead of the classical patch axis[6], which guides the network to pay attention to the channels with more severe image quality degradation. In addition, IN is applied to the similarity maps to help the gradient flow spread smoothly.

The output of the i-th CMHA layer can be expressed as

CMHA_i = (CA_i^1 + CA_i^2 + ... + CA_i^N)/N + Q_i, (5)

where N is the number of heads, which is set as 4 in our implementation.
Feed-Forward Network (FFN). Similar to the forward propagation of [6], the FFN output can be expressed as

O_i = CMHA_i + MLP(LN(CMHA_i)), (6)

where O_i ∈ R^{d×C_i} (i = 1, 2, 3, 4) and MLP stands for a multi-layer perceptron. The operation in Eq. (6) is repeated l times (l = 4 in this work) in sequence to build the l-layer transformer.

Finally, feature remapping is performed on the four output feature sequences O_i (i = 1, 2, 3, 4) to reorganize them into four feature maps F_i ∈ R^{(H/2^i)×(W/2^i)×C_i} (i = 0, 1, 2, 3), which are the inputs of the convolutional blocks in the generator's decoding part.
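A minimal PyTorch sketch of the channel-wise multi-head attention for one scale is given below. Sharing a single query projection across the heads, using per-head key/value projections, and applying the residual of Eq. (5) on Q_i are simplifications and assumptions of this sketch rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelWiseMHA(nn.Module):
    """Channel-wise multi-head attention for one encoder scale (Eqs. 3-5).
    s_i: (B, d, Ci) tokens of scale i; s: (B, d, C) concatenation of all
    scales, where C = C1 + C2 + C3 + C4."""
    def __init__(self, ci, c, num_heads=4):
        super().__init__()
        self.w_q = nn.Linear(ci, ci, bias=False)
        self.w_k = nn.ModuleList(nn.Linear(c, c, bias=False) for _ in range(num_heads))
        self.w_v = nn.ModuleList(nn.Linear(c, c, bias=False) for _ in range(num_heads))
        self.scale = 2.0 * c ** 0.5

    def forward(self, s_i, s):
        q = self.w_q(s_i)                                          # Eq. (3): Q_i
        heads = []
        for w_k, w_v in zip(self.w_k, self.w_v):
            k, v = w_k(s), w_v(s)                                  # Eq. (3): K, V
            sim = torch.einsum('bdi,bdj->bij', q, k) / self.scale  # Q_i^T K / (2*sqrt(C))
            sim = F.instance_norm(sim)                             # IN on the similarity map
            attn = sim.softmax(dim=-1)                             # attention over channels
            heads.append(torch.einsum('bij,bdj->bdi', attn, v))    # Eq. (4)
        return torch.stack(heads).mean(dim=0) + q                  # Eq. (5): average + residual
```

The subsequent layer normalization and MLP of Eq. (6) can then be applied to each output sequence before feature remapping.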

3.3 Loss Function

To take advantage of the wider color gamut of the LAB and LCH color spaces and their more accurate description of color saturation and brightness, we designed a multi-color space loss function combining the RGB, LAB and LCH color spaces to train our network. The image is first converted from RGB space to LAB and LCH space, which reads

L_G(x), A_G(x), B_G(x) = RGB2LAB(G(x)),   L_y, A_y, B_y = RGB2LAB(y), (7)

L_G(x), C_G(x), H_G(x) = RGB2LCH(G(x)),   L_y, C_y, H_y = RGB2LCH(y), (8)

where x, y and G(x) represent the original input, the reference image, and the clear image output by the generator, respectively.

Loss functions in the LAB and LCH color spaces are written as Eq. (9) and Eq. (10).

Loss_LAB(G(x), y) = E_{x,y}[ (L_y − L_G(x))^2 − Σ_{i=1}^{n} Q(A_i^y) log(Q(A_i^{G(x)})) − Σ_{i=1}^{n} Q(B_i^y) log(Q(B_i^{G(x)})) ], (9)

Loss_LCH(G(x), y) = E_{x,y}[ − Σ_{i=1}^{n} Q(L_i^y) log(Q(L_i^{G(x)})) + (C_y − C_G(x))^2 + (H_y − H_G(x))^2 ], (10)

where Q stands for the quantization operator. The L2 loss in the RGB color space Loss_RGB and the perceptual loss Loss_per[22], together with Loss_LAB and Loss_LCH, are the four loss functions for the generator.
Besides, the standard GAN loss function is introduced to minimize the loss between generated and reference pictures, and is written as

L_GAN(G, D) = E_y[log D(y)] + E_x[log(1 − D(G(x)))], (11)


where D represents the discriminator. D aims at maximizing L_GAN(G, D) to accurately distinguish the generated image from the reference image, while the goal of the generator G is to minimize the loss between generated and reference pictures. Then, the final loss function is expressed as

G* = arg min_G max_D L_GAN(G, D) + α Loss_LAB(G(x), y) + β Loss_LCH(G(x), y) + γ Loss_RGB(G(x), y) + µ Loss_per(G(x), y), (12)

where α, β, γ and µ are hyperparameters, which are set as 0.001, 1, 0.1 and 100, respectively, based on extensive experiments.
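A simplified sketch of the color-space part of this loss is given below, assuming the kornia library for the RGB→LAB conversion and deriving LCH from LAB (C = sqrt(a^2 + b^2), H = atan2(b, a)). For brevity, the quantized cross-entropy terms of Eqs. (9)-(10) are replaced by plain L2 distances, and the adversarial and perceptual terms of Eq. (12) are omitted.

```python
import torch
import torch.nn.functional as F
import kornia.color as KC  # assumed dependency for the RGB -> LAB conversion

def lab_to_lch(lab):
    """Derive LCH from LAB: C = sqrt(a^2 + b^2), H = atan2(b, a) in radians."""
    L, a, b = lab[:, 0:1], lab[:, 1:2], lab[:, 2:3]
    C = torch.sqrt(a ** 2 + b ** 2 + 1e-8)
    H = torch.atan2(b, a)
    return torch.cat([L, C, H], dim=1)

def multi_color_space_loss(pred_rgb, ref_rgb, alpha=0.001, beta=1.0, gamma=0.1):
    """Simplified color-space part of Eq. (12): L2 distances in RGB, LAB and
    LCH, weighted by alpha, beta and gamma. Inputs are (B, 3, H, W) in [0, 1]."""
    loss_rgb = F.mse_loss(pred_rgb, ref_rgb)
    pred_lab, ref_lab = KC.rgb_to_lab(pred_rgb), KC.rgb_to_lab(ref_rgb)
    loss_lab = F.mse_loss(pred_lab, ref_lab)
    loss_lch = F.mse_loss(lab_to_lch(pred_lab), lab_to_lch(ref_lab))
    return alpha * loss_lab + beta * loss_lch + gamma * loss_rgb
```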

4 Experiments

4.1 Experiment Settings

Benchmarks. The LSUI dataset was randomly divided into Train-L (4500 images) and Test-L504 (504 images) for training and testing, respectively. The training set was augmented by cropping, rotating and flipping the existing images. All images were resized to a fixed size (256*256) when input to the network, and pixel values were normalized to [0, 1]. Besides Train-L, the second training set Train-U contains 800 pairs of underwater images from UIEB[28] and 1250 synthetic underwater images from [27]; the third training set Train-E contains the paired training images in the EUVP[20] dataset. Testing datasets are categorized into two types: (1) full-reference testing datasets: Test-L504 and Test-U90 (the remaining 90 pairs in UIEB); (2) non-reference testing datasets: Test-U60 and SQUID. Here, Test-U60 includes the 60 non-reference images in UIEB, and 16 pictures from SQUID[1] form the second non-reference testing dataset.
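The preprocessing described above could be sketched with torchvision as follows; the concrete crop scale and rotation range are assumptions, and for paired training the same random parameters must be applied to the raw image and its reference (e.g., via torchvision.transforms.functional).

```python
from torchvision import transforms

# Assumed training-time preprocessing: rotation, crop and flip for
# augmentation, resizing to 256x256, and ToTensor, which scales pixel
# values to [0, 1].
train_transform = transforms.Compose([
    transforms.Resize((286, 286)),
    transforms.RandomRotation(degrees=10),
    transforms.RandomCrop(256),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

test_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])
```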

Compared Methods. We compare the U-shape Transformer with 10 UIE methods to verify its performance superiority. These include two physical model-based methods (UIBLA[40], UDCP[9]), three visual prior-based methods (Fusion[2], Retinex based[12], RGHS[18]), and five data-driven methods (WaterNet[28], FUnIE[20], UGAN[10], UIE-DAL[47], Ucolor[26]).

Evaluation Metrics. For the testing datasets with reference images, we conducted full-reference evaluations using the PSNR[23] and SSIM[17] metrics. These two metrics reflect the proximity to the reference: a higher PSNR value represents closer image content, and a higher SSIM value reflects more similar structure and texture. For images in the non-reference testing datasets, the non-reference evaluation metrics UCIQE[54] and UIQM[39] are employed, in which a higher UCIQE or UIQM score suggests better human visual perception. Because UCIQE and UIQM cannot accurately measure the performance in some cases [28,3], we also conducted a survey following [26], whose results are reported as the "perception score (PS)". PS ranges from 1 to 5, with higher scores indicating higher image quality. Moreover, NIQE [38], whose lower value represents higher visual quality, is also adopted as a metric.
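For reference, the full-reference metrics can be computed with scikit-image as sketched below (assuming uint8 RGB arrays); the non-reference metrics UCIQE, UIQM and NIQE require dedicated implementations and are not shown.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(enhanced, reference):
    """PSNR and SSIM between an enhanced image and its reference,
    both given as uint8 RGB arrays of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(reference, enhanced, data_range=255)
    ssim = structural_similarity(reference, enhanced,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim
```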


Table 1. Dataset evaluation results. The highest PSNR and SSIM scores are marked in red.

Methods | Training Data | Test-U90 PSNR | Test-U90 SSIM | Test-L504 PSNR | Test-L504 SSIM
U-net[43] | Train-U | 17.07 | 0.76 | 19.19 | 0.79
U-net[43] | Train-E | 17.46 | 0.76 | 19.45 | 0.78
U-net[43] | Train-L (Ours) | 20.14 | 0.81 | 20.89 | 0.82
UGAN[10] | Train-U | 20.71 | 0.82 | 19.89 | 0.79
UGAN[10] | Train-E | 20.72 | 0.82 | 19.82 | 0.78
UGAN[10] | Train-L (Ours) | 21.56 | 0.83 | 21.74 | 0.84
Ours | Train-U | 21.25 | 0.84 | 22.87 | 0.85
Ours | Train-E | 21.75 | 0.86 | 23.01 | 0.87
Ours | Train-L (Ours) | 22.91 | 0.91 | 24.16 | 0.93

4.2 Dataset Evaluation

The effectiveness of LSUI is evaluated by retraining the compared methods (U-net[43], UGAN[10] and the U-shape Transformer) on Train-L, Train-U and Train-E. The trained networks were tested on Test-L504 and Test-U90. As shown in Tab. 1, the models trained on our dataset achieve the best PSNR and SSIM. This can be explained by the fact that LSUI contains richer underwater scenes and better visual quality reference images than existing underwater image datasets, which improves the enhancement and generalization ability of the tested networks. Related visual comparisons are attached in the supplementary material.

4.3 Network Architecture Evaluation

Full-Reference Evaluation. The Test-L504 and Test-U90 datasets were used for evaluation. The statistical results and visual comparisons are summarized in Tab. 2 and Fig. 5. We also provide the running time (image size 256*256) of all UIE methods in Tab. 2, as well as the FLOPs and parameter amount of each data-driven UIE method, and we retrained the 5 open-sourced deep learning-based UIE methods on our dataset.
As shown in Tab. 2, our U-shape Transformer demonstrates the best performance on both the PSNR and SSIM metrics with relatively few parameters, FLOPs, and running time. The potential limitations of the 5 data-driven methods are analyzed as follows. The strength of FUnIE[20] lies in being a fast, lightweight model with fewer parameters, which naturally limits its scalability on complex and distorted testing samples. UGAN[10] and UIE-DAL[47] do not consider the inconsistent attenuation characteristics of underwater images. Ucolor's medium transmission map prior cannot effectively represent the attenuation of each area, and simply introducing the concept of multi-color space into the network's encoder part cannot effectively take advantage of it, which causes unsatisfactory results in terms of contrast, brightness, and detailed textures.

The visual comparisons shown in Fig. 5 reveal that the enhancement results of our method are the closest to the reference images, with fewer color


Table 2. Quantitative comparison among different UIE methods on the full-reference testing sets. The highest PSNR and SSIM scores are marked in red. All UIE methods are tested on a PC with an INTEL(R) I5-10500 CPU, 16.0GB RAM, and an NVIDIA GEFORCE GTX 1660 SUPER GPU.

Methods | Test-L504 PSNR↑ | Test-L504 SSIM↑ | Test-U90 PSNR↑ | Test-U90 SSIM↑ | FLOPs↓ | #param.↓ | time↓
UIBLA[40] | 13.54 | 0.71 | 15.78 | 0.73 | × | × | 42.13s
UDCP[9] | 11.89 | 0.59 | 13.81 | 0.69 | × | × | 30.82s
Fusion[2] | 17.48 | 0.79 | 19.04 | 0.82 | × | × | 6.58s
Retinex based[12] | 13.89 | 0.74 | 14.01 | 0.72 | × | × | 1.06s
RGHS[18] | 14.21 | 0.78 | 14.57 | 0.79 | × | × | 8.92s
WaterNet[28] | 17.73 | 0.82 | 19.81 | 0.86 | 193.7G | 24.81M | 0.61s
FUnIE[20] | 19.37 | 0.84 | 19.45 | 0.85 | 10.23G | 7.019M | 0.09s
UGAN[10] | 19.79 | 0.78 | 20.68 | 0.84 | 38.97G | 57.17M | 0.05s
UIE-DAL[47] | 17.45 | 0.79 | 16.37 | 0.78 | 29.32G | 18.82M | 0.07s
Ucolor[26] | 22.91 | 0.89 | 20.78 | 0.87 | 443.85G | 157.4M | 2.75s
Ours | 24.16 | 0.93 | 22.91 | 0.91 | 66.2G | 65.6M | 0.07s

artifacts and high-fidelity object areas. The five selected methods tend to produce color artifacts that deviate from the original color of the object. Among them, UIBLA[40] exhibits severe color casts. Retinex based[12] can improve the image contrast to a certain extent, but cannot remove the color casts and color artifacts effectively. The enhancement result of FUnIE[20] is yellowish and reddish overall. Although UGAN[10] and Ucolor[26] can provide a relatively good color appearance, they are often affected by local over-enhancement, and there are still some color casts in their results.

Table 3. Quantitative comparison among different UIE methods on the non-reference testing sets. The highest scores are marked in red.

Methods | Test-U60 PS↑ | Test-U60 UIQM↑ | Test-U60 UCIQE↑ | Test-U60 NIQE↓ | SQUID PS↑ | SQUID UIQM↑ | SQUID UCIQE↑ | SQUID NIQE↓
Input | 1.46 | 0.82 | 0.45 | 7.16 | 1.23 | 0.81 | 0.43 | 4.93
UIBLA[40] | 2.18 | 1.21 | 0.60 | 6.13 | 2.45 | 0.96 | 0.52 | 4.43
UDCP[9] | 2.01 | 1.03 | 0.57 | 5.94 | 2.57 | 1.13 | 0.51 | 4.47
Fusion[2] | 2.12 | 1.23 | 0.61 | 4.96 | 2.89 | 1.29 | 0.61 | 5.01
Retinex based[12] | 2.04 | 0.94 | 0.69 | 4.95 | 2.33 | 1.01 | 0.66 | 4.86
RGHS[18] | 2.45 | 0.66 | 0.71 | 4.82 | 2.67 | 0.82 | 0.73 | 4.54
WaterNet[28] | 3.23 | 0.92 | 0.51 | 6.03 | 2.72 | 0.98 | 0.51 | 4.75
FUnIE[20] | 3.12 | 1.03 | 0.54 | 6.12 | 2.65 | 0.98 | 0.51 | 4.67
UGAN[10] | 3.64 | 0.86 | 0.57 | 6.74 | 2.79 | 0.90 | 0.58 | 4.56
UIE-DAL[47] | 2.03 | 0.72 | 0.54 | 4.99 | 2.21 | 0.79 | 0.57 | 4.88
Ucolor[26] | 3.71 | 0.84 | 0.53 | 6.21 | 2.82 | 0.82 | 0.51 | 4.32
Ours | 3.91 | 0.85 | 0.73 | 4.74 | 3.23 | 0.89 | 0.67 | 4.24


Fig. 5. Visual comparison of enhancement results sampled from the Test-L504 (LSUI) and Test-U90 (UIEB[28]) datasets. We regard the reference picture as ground truth (GT) to calculate PSNR. More examples can be found in the supplementary material.

Non-reference Evaluation. The Test-U60 and SQUID datasets were utilized for the non-reference evaluation; the statistical results and visual comparisons are shown in Tab. 3 and Fig. 6 (a).

As shown in Tab. 3, our method achieves the highest scores on the PS and NIQE metrics, which confirms our initial idea of considering the human eye's color perception and indicates better generalization to varied real-world underwater scenes. Note that the UCIQE and UIQM scores of all deep learning-based UIE methods are weaker than those of physical model-based or visual prior-based ones, as also reported in [26]. These two metrics are valuable references, but cannot serve as absolute justifications [28,3], because they are insensitive to color artifacts and casts and biased toward some features.

As shown in Fig. 6 (a), UIBLA [40], Retinex [12], FUnIE [20], UGAN [10] and Ucolor [26] exhibit the same problems as in Fig. 5. UIE-DAL [47] exhibits obvious artifacts and color casts. That is because UIE-DAL ignores the inconsistent attenuation characteristics of underwater images. In our method, the reported CMSFFT and SGFMT modules reinforce the network's attention to the color channels and spatial regions with serious attenuation, therefore obtaining high visual quality enhancement results without artifacts and color casts.

4.4 Ablation Study

To prove the effectiveness of each component, we conduct a series of ablation studies on Test-L504 and Test-U90. Four factors are considered, including the CMSFFT, the SGFMT, the multi-scale gradient flow mechanism (MSG), and the multi-color space loss function (MCSL).

All experiments are trained on Train-L. Statistical results are shown in Tab. 4, in which the baseline model (BL) refers to [21] and the full model is the complete U-


Table 4. Statistical results of the ablation study on Test-L504 and Test-U90. The highest scores are marked in red.

Models | Test-L504 PSNR | Test-L504 SSIM | Test-U90 PSNR | Test-U90 SSIM
BL | 19.34 | 0.79 | 19.36 | 0.81
BL+CMSFFT | 22.47 | 0.88 | 21.72 | 0.86
BL+SGFMT | 21.78 | 0.86 | 21.36 | 0.87
BL+MSG | 20.11 | 0.82 | 21.24 | 0.85
BL+MCSL | 21.51 | 0.82 | 20.16 | 0.81
Full Model | 24.16 | 0.93 | 22.91 | 0.91

Fig. 6. (a) Visual comparison of the non-reference evaluation sampled from the Test-U60 (UIEB[28]) dataset. (b) Visual comparison of the ablation study sampled from the Test-L504 dataset.

shape Transformer. In Tab. 4, our full model achieves the best quantitative performance on the two testing datasets, which reflects the effectiveness of the combination of the CMSFFT, SGFMT, MSG, and MCSL modules. As shown in Fig. 6 (b), the enhancement result of the full model has the highest PSNR and the best visual quality. The results of BL+MSG have less noise and fewer artifacts than the BL model because the MSG mechanism helps to reconstruct local details. Thanks to the multi-color space loss function, the overall color of BL+MCSL's result is close to the reference image. The unevenly distributed visualization and artifacts in local areas of BL+MCSL are due to the lack of efficient attention guidance. Although the enhanced results of BL+CMSFFT and BL+SGFMT are evenly distributed, the overall color is not accurate. The four investigated modules each have their particular functionality in the enhancement process, and their integration improves the overall performance of our network.

5 Conclusions

In this work, we released the LSUI dataset, which is the largest real-world underwater dataset with high-fidelity reference images. Besides, we reported a U-shape Transformer network for state-of-the-art enhancement. The network's


CMSFFT and SGFMT modules solve the inconsistent attenuation issue of underwater images in different color channels and space regions, which has not been considered by existing methods. Extensive experiments validate the superior ability of the network to remove color artifacts and casts. Combined with the multi-color space loss function, the contrast and saturation of output images are further improved. Nevertheless, it is impossible to collect images of all complicated scenes, such as deep-ocean low-light scenarios. Therefore, we will introduce other general enhancement techniques, such as low-light boosting [4], in future work.


References

1. Akkaynak, D., Treibitz, T.: Sea-thru: A method for removing water from underwater images. In: CVPR. pp. 1682–1691 (2019). https://doi.org/10.1109/CVPR.2019.00178

2. Ancuti, C., Ancuti, C.O., Haber, T., Bekaert, P.: Enhancing underwater images and videos by fusion. In: CVPR. pp. 81–88 (2012). https://doi.org/10.1109/CVPR.2012.6247661

3. Berman, D., Levy, D., Avidan, S., Treibitz, T.: Underwater single image color restoration using haze-lines and a new quantitative dataset. IEEE TPAMI 43(8), 2822–2837 (2021). https://doi.org/10.1109/TPAMI.2020.2977624

4. Chen, C., Chen, Q., Xu, J., Koltun, V.: Learning to see in the dark. In: CVPR. pp. 3291–3300 (2018). https://doi.org/10.1109/CVPR.2018.00347

5. Chiang, J.Y., Chen, Y.C.: Underwater image enhancement by wavelength compensation and dehazing. IEEE TIP 21(4), 1756–1769 (2012). https://doi.org/10.1109/TIP.2011.2179666

6. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv abs/2010.11929 (2021)

7. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

8. Drews, P.L., Nascimento, E.R., Botelho, S.S., Montenegro Campos, M.F.: Underwater depth estimation and image restoration based on single images. IEEE Comput. Graph. Appl. 36(2), 24–35 (2016). https://doi.org/10.1109/MCG.2016.26

9. Drews Jr, P., do Nascimento, E., Moraes, F., Botelho, S., Campos, M.: Transmission estimation in underwater single images. In: ICCV Workshops. pp. 825–830 (2013). https://doi.org/10.1109/ICCVW.2013.113

10. Fabbri, C., Islam, M.J., Sattar, J.: Enhancing underwater imagery using generative adversarial networks. ICRA pp. 7159–7165 (2018)

11. Fu, X., Fan, Z., Ling, M., Huang, Y., Ding, X.: Two-step approach for single underwater image enhancement. In: ISPACS. pp. 789–794 (2017). https://doi.org/10.1109/ISPACS.2017.8266583

12. Fu, X., Zhuang, P., Huang, Y., Liao, Y., Zhang, X.P., Ding, X.: A retinex-based enhancing approach for single underwater image. In: ICIP. pp. 4572–4576 (2014). https://doi.org/10.1109/ICIP.2014.7025927

13. Galdran, A., Pardo, D., Picon, A., Alvarez-Gila, A.: Automatic red-channel underwater image restoration. JVCIR 26, 132–145 (2015)

14. Ghani, A.S.A., Isa, N.A.M.: Underwater image quality enhancement through composition of dual-intensity images and rayleigh-stretching. In: ICCE. pp. 219–220 (2014). https://doi.org/10.1109/ICCE-Berlin.2014.7034265

15. Guo, Y., Li, H., Zhuang, P.: Underwater image enhancement using a multiscale dense generative adversarial network. IEEE J. Ocean. Eng. 45(3), 862–870 (2019)

16. He, K., Sun, J., Tang, X.: Single image haze removal using dark channel prior. In: CVPR. pp. 1956–1963 (2009). https://doi.org/10.1109/CVPR.2009.5206515


17. Hore, A., Ziou, D.: Image quality metrics: Psnr vs. ssim. In: ICPR. pp. 2366–2369 (2010). https://doi.org/10.1109/ICPR.2010.579

18. Huang, D., Wang, Y., Song, W., Sequeira, J., Mavromatis, S.: Shallow-water image enhancement using relative global histogram stretching based on adaptive parameter acquisition. In: MMM. pp. 453–465. Springer (2018)

19. Iqbal, K., Odetayo, M., James, A., Salam, R.A., Talib, A.Z.H.: Enhancing the low quality images using unsupervised colour correction method. In: IEEE Int. Conf. Syst. Man. Cybern. pp. 1703–1709 (2010). https://doi.org/10.1109/ICSMC.2010.5642311

20. Islam, M.J., Xia, Y., Sattar, J.: Fast underwater image enhancement for improved visual perception. IEEE Robot. Autom. Lett. 5(2), 3227–3234 (2020). https://doi.org/10.1109/LRA.2020.2974710

21. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR. pp. 1125–1134 (2017)

22. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: ECCV (2016)

23. Korhonen, J., You, J.: Peak signal-to-noise ratio revisited: Is simple beautiful? In: QoMEX. pp. 37–38. IEEE (2012)

24. Li, C.Y., Guo, J.C., Cong, R.M., Pang, Y.W., Wang, B.: Underwater image enhancement by dehazing with minimum information loss and histogram distribution prior. IEEE TIP 25(12), 5664–5677 (2016). https://doi.org/10.1109/TIP.2016.2612882

25. Li, C.Y., Guo, J.C., Cong, R.M., Pang, Y.W., Wang, B.: Underwater image enhancement by dehazing with minimum information loss and histogram distribution prior. IEEE TIP 25(12), 5664–5677 (2016). https://doi.org/10.1109/TIP.2016.2612882

26. Li, C., Anwar, S., Hou, J., Cong, R., Guo, C., Ren, W.: Underwater image enhancement via medium transmission-guided multi-color space embedding. IEEE TIP 30, 4985–5000 (2021)

27. Li, C., Anwar, S., Porikli, F.: Underwater scene prior inspired deep underwater image and video enhancement. Pattern Recognition 98, 107038 (2020)

28. Li, C., Guo, C., Ren, W., Cong, R., Hou, J., Kwong, S., Tao, D.: An underwater image enhancement benchmark dataset and beyond. IEEE TIP 29, 4376–4389 (2020). https://doi.org/10.1109/TIP.2019.2955241

29. Li, C., Guo, J., Chen, S., Tang, Y., Pang, Y., Wang, J.: Underwater image restoration based on minimum information loss principle and optical properties of underwater imaging. In: ICIP. pp. 1993–1997 (2016). https://doi.org/10.1109/ICIP.2016.7532707

30. Li, C., Guo, J., Guo, C.: Emerging from water: Underwater image color correction based on weakly supervised color transfer. IEEE Signal Process. Lett. 25(3), 323–327 (2018)

31. Li, J., Skinner, K.A., Eustice, R.M., Johnson-Roberson, M.: Watergan: Unsupervised generative network to enable real-time color correction of monocular underwater images. IEEE Robot. Autom. Lett. 3(1), 387–394 (2017)

32. Li, X., Li, A.: An improved image enhancement method based on lab color space retinex algorithm. In: Li, C., Yu, H., Pan, Z., Pu, Y. (eds.) ICGIP. vol. 11069, pp. 756–765. International Society for Optics and Photonics, SPIE (2019). https://doi.org/10.1117/12.2524449

33. Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: ICCV Workshops. pp. 1833–1844 (October 2021)


34. Liu, R., Fan, X., Zhu, M., Hou, M., Luo, Z.: Real-world underwater enhancement: Challenges, benchmarks, and solutions under natural light. IEEE Trans. Circuits Syst. Video Technol. 30, 4861–4875 (2020)

35. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)

36. Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin transformer (2021)

37. Ma, Z., Oh, C.: A wavelet-based dual-stream network for underwater image enhancement. arXiv preprint arXiv:2202.08758 (2022)

38. Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 20(3), 209–212 (2013). https://doi.org/10.1109/LSP.2012.2227726

39. Panetta, K., Gao, C., Agaian, S.: Human-visual-system-inspired underwater image quality measures. IEEE J. Ocean. Eng. 41(3), 541–551 (2016). https://doi.org/10.1109/JOE.2015.2469915

40. Peng, Y.T., Cosman, P.C.: Underwater image restoration based on image blurriness and light absorption. IEEE TIP 26(4), 1579–1594 (2017)

41. Polikar, R.: Ensemble learning. In: Ensemble machine learning, pp. 1–34. Springer (2012)

42. Qi, Q., Li, K., Zheng, H., Gao, X., Hou, G., Sun, K.: Sguie-net: Semantic attention guided underwater image enhancement with multi-scale perception. arXiv preprint arXiv:2201.02832 (2022)

43. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI. pp. 234–241. Springer (2015)

44. Sahu, P., Gupta, N., Sharma, N.: A survey on underwater image enhancement techniques. IJCA 87(13) (2014)

45. Schettini, R., Corchs, S.: Underwater image processing: State of the art of restoration and image enhancement methods. EURASIP J. Adv. Signal Process. 2010, 1–14 (2010)

46. Song, W., Wang, Y., Huang, D., Tjondronegoro, D.: A rapid scene depth estimation model based on underwater light attenuation prior for underwater image restoration. In: PCM. pp. 678–688. Springer (2018)

47. Uplavikar, P.M., Wu, Z., Wang, Z.: All-in-one underwater image enhancement using domain-adversarial learning. In: CVPR Workshops. pp. 1–8 (2019)

48. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS. pp. 5998–6008 (2017)

49. Wang, H., Cao, P., Wang, J., Zaiane, O.R.: Uctransnet: Rethinking the skip connections in u-net from a channel-wise perspective with transformer (2021)

50. Wang, Y., Liu, H., Chau, L.P.: Single underwater image restoration using adaptive attenuation-curve prior. IEEE Trans. Circuits Syst. I Regul. Pap. 65(3), 992–1002 (2018). https://doi.org/10.1109/TCSI.2017.2751671

51. Yang, H.Y., Chen, P.Y., Huang, C.C., Zhuang, Y.Z., Shiau, Y.H.: Low complexity underwater image enhancement based on dark channel prior. In: IBICA. pp. 17–20 (2011). https://doi.org/10.1109/IBICA.2011.9

52. Yang, M., Hu, J., Li, C., Rohde, G., Du, Y., Hu, K.: An in-depth survey of underwater image enhancement and restoration. IEEE Access 7, 123638–123657 (2019)


53. Yang, M., Hu, K., Du, Y., Wei, Z., Sheng, Z., Hu, J.: Underwater image enhancement based on conditional generative adversarial network. Signal Process., Image Commun. 81, 115723 (2020)

54. Yang, M., Sowmya, A.: An underwater color image quality evaluation metric. IEEE TIP 24(12), 6062–6071 (2015). https://doi.org/10.1109/TIP.2015.2491020

55. Ye, T., Jiang, M., Zhang, Y., Chen, L., Chen, E., Chen, P., Lu, Z.: Perceiving and modeling density is all you need for image dehazing (2021)

56. Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Restormer: Efficient transformer for high-resolution image restoration (2021)

57. Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V.: Point transformer. In: ICCV. pp. 16259–16268 (October 2021)

58. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV. pp. 2242–2251 (2017). https://doi.org/10.1109/ICCV.2017.244


A Overview

This supplementary document first provides more details about the LSUI dataset in Sec. B, then illustrates the training details of the U-shape Transformer in Sec. C, and gives more comparisons in Sec. D. We also exhibit a comparison of the underwater video enhancement results of different underwater image enhancement methods in the supplementary video. Finally, we explain why we choose the LAB, LCH and RGB color spaces to form the multi-color space loss function through color space selection experiments in Sec. E.

Fig. 7. Statistics of our LSUI dataset and the existing underwater dataset UIEB[28].

B LSUI Dataset

As shown in Fig. 7, compared with UIEB[28], our LSUI dataset contains more images and richer underwater scenes and object categories. In particular, our LSUI dataset includes deep-sea scenes and underwater cave scenes that are not available in previous underwater datasets. We provide some examples of our LSUI dataset in Fig. 8, which shows that our LSUI dataset contains rich underwater scenes, water types, lighting conditions and target categories. As far as we know, LSUI is the largest real underwater image dataset with high-quality reference images to date, and we believe it will facilitate the further development of underwater imaging techniques.


Fig. 8. Example images in the LSUI dataset. The top of each image group is the reference image, followed by the original image, semantic segmentation map, and medium transmission map.

C Training Details

We implement the U-shape Transformer in Python with the PyTorch framework on an NVIDIA RTX 3090 GPU under Ubuntu 20. The Adam optimizer is utilized for a total of 800 training epochs with the batch size set as 6. The initial learning rate is set as 0.0005 and 0.0002 for the first 600 epochs and the last 200 epochs, respectively. Besides, the learning rate is decreased by 20% every 40 epochs. For Loss_RGB, the L2 loss is used for the first 600 epochs, and the L1 loss is used for the last 200 epochs.
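The optimization schedule above could be configured as sketched below; building a fresh optimizer per training stage (first 600 epochs, last 200 epochs) and realizing the 20% decrease with a StepLR decay factor of 0.8 are assumptions about how the schedule is implemented.

```python
import torch.nn.functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

def build_stage_optimizer(model, stage):
    """Stage 1: first 600 epochs (base lr 0.0005); stage 2: last 200 epochs
    (base lr 0.0002). The lr is decayed by 20% every 40 epochs in both stages."""
    base_lr = 0.0005 if stage == 1 else 0.0002
    optimizer = Adam(model.parameters(), lr=base_lr)
    scheduler = StepLR(optimizer, step_size=40, gamma=0.8)
    return optimizer, scheduler

def rgb_fidelity_loss(pred, target, epoch):
    """L2 loss on RGB for the first 600 epochs, L1 loss for the last 200."""
    return F.mse_loss(pred, target) if epoch < 600 else F.l1_loss(pred, target)
```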

D More Visual Comparisons

D.1 Visual Examples of Dataset Evaluation

Fig. 9 shows sampled enhancement results of the U-shape Transformer trained on different underwater datasets, as a supplement to the Dataset Evaluation part of the paper. The enhancement results of the model trained on Train-L (a portion of our LSUI dataset) demonstrate the highest PSNR values and preferable visual quality, while the results of models trained on the other datasets show a certain degree of color cast. Owing to the high-quality reference images and rich underwater scenes (lighting conditions, water types and target categories), our constructed LSUI dataset can improve the imaging quality and generalization performance of the UIE network.

D.2 Visual Comparisons of Non-Reference Experiments

Fig. 10 is a supplement to the Non-reference Evaluation of the paper. The enhancement results of our method have the highest PS value, an index that reflects the


Fig. 9. Enhancement results of the U-shape Transformer trained on different underwater datasets. (a): Input images sampled from Test-L504; (b): Enhanced results using the model trained on Train-U; (c): Enhanced results using the model trained on Train-E; (d): Enhanced results using the model trained on our proposed dataset Train-L; (e): Reference images (recognized as ground truth (GT)).

visual quality. In general, the compared methods are unsatisfactory, with undesirable color artifacts, over-saturation and unnatural color casts. Among them, the results of UIBLA [40] and FUnIE [20] are overall reddish and yellowish, respectively. The Retinex based [12] method introduces artifacts and unnatural colors. UGAN [10] and UIE-DAL [47] suffer from local over-enhancement and color artifacts, mainly because they ignore the inconsistent attenuation characteristics of underwater images in different space areas and color channels. Although Ucolor [26] introduces the medium transmission prior to reinforce the network's attention on spatial areas with severe attenuation, it still ignores the inconsistent attenuation characteristics of underwater images in different color channels, which results in an overall color cast.

D.3 Visual Comparisons of Full-Reference Experiments

Fig. 11 is a supplement to the Full-reference Evaluation of the paper. The results of our method are the closest to the reference images, with the best visual quality and the highest PSNR values. The five selected methods exhibit different degrees of color artifacts and casts; their enhancement results differ considerably from the objects' original colors and fall far short of the reference images. Among them, UIBLA[40] does not remove the color casts (biased toward green)


Fig. 10. Visual comparison of non-reference experiments sampled from the Test-U60 dataset (UIEB). From left to right are raw images, the results of UIBLA[40], Retinex based[12], FUnIE[20], UGAN[10], Ucolor[26], UIE-DAL[47], and our U-shape Transformer.

and exhibits severe color artifacts. The Retinex based[12] method can improve the image contrast to a certain extent, but the color of the enhanced image is unnatural. The enhancement result of FUnIE[20] is yellowish overall. Although UGAN[10] and Ucolor[26] can provide a relatively good color appearance, they are often affected by local over-enhancement, and there are still some color casts in their results. The limitation of UGAN and Ucolor is that their network structures do not consider the inconsistent attenuation of underwater images in different color channels and spatial regions.

E Color Space Selection

In order to select the appropriate color spaces to form the multi-color space loss function, we train the U-shape Transformer with mixed loss functions, each composed of a single color space loss function and the other loss functions. We use Train-L to train the network, and then test and calculate PSNR on the Test-L504 and Test-U90 datasets, respectively. The results are shown in Tab. 5.

Table 5. Statistical results of the color space selection experiments. We test U-shape Transformers trained with different color space loss functions on the Test-L504 and Test-U90 datasets, respectively, and the color spaces that obtain the top three PSNR scores are marked with red, green, and blue, respectively.

Color Space | RGB | HSV | HSI | XYZ | LAB | LUV | LCH | YUV
Test-L504 | 23.79 | 23.32 | 23.37 | 22.63 | 23.86 | 22.81 | 23.62 | 23.43
Test-U90 | 22.72 | 22.01 | 22.17 | 21.69 | 22.53 | 21.77 | 22.49 | 22.23


Fig. 11. Visual comparison of full-reference experiments sampled from the Test-L504 dataset. From left to right are raw images, the results of UIBLA[40], Retinex based[12], FUnIE[20], UGAN[10], Ucolor[26], our U-shape Transformer, and the reference image (recognized as ground truth (GT)).

As shown in Tab. 5, we note that the LAB, LCH, and RGB color spaces achieve the top-3 PSNR scores on both test datasets. In the RGB color space, an image is easy to store and display because of its clear physical meaning, but the three components (R, G, and B) are highly correlated and easily affected by brightness, shadows, noise, and other factors. Compared with other color spaces, the LAB color space is more consistent with the characteristics of human vision, can express all colors that human eyes can perceive, and has a more uniform color distribution. The LCH color space can intuitively express brightness, saturation, and hue. Combining the experimental results with the above analysis, we choose the LAB, LCH, and RGB color spaces to form our multi-color space loss function.