Real-time Image Enhancer via Learnable Spatial-aware 3D Lookup Tables: Supplementary Material
Tao Wang*, Yong Li*, Jingyang Peng*, Yipeng Ma, Xian Wang, Fenglong Song†, Youliang Yan†
Huawei Noah's Ark Lab
{wangtao10,liyong156,pengjingyang,mayipeng,wangxian10,songfenglong,yanyouliang}@huawei.com
*Authors contributed equally    †Corresponding author
Contents

The structure of this supplementary material is as follows. Section 1 introduces the structure of our network and the spatial-aware trilinear interpolation. Section 2 gives more details on the definition of the loss functions. More comparison results and additional analysis are presented in Section 3.
1. Methodology

Self-adaptive two-head weight predictor. Our self-adaptive two-head weight predictor adopts a UNet-style structure, consisting of an encoder for feature compression and a decoder for pixel-wise category prediction. Skip connections concatenate output features from encoder layers to the corresponding decoder layers; the detailed structure is illustrated in Figure 1. Additionally, we introduce another predictor for image-level scene categorization, whose input is the compressed global feature from the encoder. Its architecture is listed in Table 1, where a feature size of C × H × W denotes that the corresponding layer produces features with C channels of spatial size H × W, and T is the number of image-level scenes.
Layer                 Feature Size
Global Feature        256 × 1 × 1
FC with LeakyReLU     128 × 1 × 1
Instance Norm         128 × 1 × 1
FC with LeakyReLU     64 × 1 × 1
Instance Norm         64 × 1 × 1
FC with LeakyReLU     T × 1 × 1

Table 1: Network architecture of the scene category predictor in the self-adaptive two-head weight predictor.
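For concreteness, the scene-category head in Table 1 can be read as a small fully connected stack applied to the 256-dimensional global feature. The PyTorch sketch below is only an illustration of that reading: the class name, the vector-wise interpretation of Instance Norm on a 1 × 1 feature map, and the absence of any output normalization are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _instance_norm_vec(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Normalize each sample's feature vector to zero mean / unit variance.
    # How "Instance Norm" acts on a 1 x 1 feature map is ambiguous in Table 1;
    # normalizing across channels is one plausible reading, not the paper's code.
    return (x - x.mean(dim=1, keepdim=True)) / (x.std(dim=1, keepdim=True) + eps)

class SceneCategoryHead(nn.Module):
    """Illustrative sketch (hypothetical name) of the predictor in Table 1."""

    def __init__(self, num_scenes: int):
        super().__init__()
        self.fc1 = nn.Linear(256, 128)        # FC with LeakyReLU, 256 -> 128
        self.fc2 = nn.Linear(128, 64)         # FC with LeakyReLU, 128 -> 64
        self.fc3 = nn.Linear(64, num_scenes)  # FC with LeakyReLU, 64 -> T

    def forward(self, global_feature: torch.Tensor) -> torch.Tensor:
        # global_feature: (B, 256), pooled from the encoder.
        x = _instance_norm_vec(F.leaky_relu(self.fc1(global_feature)))
        x = _instance_norm_vec(F.leaky_relu(self.fc2(x)))
        return F.leaky_relu(self.fc3(x))      # (B, T) image-level scene scores
```

Note that a standard nn.InstanceNorm2d applied to a 1 × 1 spatial map would degenerate to zeros, which is why the sketch normalizes each sample's feature vector across channels instead.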
Spatial-aware trilinear interpolation. The core idea of a 3D LUT is to retouch input images according to a set of compressed parameters (i.e., the LUTs), which means that one element of the 3D lattice may correspond to multiple neighbouring input values. An additional interpolation operation is therefore needed to improve the smoothness of the enhanced results. Considering both efficiency and performance, trilinear-based interpolation is used in our method.
Let $X_{h,w} = \{X_{h,w,r}, X_{h,w,g}, X_{h,w,b}\}$ be a pixel of the input image at location $(h, w)$. Whatever RGB values it has, there are eight nearest neighbours when $X_{h,w}$ is mapped into a 3D LUT. The minimum coordinate $(i, j, k)$ of these eight neighbours in the 3D LUT is obtained through a lookup operation on the pixel's RGB values as follows:

$$i' = \frac{X_{h,w,r}}{\Delta}, \quad j' = \frac{X_{h,w,g}}{\Delta}, \quad k' = \frac{X_{h,w,b}}{\Delta}, \qquad i = \lfloor i' \rfloor, \quad j = \lfloor j' \rfloor, \quad k = \lfloor k' \rfloor \qquad (1)$$
where $\Delta = C_{\max}/(N-1)$, $C_{\max}$ is the maximum color value, and $\lfloor\cdot\rfloor$ is the floor function. The distances between the exact coordinate and the minimum neighbour coordinate are defined as $d^{r}_{i}$, $d^{g}_{j}$, $d^{b}_{k}$:

$$d^{r}_{i} = i' - i, \quad d^{g}_{j} = j' - j, \quad d^{b}_{k} = k' - k, \qquad d^{r}_{i+1} = 1 - d^{r}_{i}, \quad d^{g}_{j+1} = 1 - d^{g}_{j}, \quad d^{b}_{k+1} = 1 - d^{b}_{k} \qquad (2)$$
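As a concrete illustration of Equations 1 and 2, the sketch below maps a batch of RGB values to their lattice base coordinates and fractional distances. It is a minimal PyTorch sketch under the stated definitions (N lattice points per dimension, maximum color value C_max); the function name and the clamp that keeps the upper neighbour inside the lattice are our own additions.

```python
import torch

def lattice_coords_and_weights(rgb: torch.Tensor, lut_size: int, c_max: float = 1.0):
    """Equations 1-2: lattice base coordinates and fractional distances.

    rgb:      (..., 3) tensor of input colors in [0, c_max]
    lut_size: N, the number of lattice points per color dimension
    """
    delta = c_max / (lut_size - 1)                       # Δ = C_max / (N - 1)
    cont = rgb / delta                                   # (i', j', k')
    base = torch.clamp(cont.floor(), max=lut_size - 2)   # (i, j, k); keeps i + 1 in range
    frac = cont - base                                   # (d^r_i, d^g_j, d^b_k) of Eq. 2
    return base.long(), frac
```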
Combining Equation 2 (image-level scenario adaptation) and Equation 3 (pixel-wise category fusion) of our main paper, the interpolated output $\{Y^{h,w,c} \mid c \in \{r, g, b\}\}$ at location $(h, w)$ can be obtained as follows. Owing to the spatial-aware attribute of the pixel-wise category weight map $\alpha^{h,w}_{m}$, the interpolation is referred to as spatial-aware trilinear interpolation:

$$Y^{h,w,c} = \sum_{t=0}^{T-1} \omega_t \left( \sum_{ii=i}^{i+1} \sum_{jj=j}^{j+1} \sum_{kk=k}^{k+1} d^{r}_{ii}\, d^{g}_{jj}\, d^{b}_{kk}\, O^{h,w,c}_{(ii,jj,kk)} \right) = \sum_{t=0}^{T-1} \sum_{ii=i}^{i+1} \sum_{jj=j}^{j+1} \sum_{kk=k}^{k+1} \omega_t\, d^{r}_{ii}\, d^{g}_{jj}\, d^{b}_{kk} \sum_{m=0}^{M-1} \alpha^{h,w}_{m} O^{m,c}_{(ii,jj,kk)} \qquad (3)$$
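Putting Equations 1-3 together, the following sketch performs spatial-aware trilinear interpolation for one image. The tensor layout (T scene-specific sets of M basis LUTs), the reading of the ω_t sum as a blend of those T sets, and the use of einsum and Python loops over the eight neighbours are our assumptions made for clarity; the paper's real-time implementation is not reproduced here.

```python
import torch

def spatial_aware_trilinear(img, luts, scene_w, alpha, c_max: float = 1.0):
    """Illustrative sketch of Eq. 3 (not the paper's optimized implementation).

    img:     (H, W, 3) input image with values in [0, c_max]
    luts:    (T, M, N, N, N, 3) basis 3D LUTs (T scenes, M pixel categories)
    scene_w: (T,) image-level scene weights ω_t
    alpha:   (M, H, W) pixel-wise category weight maps α^{h,w}_m
    Returns the enhanced image of shape (H, W, 3).
    """
    N = luts.shape[2]
    # Fold the image-level scene weights into one set of M spatially shared LUTs.
    lut = torch.einsum('t,tmabcd->mabcd', scene_w, luts)      # (M, N, N, N, 3)

    # Equations 1-2: lattice coordinates and fractional distances.
    delta = c_max / (N - 1)
    cont = img / delta
    base = torch.clamp(cont.floor(), max=N - 2).long()
    frac = cont - base
    i, j, k = base.unbind(-1)                                  # each (H, W)

    out = torch.zeros_like(img)
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                # d^r_{ii} d^g_{jj} d^b_{kk}, following the convention of Eq. 2
                w = (frac[..., 0] if di == 0 else 1 - frac[..., 0]) \
                  * (frac[..., 1] if dj == 0 else 1 - frac[..., 1]) \
                  * (frac[..., 2] if dk == 0 else 1 - frac[..., 2])       # (H, W)
                corner = lut[:, i + di, j + dj, k + dk, :]                # (M, H, W, 3)
                fused = torch.einsum('mhw,mhwc->hwc', alpha, corner)      # Σ_m α_m O^m
                out = out + w.unsqueeze(-1) * fused
    return out
```

A practical implementation would fuse the eight-neighbour loop into a single gather and run it as a custom kernel; the loop form above only mirrors the summation structure of Equation 3.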
[Figure 1 diagram: a UNet-style encoder-decoder (conv k=3, s=1/s=2, Instance Norm, Leaky ReLU, average pooling, resize, skip concatenation, tile) that outputs an M-channel pixel-wise category map, together with a global-feature branch that outputs the T scene weights ω_0, ..., ω_{T−1}; both feed the spatial-aware trilinear interpolation over the LUT parameters θ_0, ..., θ_{M−1}.]

Figure 1: Network architecture of the self-adaptive two-head weight predictor.
2. Loss Function

We train our network on a pair-wise dataset $D = \{(X_s, Y_s) \mid s \in \mathbb{I}_0^{S-1}\}$ in a supervised manner, where $X_s$ is an input image and $Y_s$ is the corresponding target image. $S$ is the number of image pairs and $s \in \mathbb{I}_0^{S-1}$ is short for $s = 0, 1, \ldots, S-1$. The Color Difference Loss $L_c$ and the Perception Loss $L_p$ are introduced in more detail below.
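As a minimal illustration of how such a pair-wise dataset can be organized for supervised training, the sketch below loads (X_s, Y_s) pairs by index. The file lists and the normalization to [0, 1] are hypothetical; the paper does not specify its data-loading pipeline here.

```python
import torch
from torch.utils.data import Dataset
from torchvision.io import read_image

class PairedEnhancementDataset(Dataset):
    """Minimal sketch of the pair-wise dataset D = {(X_s, Y_s)}.

    `input_paths` / `target_paths` are hypothetical lists of image file paths.
    """
    def __init__(self, input_paths, target_paths):
        assert len(input_paths) == len(target_paths)   # S pairs
        self.input_paths = input_paths
        self.target_paths = target_paths

    def __len__(self):
        return len(self.input_paths)                   # S

    def __getitem__(self, s):
        x = read_image(self.input_paths[s]).float() / 255.0   # X_s
        y = read_image(self.target_paths[s]).float() / 255.0  # Y_s
        return x, y
```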
Color Difference Loss. To measure the color distance and encourage the color in the enhanced image to match that in the corresponding learning target, we use CIE94 in the LAB color space as our color loss. Let $(L, a, b)$ and $(\hat{L}, \hat{a}, \hat{b})$ denote the predicted and target images in LAB color space, respectively. Then $L_c$ can be defined as in Equation 4; a detailed description can be found in [5].

$$\begin{aligned}
C_1,\ C_2 &= \sqrt{a^2 + b^2 + \varepsilon},\ \sqrt{\hat{a}^2 + \hat{b}^2 + \varepsilon}\\
S_C,\ S_H &= 1 + 0.0225\,(C_1 + C_2),\ 1 + 0.0075\,(C_1 + C_2)\\
\Delta a,\ \Delta b &= a - \hat{a},\ b - \hat{b}\\
\Delta C,\ \Delta L &= C_1 - C_2,\ L - \hat{L}\\
\Delta H &= \sqrt{\Delta a^2 + \Delta b^2 - \Delta C^2 + \varepsilon}\\
L_c &= \sqrt{\Delta L^2 + \left(\frac{\Delta C}{S_C}\right)^2 + \left(\frac{\Delta H}{S_H}\right)^2 + \varepsilon}
\end{aligned} \qquad (4)$$

where $\varepsilon$ is a small constant for numerical stability.
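The color difference loss of Equation 4 can be computed directly from LAB values, as in the sketch below. Averaging over all pixels and the clamp inside the ΔH square root are our additions; the RGB-to-LAB conversion is assumed to happen elsewhere.

```python
import torch

def cie94_color_loss(pred_lab: torch.Tensor, target_lab: torch.Tensor,
                     eps: float = 1e-6) -> torch.Tensor:
    """Sketch of the CIE94-based color difference loss Lc (Eq. 4).

    pred_lab, target_lab: (..., 3) tensors holding (L, a, b) values.
    """
    L1, a1, b1 = pred_lab.unbind(-1)
    L2, a2, b2 = target_lab.unbind(-1)
    C1 = torch.sqrt(a1 ** 2 + b1 ** 2 + eps)
    C2 = torch.sqrt(a2 ** 2 + b2 ** 2 + eps)
    sC = 1 + 0.0225 * (C1 + C2)
    sH = 1 + 0.0075 * (C1 + C2)
    dA, dB = a1 - a2, b1 - b2
    dC, dL = C1 - C2, L1 - L2
    # Clamp keeps the argument non-negative before the square root (our addition).
    dH = torch.sqrt(torch.clamp(dA ** 2 + dB ** 2 - dC ** 2, min=0.0) + eps)
    loss = torch.sqrt(dL ** 2 + (dC / sC) ** 2 + (dH / sH) ** 2 + eps)
    return loss.mean()   # averaged over all pixels
```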
Perception Loss. To improve the perceptual quality of the enhanced image, the widely used LPIPS loss [9] is chosen. It is defined as weighted norms of the L2-distance between features of the ground-truth image and the enhanced image extracted by a pretrained AlexNet:

$$L_p = \sum_{l} \frac{1}{H^l W^l} \sum_{h=1,\,w=1}^{H^l,\,W^l} \beta_l \left\lVert \hat{y}^{\,l}_{h,w} - y^{l}_{h,w} \right\rVert_2^2 \qquad (5)$$

where $l$ indexes the layers chosen to calculate the LPIPS loss, $\beta_l$ is the weight for layer $l$, and $y^l$ and $\hat{y}^l$ are the corresponding ground-truth and enhanced features. By default, we choose the outputs of the first five ReLU layers of AlexNet, and all $\beta_l$ are set to 1.
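The sketch below illustrates the structure of Equation 5 with a plain AlexNet feature loss. It is a simplified stand-in, not the LPIPS package [9] used in the paper: it omits LPIPS' input normalization and learned per-channel weights, keeps β_l = 1, and averages over channels as well as spatial positions.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

class PerceptualLoss(nn.Module):
    """Simplified sketch of the AlexNet-feature perceptual loss Lp (Eq. 5)."""

    def __init__(self):
        super().__init__()
        # "DEFAULT" weights require torchvision >= 0.13; adapt for older versions.
        self.features = alexnet(weights="DEFAULT").features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, enhanced: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        loss, x, y = 0.0, enhanced, target
        relus_seen = 0
        for layer in self.features:
            x, y = layer(x), layer(y)
            if isinstance(layer, nn.ReLU):
                # (1 / H^l W^l) Σ_{h,w} β_l ||ŷ^l - y^l||², with β_l = 1 and an
                # additional mean over channels and batch for simplicity.
                loss = loss + (x - y).pow(2).mean()
                relus_seen += 1
                if relus_seen == 5:   # first five ReLU layers of AlexNet
                    break
        return loss
```

In practice the released lpips package can be used directly; this sketch only makes the weighted feature-distance structure of Equation 5 explicit.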
3. Additional Analysis

In this section, we compare our model with several SOTA methods on two datasets at different resolutions. The visualizations show that our model outperforms the other methods on both the 480p datasets and the high-resolution datasets.
3.1. Comparison on 480p MIT-Adobe FiveK Dataset (released by [8])

We directly use the released dataset without any changes. To be clear, it contains 4500 training pairs and 498 testing pairs. More visual comparisons can be found in Figure 2 and Figure 3. Each figure shows results from eight methods and their corresponding error maps.
3.2. Comparison on Full Resolution MIT-Adobe FiveK Dataset (Ours)

In this subsection, we train and test the proposed model on our constructed full-resolution MIT-Adobe FiveK dataset. More visual comparisons can be found in Figure 4 and Figure 5.
3.3. Comparison on 480p HDR+ Burst Photography Dataset (released by [8])

We test performance on the open-source HDR+ burst photography dataset released by [8]. More visual comparisons can be found in Figure 6 and Figure 7.
3.4. Comparison on 480p HDR+ Burst Photography Dataset (Ours)

We also test performance on our constructed 480p HDR+ burst photography dataset, with the post-processing mentioned in our paper. More visual comparisons can be found in Figure 8, Figure 9 and Figure 10.
3.5. Comparison on Full Resolution HDR+ Burst Photography Dataset (Ours)

We then test our algorithm on our constructed full-resolution HDR+ burst photography dataset. The dataset is post-processed in generally the same way as described in Section 3.4; the only difference is that all image pairs are kept at their original size and no resizing is applied. The results show that our model works well on high-resolution images. More visual comparisons can be found in Figure 11, Figure 12 and Figure 13.
3.6. Failure Cases
We further show some test cases where our model fails to produce visually pleasant enhanced images. The results show that our model may sometimes suffer from artifacts. In this section, we directly use the full-resolution HDR+ burst photography dataset mentioned in Section 3.5. More visual comparisons can be found in Figure 14 and Figure 15.
References

[1] Michael Gharbi, Jiawen Chen, Jonathan T. Barron, Samuel W. Hasinoff, and Fredo Durand. Deep bilateral learning for real-time image enhancement. ACM Transactions on Graphics (TOG), 36(4):1–12, 2017.

[2] Jie Huang, Zhiwei Xiong, Xueyang Fu, Dong Liu, and Zheng-Jun Zha. Hybrid image enhancement with progressive laplacian enhancing unit. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1614–1622, 2019.

[3] Jie Huang, Pengfei Zhu, Mingrui Geng, Jiewen Ran, Xingguang Zhou, Chen Xing, Pengfei Wan, and Xiangyang Ji. Range scaling global u-net for perceptual image enhancement on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[4] Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey, and Luc Van Gool. DSLR-quality photos on mobile devices with deep convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3277–3285, 2017.

[5] Bruce Justin Lindbloom. Delta E (CIE 1994), 2017 (accessed November 10, 2020). http://www.brucelindbloom.com/index.html?Eqn_DeltaE_CIE94.html.

[6] Sean Moran, Pierre Marza, Steven McDonagh, Sarah Parisot, and Gregory Slabaugh. DeepLPF: Deep local parametric filters for image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12826–12835, 2020.

[7] Ruixing Wang, Qing Zhang, Chi-Wing Fu, Xiaoyong Shen, Wei-Shi Zheng, and Jiaya Jia. Underexposed photo enhancement using deep illumination estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6849–6857, 2019.

[8] Hui Zeng, Jianrui Cai, Lida Li, Zisheng Cao, and Lei Zhang. Learning image-adaptive 3D lookup tables for high performance photo enhancement in real-time. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[9] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
(a) Input (b) RSGUnet [3] (c) DPED [4] (d) HPEU [2] (e) DeepLPF [6] (f) UPE [7] (g) HDRNet [1] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 2: Results comparison on ‘a4050’ of the 480p MIT-Adobe FiveK dataset, and corresponding error maps. For each pair of results, the upper image is an enhanced image, and the image below is an error map between the enhanced image and the ground-truth.
(a) Input (b) RSGUnet [3] (c) DPED [4] (d) HPEU [2] (e) DeepLPF [6] (f) UPE [7] (g) HDRNet [1] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 3: Results comparison on ‘a4816’ of the 480p MIT-Adobe FiveK dataset, and corresponding error maps.
(a) Input (b) RSGUnet [3] (c) HPEU [2] (d) UPE [7]
(e) HDRNet [1] (f) 3DLUT [8] (g) Ours (h) Ground-truth
Figure 4: Results comparison on ‘a4163’ of the full-resolution MIT-Adobe FiveK dataset, and corresponding error maps. For each pair of results, the upper image is an enhanced image, and the image below is an error map between the enhanced image and the ground-truth.
(a) Input (b) RSGUnet [3] (c) HPEU [2] (d) UPE [7]
(e) HDRNet [1] (f) 3DLUT [8] (g) Ours (h) Ground-truth
Figure 5: Results comparison on ‘a1544’ of full-resolution MIT-Adobe FiveK dataset, and corresponding error maps.
(a) Input (b) DeepLPF [6] (c) DPED [4] (d) HPEU [2] (e) RSGUnet [3] (f) HDRNet [1] (g) UPE [7] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 6: Results comparison on ‘1671’ of the HDR+ burst photography dataset released by [8], and corresponding error maps. We can see that our method gets less error and fewer artifacts in the sky than the other methods.
(a) Input (b) DeepLPF [6] (c) DPED [4] (d) HPEU [2] (e) RSGUnet [3] (f) HDRNet [1] (g) UPE [7] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 7: Results comparison on ‘1517’ of the HDR+ burst photography dataset released by [8], and corresponding error maps. We can see that our method gets less error in both the foreground and the background areas than the competing approaches.
(a) Input (b) DeepLPF [6] (c) DPED [4] (d) HPEU [2] (e) RSGUnet [3] (f) HDRNet [1] (g) UPE [7] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 8: Results comparison on ‘0043 20161006 162052 490’ of the HDR+ burst photography dataset (Ours), and corresponding error maps. We can see that our method gets less error and fewer artifacts in the sky than the other methods.
(a) Input (b) DeepLPF [6] (c) DPED [4] (d) HPEU [2] (e) RSGUnet [3] (f) HDRNet [1] (g) UPE [7] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 9: Results comparison on ‘0006 20160726 110609 666’ of the HDR+ burst photography dataset (Ours), and corresponding error maps. We can see that our method gets less error in both the foreground and the background areas than the competing approaches.
(a) Input (b) DeepLPF [6] (c) DPED [4] (d) HPEU [2] (e) RSGUnet [3] (f) HDRNet [1] (g) UPE [7] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 10: Results comparison on ‘5a9e 20150403 162152 482’ of the HDR+ burst photography dataset (Ours), and corresponding error maps. HPEU, HDRNet, UPE and 3DLUT get large errors in both the building and sky areas. RSGUNet performs better on the sky area, though the right side of the sky has a larger error. Artifacts occur in the sky area of the DPED result, and some building details are lost. DeepLPF shows quality comparable to our method, but it is not a 4K real-time algorithm.
(a) RSGUnet [3] (b) DPED [4] (c) HPEU [2] (d) UPE [7]
(e) HDRNet [1] (f) 3DLUT [8] (g) Ours (h) Ground-truth
Figure 11: Results comparison on ‘0382 20150924 100900 333’ of the full-resolution HDR+ burst photography dataset, and corresponding error maps. Our result is the closest to the ground truth in color and perception. DPED and HPEU show visible banding artifacts in the sky, while RSGUNet, DeepUPE, HDRNet and the original 3DLUT suffer from color bias in both the sky and the grass. Our spatial-aware 3DLUT has the smallest error with respect to the ground truth and the most pleasant local contrast.
(a) RSGUnet [3] (b) DPED [4] (c) HPEU [2] (d) UPE [7]
(e) HDRNet [1] (f) 3DLUT [8] (g) Ours (h) Ground-truth
Figure 12: Results comparison on ‘0919 20150910 150832 572’ of the full-resolution HDR+ burst photography dataset, and corresponding error maps.
(a) RSGUnet [3] (b) DPED [4] (c) HPEU [2] (d) UPE [7]
(e) HDRNet [1] (f) 3DLUT [8] (g) Ours (h) Ground-truth
Figure 13: Results comparison on ‘1125 20151229 192447 145’ of the full-resolution HDR+ burst photography dataset, and corresponding error maps.
(a) Input (b) RSGUnet [3] (c) DPED [4]
(d) HPEU [2] (e) UPE [7] (f) HDRNet [1]
(g) 3DLUT [8] (h) Ours (i) Ground-truth
Figure 14: One failure case on ‘5a9e 20141005 162240 437’ of the full-resolution HDR+ burst photography dataset. Although our model is the closest to the ground truth in most areas, halo artifacts are sometimes visible in our results where the color changes sharply (e.g., around trunks). We believe this is caused by the unsmooth pixel-wise category weights generated by our two-head weight predictor (i.e., the category weights change non-smoothly across space). Such a drawback can potentially be addressed by introducing the additional smoothness loss described in our paper.
(a) Input (b) RSGUnet [3] (c) DPED [4]
(d) HPEU [2] (e) UPE [7] (f) HDRNet [1]
(g) 3DLUT [8] (h) Ours (i) Ground-truth
Figure 15: One failure case on ‘0006 20160722 101752 239’ of the full-resolution HDR+ burst photography dataset. Our model shows the best local contrast in most cases; however, all methods suffer from banding artifacts in extremely dark regions where the signal-to-noise ratio is poor. Since our inputs are 8-bit, very little useful information is carried in extremely dark areas and noise dominates the signal. Possible solutions include cooperating with a denoising preprocessing step and replacing the low-precision 8-bit inputs with higher-precision 16-bit images.