Real-time Image Enhancer via Learnable Spatial-aware 3D Lookup Tables: Supplementary Material
Tao Wang*, Yong Li*, Jingyang Peng*, Yipeng Ma, Xian Wang, Fenglong Song†, Youliang Yan†
Huawei Noah's Ark Lab
{wangtao10,liyong156,pengjingyang,mayipeng,wangxian10,songfenglong,yanyouliang}@huawei.com
*Authors contributed equally    †Corresponding author
Contents

The structure of this supplementary material is as follows. Section 1 introduces the structure of our network and the spatial-aware trilinear interpolation. Section 2 gives more details on the definition of the loss functions. More comparison results and additional analysis are presented in Section 3.
1. Methodology

Self-adaptive two-head weight predictor. Our self-adaptive two-head weight predictor adopts a UNet-style structure, consisting of an encoder for feature compression and a decoder for pixel-wise category prediction. Skip connections concatenate output features from encoder layers to the corresponding decoder layers; the detailed structure is illustrated in Figure 1. Additionally, we introduce another predictor for image-level scene categorization, whose input is the compressed global feature from the encoder. Its architecture is listed in Table 1, where a feature size of C × H × W denotes that the corresponding layer produces features with C channels of spatial size H × W, and T is the number of image-level scenes.
Layer                 Feature Size
Global Feature        256 × 1 × 1
FC with LeakyReLU     128 × 1 × 1
Instance Norm         128 × 1 × 1
FC with LeakyReLU     64 × 1 × 1
Instance Norm         64 × 1 × 1
FC with LeakyReLU     T × 1 × 1

Table 1: Network architecture of the scene category predictor in the self-adaptive two-head weight predictor.
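For concreteness, the scene-category head in Table 1 can be read as a small fully connected stack applied to the 256-dimensional global feature. The PyTorch sketch below is only an illustration of that reading: the class name, the vector-wise interpretation of Instance Norm on a 1 × 1 feature map, and the absence of any output normalization are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _instance_norm_vec(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Normalize each sample's feature vector to zero mean / unit variance.
    # How "Instance Norm" acts on a 1 x 1 feature map is ambiguous in Table 1;
    # normalizing across channels is one plausible reading, not the paper's code.
    return (x - x.mean(dim=1, keepdim=True)) / (x.std(dim=1, keepdim=True) + eps)

class SceneCategoryHead(nn.Module):
    """Illustrative sketch (hypothetical name) of the predictor in Table 1."""

    def __init__(self, num_scenes: int):
        super().__init__()
        self.fc1 = nn.Linear(256, 128)        # FC with LeakyReLU, 256 -> 128
        self.fc2 = nn.Linear(128, 64)         # FC with LeakyReLU, 128 -> 64
        self.fc3 = nn.Linear(64, num_scenes)  # FC with LeakyReLU, 64 -> T

    def forward(self, global_feature: torch.Tensor) -> torch.Tensor:
        # global_feature: (B, 256), pooled from the encoder.
        x = _instance_norm_vec(F.leaky_relu(self.fc1(global_feature)))
        x = _instance_norm_vec(F.leaky_relu(self.fc2(x)))
        return F.leaky_relu(self.fc3(x))      # (B, T) image-level scene scores
```

Note that a standard nn.InstanceNorm2d applied to a 1 × 1 spatial map would degenerate to zeros, which is why the sketch normalizes each sample's feature vector across channels instead.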
Spatial-aware trilinear interpolation. The core idea of a 3D LUT is to retouch input images according to a set of compressed parameters (i.e., the LUTs), which means that one element of the 3D lattice may correspond to multiple neighbouring input values. An additional interpolation operation is therefore needed to improve the smoothness of the enhanced results. Considering both efficiency and performance, trilinear-based interpolation is used in our method.
Let $X_{h,w} = \{X_{h,w,r}, X_{h,w,g}, X_{h,w,b}\}$ be a pixel of the input image at location $(h, w)$. Whatever RGB values it has, there are eight nearest neighbours when $X_{h,w}$ is mapped into a 3D LUT. The minimum coordinate $(i, j, k)$ of these eight neighbours in the 3D LUT is obtained through a lookup operation on the pixel's RGB values as follows:

$$i' = \frac{X_{h,w,r}}{\Delta}, \quad j' = \frac{X_{h,w,g}}{\Delta}, \quad k' = \frac{X_{h,w,b}}{\Delta}, \qquad i = \lfloor i' \rfloor, \quad j = \lfloor j' \rfloor, \quad k = \lfloor k' \rfloor \qquad (1)$$
where $\Delta = C_{\max}/(N-1)$, $C_{\max}$ is the maximum color value, and $\lfloor\cdot\rfloor$ is the floor function. The distances between the exact coordinate and the minimum neighbour coordinate are defined as $d^{r}_{i}$, $d^{g}_{j}$, $d^{b}_{k}$:

$$d^{r}_{i} = i' - i, \quad d^{g}_{j} = j' - j, \quad d^{b}_{k} = k' - k, \qquad d^{r}_{i+1} = 1 - d^{r}_{i}, \quad d^{g}_{j+1} = 1 - d^{g}_{j}, \quad d^{b}_{k+1} = 1 - d^{b}_{k} \qquad (2)$$
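As a concrete illustration of Equations 1 and 2, the sketch below maps a batch of RGB values to their lattice base coordinates and fractional distances. It is a minimal PyTorch sketch under the stated definitions (N lattice points per dimension, maximum color value C_max); the function name and the clamp that keeps the upper neighbour inside the lattice are our own additions.

```python
import torch

def lattice_coords_and_weights(rgb: torch.Tensor, lut_size: int, c_max: float = 1.0):
    """Equations 1-2: lattice base coordinates and fractional distances.

    rgb:      (..., 3) tensor of input colors in [0, c_max]
    lut_size: N, the number of lattice points per color dimension
    """
    delta = c_max / (lut_size - 1)                       # Δ = C_max / (N - 1)
    cont = rgb / delta                                   # (i', j', k')
    base = torch.clamp(cont.floor(), max=lut_size - 2)   # (i, j, k); keeps i + 1 in range
    frac = cont - base                                   # (d^r_i, d^g_j, d^b_k) of Eq. 2
    return base.long(), frac
```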
Combining Equation 2 (image-level scenario adaptation) and Equation 3 (pixel-wise category fusion) of our main paper, the interpolated output $\{Y^{h,w,c} \mid c \in \{r, g, b\}\}$ at location $(h, w)$ can be obtained as follows. Owing to the spatial-aware attribute of the pixel-wise category weight map $\alpha^{h,w}_{m}$, the interpolation is referred to as spatial-aware trilinear interpolation:

$$Y^{h,w,c} = \sum_{t=0}^{T-1} \omega_t \left( \sum_{ii=i}^{i+1} \sum_{jj=j}^{j+1} \sum_{kk=k}^{k+1} d^{r}_{ii}\, d^{g}_{jj}\, d^{b}_{kk}\, O^{h,w,c}_{(ii,jj,kk)} \right) = \sum_{t=0}^{T-1} \sum_{ii=i}^{i+1} \sum_{jj=j}^{j+1} \sum_{kk=k}^{k+1} \omega_t\, d^{r}_{ii}\, d^{g}_{jj}\, d^{b}_{kk} \sum_{m=0}^{M-1} \alpha^{h,w}_{m} O^{m,c}_{(ii,jj,kk)} \qquad (3)$$
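Putting Equations 1-3 together, the following sketch performs spatial-aware trilinear interpolation for one image. The tensor layout (T scene-specific sets of M basis LUTs), the reading of the ω_t sum as a blend of those T sets, and the use of einsum and Python loops over the eight neighbours are our assumptions made for clarity; the paper's real-time implementation is not reproduced here.

```python
import torch

def spatial_aware_trilinear(img, luts, scene_w, alpha, c_max: float = 1.0):
    """Illustrative sketch of Eq. 3 (not the paper's optimized implementation).

    img:     (H, W, 3) input image with values in [0, c_max]
    luts:    (T, M, N, N, N, 3) basis 3D LUTs (T scenes, M pixel categories)
    scene_w: (T,) image-level scene weights ω_t
    alpha:   (M, H, W) pixel-wise category weight maps α^{h,w}_m
    Returns the enhanced image of shape (H, W, 3).
    """
    N = luts.shape[2]
    # Fold the image-level scene weights into one set of M spatially shared LUTs.
    lut = torch.einsum('t,tmabcd->mabcd', scene_w, luts)      # (M, N, N, N, 3)

    # Equations 1-2: lattice coordinates and fractional distances.
    delta = c_max / (N - 1)
    cont = img / delta
    base = torch.clamp(cont.floor(), max=N - 2).long()
    frac = cont - base
    i, j, k = base.unbind(-1)                                  # each (H, W)

    out = torch.zeros_like(img)
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                # d^r_{ii} d^g_{jj} d^b_{kk}, following the convention of Eq. 2
                w = (frac[..., 0] if di == 0 else 1 - frac[..., 0]) \
                  * (frac[..., 1] if dj == 0 else 1 - frac[..., 1]) \
                  * (frac[..., 2] if dk == 0 else 1 - frac[..., 2])       # (H, W)
                corner = lut[:, i + di, j + dj, k + dk, :]                # (M, H, W, 3)
                fused = torch.einsum('mhw,mhwc->hwc', alpha, corner)      # Σ_m α_m O^m
                out = out + w.unsqueeze(-1) * fused
    return out
```

A practical implementation would fuse the eight-neighbour loop into a single gather and run it as a custom kernel; the loop form above only mirrors the summation structure of Equation 3.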
[Figure 1 diagram: a UNet-style encoder-decoder (conv k=3, s=1/s=2, Instance Norm, Leaky ReLU, average pooling, resize, skip concatenation, tile) that outputs an M-channel pixel-wise category map, together with a global-feature branch that outputs the T scene weights ω_0, ..., ω_{T−1}; both feed the spatial-aware trilinear interpolation over the LUT parameters θ_0, ..., θ_{M−1}.]

Figure 1: Network architecture of the self-adaptive two-head weight predictor.
2. Loss Function

We train our network on a pair-wise dataset $D = \{(X_s, Y_s) \mid s \in \mathbb{I}_0^{S-1}\}$ in a supervised manner, where $X_s$ is an input image and $Y_s$ is the corresponding target image. $S$ is the number of image pairs and $s \in \mathbb{I}_0^{S-1}$ is short for $s = 0, 1, \ldots, S-1$. The Color Difference Loss $L_c$ and the Perception Loss $L_p$ are introduced in more detail below.
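As a minimal illustration of how such a pair-wise dataset can be organized for supervised training, the sketch below loads (X_s, Y_s) pairs by index. The file lists and the normalization to [0, 1] are hypothetical; the paper does not specify its data-loading pipeline here.

```python
import torch
from torch.utils.data import Dataset
from torchvision.io import read_image

class PairedEnhancementDataset(Dataset):
    """Minimal sketch of the pair-wise dataset D = {(X_s, Y_s)}.

    `input_paths` / `target_paths` are hypothetical lists of image file paths.
    """
    def __init__(self, input_paths, target_paths):
        assert len(input_paths) == len(target_paths)   # S pairs
        self.input_paths = input_paths
        self.target_paths = target_paths

    def __len__(self):
        return len(self.input_paths)                   # S

    def __getitem__(self, s):
        x = read_image(self.input_paths[s]).float() / 255.0   # X_s
        y = read_image(self.target_paths[s]).float() / 255.0  # Y_s
        return x, y
```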
Color Difference Loss. To measure the color distance and encourage the color in the enhanced image to match that in the corresponding learning target, we use CIE94 in the LAB color space as our color loss. Let $(L, a, b)$ and $(\hat{L}, \hat{a}, \hat{b})$ denote the predicted and target images in LAB color space, respectively. Then $L_c$ can be defined as in Equation 4; a detailed description can be found in [5].

$$\begin{aligned}
C_1,\ C_2 &= \sqrt{a^2 + b^2 + \varepsilon},\ \sqrt{\hat{a}^2 + \hat{b}^2 + \varepsilon}\\
S_C,\ S_H &= 1 + 0.0225\,(C_1 + C_2),\ 1 + 0.0075\,(C_1 + C_2)\\
\Delta a,\ \Delta b &= a - \hat{a},\ b - \hat{b}\\
\Delta C,\ \Delta L &= C_1 - C_2,\ L - \hat{L}\\
\Delta H &= \sqrt{\Delta a^2 + \Delta b^2 - \Delta C^2 + \varepsilon}\\
L_c &= \sqrt{\Delta L^2 + \left(\frac{\Delta C}{S_C}\right)^2 + \left(\frac{\Delta H}{S_H}\right)^2 + \varepsilon}
\end{aligned} \qquad (4)$$

where $\varepsilon$ is a small constant for numerical stability.
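The color difference loss of Equation 4 can be computed directly from LAB values, as in the sketch below. Averaging over all pixels and the clamp inside the ΔH square root are our additions; the RGB-to-LAB conversion is assumed to happen elsewhere.

```python
import torch

def cie94_color_loss(pred_lab: torch.Tensor, target_lab: torch.Tensor,
                     eps: float = 1e-6) -> torch.Tensor:
    """Sketch of the CIE94-based color difference loss Lc (Eq. 4).

    pred_lab, target_lab: (..., 3) tensors holding (L, a, b) values.
    """
    L1, a1, b1 = pred_lab.unbind(-1)
    L2, a2, b2 = target_lab.unbind(-1)
    C1 = torch.sqrt(a1 ** 2 + b1 ** 2 + eps)
    C2 = torch.sqrt(a2 ** 2 + b2 ** 2 + eps)
    sC = 1 + 0.0225 * (C1 + C2)
    sH = 1 + 0.0075 * (C1 + C2)
    dA, dB = a1 - a2, b1 - b2
    dC, dL = C1 - C2, L1 - L2
    # Clamp keeps the argument non-negative before the square root (our addition).
    dH = torch.sqrt(torch.clamp(dA ** 2 + dB ** 2 - dC ** 2, min=0.0) + eps)
    loss = torch.sqrt(dL ** 2 + (dC / sC) ** 2 + (dH / sH) ** 2 + eps)
    return loss.mean()   # averaged over all pixels
```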
Perception Loss. To improve the perceptual quality of the enhanced image, the widely used LPIPS loss [9] is chosen. It is defined as weighted norms of the L2-distance between features of the ground-truth image and the enhanced image extracted by a pretrained AlexNet:

$$L_p = \sum_{l} \frac{1}{H^l W^l} \sum_{h=1,\,w=1}^{H^l,\,W^l} \beta_l \left\lVert \hat{y}^{\,l}_{h,w} - y^{l}_{h,w} \right\rVert_2^2 \qquad (5)$$

where $l$ indexes the layers chosen to calculate the LPIPS loss, $\beta_l$ is the weight for layer $l$, and $y^l$ and $\hat{y}^l$ are the corresponding ground-truth and enhanced features. By default, we choose the outputs of the first five ReLU layers of AlexNet, and all $\beta_l$ are set to 1.
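The sketch below illustrates the structure of Equation 5 with a plain AlexNet feature loss. It is a simplified stand-in, not the LPIPS package [9] used in the paper: it omits LPIPS' input normalization and learned per-channel weights, keeps β_l = 1, and averages over channels as well as spatial positions.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

class PerceptualLoss(nn.Module):
    """Simplified sketch of the AlexNet-feature perceptual loss Lp (Eq. 5)."""

    def __init__(self):
        super().__init__()
        # "DEFAULT" weights require torchvision >= 0.13; adapt for older versions.
        self.features = alexnet(weights="DEFAULT").features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, enhanced: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        loss, x, y = 0.0, enhanced, target
        relus_seen = 0
        for layer in self.features:
            x, y = layer(x), layer(y)
            if isinstance(layer, nn.ReLU):
                # (1 / H^l W^l) Σ_{h,w} β_l ||ŷ^l - y^l||², with β_l = 1 and an
                # additional mean over channels and batch for simplicity.
                loss = loss + (x - y).pow(2).mean()
                relus_seen += 1
                if relus_seen == 5:   # first five ReLU layers of AlexNet
                    break
        return loss
```

In practice the released lpips package can be used directly; this sketch only makes the weighted feature-distance structure of Equation 5 explicit.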
3. Additional Analysis

In this section, we compare our model with several SOTA methods on two datasets at different resolutions. The visualizations show that our model outperforms the other methods on both the 480p datasets and the high-resolution datasets.
3.1. Comparison on 480p MIT-Adobe FiveK Dataset (released by [8])

We directly use the released dataset without any changes. To be clear, it contains 4500 training pairs and 498 testing pairs. More visual comparisons can be found in Figure 2 and Figure 3. Each figure shows results from eight methods and their corresponding error maps.
3.2. Comparison on Full Resolution MIT-Adobe FiveK Dataset (Ours)

In this subsection, we train and test the proposed model on our constructed full-resolution MIT-Adobe FiveK dataset. More visual comparisons can be found in Figure 4 and Figure 5.
3.3. Comparison on 480p HDR+ Burst Photography Dataset (released by [8])

We test performance on the open-source HDR+ burst photography dataset released by [8]. More visual comparisons can be found in Figure 6 and Figure 7.
3.4. Comparison on 480p HDR+ Burst Photography Dataset (Ours)

We also test performance on our constructed 480p HDR+ burst photography dataset, with the post-processing mentioned in our paper. More visual comparisons can be found in Figure 8, Figure 9 and Figure 10.
3.5. Comparison on Full Resolution HDR+ Burst Photography Dataset (Ours)

We then test our algorithm on our constructed full-resolution HDR+ burst photography dataset. The dataset is post-processed in generally the same way as described in Section 3.4; the only difference is that all image pairs are kept at their original size and no resizing is applied. The results show that our model works well on high-resolution images. More visual comparisons can be found in Figure 11, Figure 12 and Figure 13.
3.6. Failure Cases
We further show some test cases where our model fails to produce visually pleasant enhanced images. The results show that our model may sometimes suffer from artifacts. In this section, we directly use the full-resolution HDR+ burst photography dataset mentioned in Section 3.5. More visual comparisons can be found in Figure 14 and Figure 15.
References

[1] Michael Gharbi, Jiawen Chen, Jonathan T. Barron, Samuel W. Hasinoff, and Fredo Durand. Deep bilateral learning for real-time image enhancement. ACM Transactions on Graphics (TOG), 36(4):1–12, 2017.

[2] Jie Huang, Zhiwei Xiong, Xueyang Fu, Dong Liu, and Zheng-Jun Zha. Hybrid image enhancement with progressive laplacian enhancing unit. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1614–1622, 2019.

[3] Jie Huang, Pengfei Zhu, Mingrui Geng, Jiewen Ran, Xingguang Zhou, Chen Xing, Pengfei Wan, and Xiangyang Ji. Range scaling global u-net for perceptual image enhancement on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[4] Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey, and Luc Van Gool. DSLR-quality photos on mobile devices with deep convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3277–3285, 2017.

[5] Bruce Justin Lindbloom. Delta E (CIE 1994), 2017 (accessed November 10, 2020). http://www.brucelindbloom.com/index.html?Eqn_DeltaE_CIE94.html.

[6] Sean Moran, Pierre Marza, Steven McDonagh, Sarah Parisot, and Gregory Slabaugh. DeepLPF: Deep local parametric filters for image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12826–12835, 2020.

[7] Ruixing Wang, Qing Zhang, Chi-Wing Fu, Xiaoyong Shen, Wei-Shi Zheng, and Jiaya Jia. Underexposed photo enhancement using deep illumination estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6849–6857, 2019.

[8] Hui Zeng, Jianrui Cai, Lida Li, Zisheng Cao, and Lei Zhang. Learning image-adaptive 3D lookup tables for high performance photo enhancement in real-time. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[9] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
(a) Input (b) RSGUnet [3] (c) DPED [4] (d) HPEU [2] (e) DeepLPF [6] (f) UPE [7] (g) HDRNet [1] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 2: Results comparison on ‘a4050’ of the 480p MIT-Adobe FiveK dataset, and corresponding error maps. For each pair of results, the upper image is an enhanced image, and the image below is an error map between the enhanced image and the ground-truth.
(a) Input (b) RSGUnet [3] (c) DPED [4] (d) HPEU [2] (e) DeepLPF [6] (f) UPE [7] (g) HDRNet [1] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 3: Results comparison on ‘a4816’ of the 480p MIT-Adobe FiveK dataset, and corresponding error maps.
(a) Input (b) RSGUnet [3] (c) HPEU [2] (d) UPE [7]
(e) HDRNet [1] (f) 3DLUT [8] (g) Ours (h) Ground-truth
Figure 4: Results comparison on ‘a4163’ of the full-resolution MIT-Adobe FiveK dataset, and corresponding error maps. For each pair of results, the upper image is an enhanced image, and the image below is an error map between the enhanced image and the ground-truth.
(a) Input (b) RSGUnet [3] (c) HPEU [2] (d) UPE [7]
(e) HDRNet [1] (f) 3DLUT [8] (g) Ours (h) Ground-truth
Figure 5: Results comparison on ‘a1544’ of full-resolution MIT-Adobe FiveK dataset, and corresponding error maps.
(a) Input (b) DeepLPF [6] (c) DPED [4] (d) HPEU [2] (e) RSGUnet [3] (f) HDRNet [1] (g) UPE [7] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 6: Results comparison on ‘1671’ of the HDR+ burst photography dataset released by [8], and corresponding error maps. We can see that our method gets less error and fewer artifacts in the sky than the other methods.
(a) Input (b) DeepLPF [6] (c) DPED [4] (d) HPEU [2] (e) RSGUnet [3] (f) HDRNet [1] (g) UPE [7] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 7: Results comparison on ‘1517’ of the HDR+ burst photography dataset released by [8], and corresponding error maps. We can see that our method gets less error in both the foreground and the background areas than the competing approaches.
(a) Input (b) DeepLPF [6] (c) DPED [4] (d) HPEU [2] (e) RSGUnet [3] (f) HDRNet [1] (g) UPE [7] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 8: Results comparison on ‘0043 20161006 162052 490’ of the HDR+ burst photography dataset (Ours), and corresponding error maps. We can see that our method gets less error and fewer artifacts in the sky than the other methods.
(a) Input (b) DeepLPF [6] (c) DPED [4] (d) HPEU [2] (e) RSGUnet [3] (f) HDRNet [1] (g) UPE [7] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 9: Results comparison on ‘0006 20160726 110609 666’ of the HDR+ burst photography dataset (Ours), and corresponding error maps. We can see that our method gets less error in both the foreground and the background areas than the competing approaches.
(a) Input (b) DeepLPF [6] (c) DPED [4] (d) HPEU [2] (e) RSGUnet [3] (f) HDRNet [1] (g) UPE [7] (h) 3DLUT [8] (i) Ours (j) Ground-truth

Figure 10: Results comparison on ‘5a9e 20150403 162152 482’ of the HDR+ burst photography dataset (Ours), and corresponding error maps. HPEU, HDRNet, UPE and 3DLUT get large errors in both the building and sky areas. RSGUNet performs better on the sky area, though the right side of the sky has a larger error. Artifacts occur in the sky area of the DPED result, and some building details are lost. DeepLPF shows quality comparable to our method, but it is not a 4K real-time algorithm.
(a) RSGUnet [3] (b) DPED [4] (c) HPEU [2] (d) UPE [7]
(e) HDRNet [1] (f) 3DLUT [8] (g) Ours (h) Ground-truth
Figure 11: Results comparison on ‘0382 20150924 100900 333’ of the full-resolution HDR+ burst photography dataset, and corresponding error maps. Our result is the closest to the ground truth in color and perception. DPED and HPEU show visible banding artifacts in the sky, while RSGUNet, DeepUPE, HDRNet and the original 3DLUT suffer from color bias in both the sky and the grass. Our spatial-aware 3DLUT has the smallest error with respect to the ground truth and the most pleasant local contrast.
(a) RSGUnet [3] (b) DPED [4] (c) HPEU [2] (d) UPE [7]
(e) HDRNet [1] (f) 3DLUT [8] (g) Ours (h) Ground-truth
Figure 12: Results comparison on ‘0919 20150910 150832 572’ of the full-resolution HDR+ burst photography dataset, and corresponding error maps.
(a) RSGUnet [3] (b) DPED [4] (c) HPEU [2] (d) UPE [7]
(e) HDRNet [1] (f) 3DLUT [8] (g) Ours (h) Ground-truth
Figure 13: Results comparison on ‘1125 20151229 192447 145’ of the full-resolution HDR+ burst photography dataset, and corresponding error maps.
(a) Input (b) RSGUnet [3] (c) DPED [4]
(d) HPEU [2] (e) UPE [7] (f) HDRNet [1]
(g) 3DLUT [8] (h) Ours (i) Ground-truth
Figure 14: One failure case on ‘5a9e 20141005 162240 437’ of the full-resolution HDR+ burst photography dataset. Although our model is the closest to the ground truth in most areas, halo artifacts are sometimes visible in our results where the color changes sharply (e.g., around trunks). We believe this is caused by the unsmooth pixel-wise category weights generated by our two-head weight predictor (i.e., the category weights change non-smoothly across space). Such a drawback can potentially be addressed by introducing the additional smoothness loss described in our paper.
(a) Input (b) RSGUnet [3] (c) DPED [4]
(d) HPEU [2] (e) UPE [7] (f) HDRNet [1]
(g) 3DLUT [8] (h) Ours (i) Ground-truth
Figure 15: One failure case on ‘0006 20160722 101752 239’ of the full-resolution HDR+ burst photography dataset. Our model shows the best local contrast in most cases; however, all methods suffer from banding artifacts in extremely dark regions where the signal-to-noise ratio is poor. Since our inputs are 8-bit, very little useful information is carried in extremely dark areas and noise dominates the signal. Possible solutions include cooperating with a denoising preprocessing step and replacing the low-precision 8-bit inputs with higher-precision 16-bit images.