

Received July 31, 2021, accepted September 3, 2021, date of publication October 5, 2021, date of current version October 15, 2021.

Digital Object Identifier 10.1109/ACCESS.2021.3118295

Subjective Image Quality Assessment With Boosted Triplet Comparisons

HUI MEN, HANHE LIN, MOHSEN JENADELEH, (Member, IEEE), AND DIETMAR SAUPE
Department of Computer and Information Science, University of Konstanz, 78464 Konstanz, Germany

Corresponding author: Hui Men ([email protected])

This work was supported by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Project 251654672—TRR 161 (Project A05).

This work involved human subjects or animals in its research. Approval of all ethical and experimental procedures and protocols was granted by the Institutional Review Board of the University of Konstanz.

ABSTRACT In subjective full-reference image quality assessment, a reference image is distorted at increasing distortion levels. The differences between the perceptual image qualities of the reference image and its distorted versions are evaluated, often using degradation category ratings (DCR). However, DCR has been criticized since differences between rating categories on this ordinal scale might not be perceptually equidistant, and observers may have different understandings of the categories. Pair comparison (PC) of distorted images, followed by Thurstonian reconstruction of scale values, overcomes these problems. In addition, PC is more sensitive than DCR, and it can provide scale values in fractional, just noticeable difference (JND) units that express a precise perceptual interpretation. Still, the comparison of images of nearly the same quality can be difficult. We introduce boosting techniques embedded in more general triplet comparisons (TC) that increase the sensitivity even more. Boosting amplifies the artefacts of distorted images, enlarges their visual representation by zooming, increases the visibility of the distortions by a flickering effect, or combines some of the above. Experimental results show the effectiveness of boosted TC for seven types of distortion (color diffusion, jitter, high sharpen, JPEG 2000 compression, lens blur, motion blur, multiplicative noise). For our study, we crowdsourced over 1.7 million responses to triplet questions. We give a detailed analysis of the data in terms of scale reconstructions, accuracy, detection rates, and sensitivity gain. Generally, boosting increases the discriminatory power and allows reducing the number of subjective ratings without sacrificing the accuracy of the resulting relative image quality values. Our technique paves the way to fine-grained image quality datasets, allowing for more distortion levels, yet with high-quality subjective annotations. We also provide the details for Thurstonian scale reconstruction from TC and our annotated dataset, KonFiG-IQA, containing 10 source images, processed using 7 distortion types at 12 or even 30 levels, uniformly spaced over a span of 3 JND units.

INDEX TERMS Subjective quality assessment, full-reference, artefact amplification, zooming, flicker test, triplet comparisons, scale reconstruction, just noticeable difference.

I. INTRODUCTION

Full-reference image quality assessment (FR-IQA) quantifies the perceptual image qualities of distorted versions of pristine reference images. In addition, FR-IQA quantifies the trade-off between the bitrate and perceived quality in perceptual image compression, which helps in the optimisation of encoding parameters. Similarly, the development of other image processing applications such as image restoration and enhancement may profit from knowing the expected perceptual quality of their output images.

The associate editor coordinating the review of this manuscript and approving it for publication was Mingbo Zhao.

Since it is not feasible to assess perceptual image quality by a subjective study each time in such applications, automated FR-IQA algorithms must be used that estimate the quality from the image data without any human interaction. To develop and train such FR-IQA algorithms, annotated image datasets, derived from subjective studies, are required. In such studies, images are judged by subjects according




to their perceived quality, either individually or in comparison with one or more other images. This paper contributes boosting methods for the presentation of the image stimuli in subjective studies that improve the accuracy and sensitivity of the perceptual measurements.

A. SUBJECTIVE FULL-REFERENCE IMAGE QUALITY ASSESSMENT

In subjective studies, test stimuli may be presented one at a time and rated according to a 5-point absolute category rating (ACR) scale, i.e., Bad (1), Poor (2), Fair (3), Good (4), and Excellent (5). For each stimulus, the integer values of the ratings from many subjects are averaged, yielding corresponding mean opinion scores (MOS), which serve as scalar perceptual image qualities [1]. ACR is likely to lead to low sensitivity in distinguishing among stimuli of similar qualities. A modified version of ACR, degradation category rating (DCR), provides higher sensitivity [2], [3]. In a DCR test, distorted stimuli are presented with their references, either sequentially or simultaneously side by side. Stimuli are rated according to the 5-point DCR scale, namely Very Annoying (1), Annoying (2), Slightly Annoying (3), Perceptible but not Annoying (4), and Imperceptible (5). Average ratings are called degradation mean opinion scores (DMOS).

The approaches mentioned above, although straightforward, have some limitations.

(1) Observers may have different understandings of the quality categories [4], [5], leading to large variances of ratings and therefore requiring a large number of ratings to achieve the desired precision of the mean opinion scale of ACR or DCR.

(2) There is the danger of a saturation effect. If a subject scores an image in the best (or worst) quality category, another item to be judged may come up with a perceived quality that is even significantly better (or worse). Then this item can only be scored with the same category as before. There is no way to correct previously assigned quality values to accommodate the overall larger than expected dynamic range in quality.

(3) ACR and DCR scales should be regarded as ordinal, not interval scales [4]–[6], even though in practice the categories Bad, Poor, etc. are linked to the numerical values 1, 2, and so on. This means that pairs of stimuli with an equal difference in MOS, resp. DMOS, are not generally perceived to have the same perceptual distance.

(4) Given the mean opinion scores s and s + Δs of two images, there is no meaningful interpretation for the difference in perceptual quality based on s and Δs. It would be desirable to have a scaling property similar to that provided by the peak signal-to-noise ratio (PSNR) for the case of objective image quality. For example, if two distorted images have a difference of 1 dB in PSNR, then we know that the mean square error in one image is 10^0.1 ≈ 1.259 times as large as that in the other one.

The pair comparison method (PC) is an alternative to ACR and DCR. In the 2-alternative forced-choice (2AFC) setting, observers are presented with pairs of test images and asked to identify the image in each pair with less distortion, i.e., the image with better quality. The PC method is an indirect measurement, and scale values cannot be generated simply by averaging ratings. Instead, an algorithm that "reconstructs" the latent quality scale values is required.

Many reconstruction methods have been proposed. In psychophysics, the field that studies the relationship between physical stimuli and the perceived experiences they evoke, methods based on probabilistic models have become the de facto standard for this purpose. In Thurstonian models, the perception of each stimulus is modelled quantitatively as a scalar Gaussian random variable. The random variables corresponding to sequences of stimuli under investigation are most commonly taken to have the same variance of 0.5, so that quality differences in PC have a unit variance when the independence of the random variables is assumed. This setting defines the units of measurement. Initially, following Thurstone's pioneering work [7] in 1927, a least-squares approach was taken to solve for the reconstructed scale values based on his model. Nowadays, the method of choice is maximum likelihood estimation (MLE) [8].
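To make the reconstruction step concrete, the following is a minimal sketch (our illustration, not the code used in this paper) of MLE scale reconstruction from pair comparisons under this model. With unit-variance quality differences, the probability that stimulus i is preferred over stimulus j is Φ(µi − µj); the input W is a hypothetical count matrix with W[i, j] votes for "i better than j":

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def reconstruct_pc(W):
    # W[i, j] = number of votes 'stimulus i better than stimulus j'
    M = W.shape[0]
    def nll(mu):
        d = mu[:, None] - mu[None, :]              # pairwise mean differences
        p = np.clip(norm.cdf(d), 1e-9, 1 - 1e-9)   # P(i preferred over j)
        return -np.sum(W * np.log(p))
    # fixing mu_0 = 0 removes the translation invariance of the likelihood
    res = minimize(lambda m: nll(np.concatenate(([0.0], m))),
                   x0=np.zeros(M - 1), method="L-BFGS-B")
    return np.concatenate(([0.0], res.x))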

The PC method overcomes all of the above-listed limitations of ACR and DCR.

(1) Such a comparison does not rely on the particular interpretation of a nominal category of quality. Therefore the task is clear and more natural than ACR and DCR.

(2) By design, the saturation effect is eliminated.

(3) The reconstruction for PC yields quality values on an interval scale. Specifically, according to the Thurstonian model, pairs of stimuli with an equal difference in scale value are perceived as having the same perceptual distance in the sense of the old, famous psychological rule of thumb: equally often noticed differences are equal, unless always or never noticed [9]. Thus, for a difference of Δs on the perceptual quality scale, the proportion of subjects who consider the stimulus with the larger latent scale value to be the one with better quality is a function of Δs alone.

(4) By appropriately choosing the variance of the Gaussian distribution in the Thurstonian model, we can define the perceptual scale such that a one unit difference between the two values of a pair corresponds to a fraction of 75% of the subjects choosing the correct better quality item among the pair. This corresponds to the usual definition of the just noticeable difference (JND): the JND is the perceptual quality difference for which the probability of detecting the better quality image is 50%. In PC (2AFC) for this case, therefore, half of the subjects will detect and choose the correct stimulus as the better one, while the other half has to guess and will be correct half of the time, leading to the 75% ratio. (A short derivation of the implied variance is given after this list.)

Based on the first point above, forced-choice PCs are easier to decide in subjective trials and require less time than the ACR or DCR categorization task. In addition, the PC method yields the lowest measurement variance and thus provides the most accurate results [10].

In a slight variation of PC, the reference stimulus is placed in the middle of the pair of stimuli to be compared. The task of the observers is to select the one that looks more (or less) similar to the reference [11]–[14]. Similar to DCR, assessing the (dis)similarity of the distorted stimuli to the reference should lead to a more appropriate, informed choice of the subject than a PC without reference. Such an approach can be seen as a special case of triplet comparisons (TC). In TC tests, three stimuli are displayed simultaneously, and the (dis)similarity to the stimulus placed in the middle, called the pivot, is to be compared. In general, the pivot can be any element of the sequence of distorted stimuli (general TC), and the PC with reference corresponds to TC where the pivot is fixed and equal to the reference for all comparisons. In our main experimental study, we applied the latter approach, which we call baseline TC. In a secondary experiment, we used general triplets and showed their potential to further increase the performance of FR-IQA compared to DCR and baseline TC.

Like PC, TC avoids the problems of ACR and DCR discussed above. The subject responses to baseline TC can be interpreted as answers to the corresponding 2AFC questions that show only the two distorted stimuli with the reference. Therefore, the same Thurstonian reconstruction may be applied. However, for general triplet comparisons with arbitrary pivot stimuli, this is not possible. Triplet comparisons had already been introduced in psychophysics by Torgerson [15] in 1958, and recently found much interest in vision science and machine learning [16]. Many scale reconstruction methods for TC have been proposed. However, with just one exception, none of them is based on the Thurstonian model, which would allow scale values to be given in meaningful JND units as discussed above. Hence, in this paper, we propose a complete method for scale reconstruction from triplet comparisons, based on Thurstone's model and MLE, to produce scale values expressed in perceptual JND units.

B. CURRENT VISUAL QUALITY DATASETS

In addition to FR-IQA, there are other applications in which a visual reference stimulus is distorted to various levels of severity and compared. For example, in full-reference video quality assessment (FR-VQA), short image sequences of a few seconds are viewed and evaluated for visual quality. Since video data transmission is the dominant load on Internet traffic, FR-VQA for compressed video streaming is the most relevant application of visual quality assessment methods. Recently, it has been proposed to replace the quality assessment scale based on MOS or DMOS from subjective categorical or nominal ratings by the just noticeable difference [36]. In JND assessments, the distorted reference images or videos are also compared to the reference, and the minimal distortion level that leads to a perceivable difference of the stimulus is reported. In all of these cases of visual quality assessment, the boosting techniques that we have developed can be applied to increase the sensitivity, allowing for a more fine-grained visual analysis of the range of distortions.

In Table 1 we present an overview of the currently available datasets for subjectively assessed visual quality for FR-IQA, FR-VQA, and JND. In this paper, we contribute a new dataset, KonFiG-IQA (Konstanz Fine-Grained IQA), which is also listed in the table. Several points are noteworthy:

(1) Quality and JND assessment techniques of the current datasets are dominated by the classical approaches such as ACR, DCR, and their variations, where a discrete or continuous ratio scale replaces the categorical scale. For TID2008, TID2013, and MDID, baseline triplet comparisons were carried out. For the scale reconstruction, the Swiss-system tournament style point scoring, resp. a ranking procedure based on insertion sort, was used. The only dataset for which a probabilistic MLE-based reconstruction from comparisons was carried out is MCL-V, without final conversion into JND units. Here, the Bradley-Terry model was used, which is very similar to the Gaussian Thurstonian model. Thus, in all current IQA and VQA datasets, possibly except for MCL-V, artefacts due to the nonlinear scaling of perceptual quality may be present.

(2) In all current IQA and VQA datasets, only a small number of distortion levels were applied, up to 6 for images and up to 11 for video. This choice corresponds to the small number of only 5 nominal quality values available in ACR and DCR. It would be desirable to introduce IQA and VQA datasets with a larger number of distortion levels, especially at the high end of quality. This would allow for training machine learning techniques for objective quality assessment aimed at applications delivering high quality imagery and streaming video over the Internet at a minimal but sufficient bitrate.

(3) Two recent trends can be observed. In 2019, the first crowdsourced FR-IQA dataset was introduced (KADID-10k), and more are likely to come, like the two sets included in this paper. Moreover, since 2016 the first datasets for JND were established, both for images and video, and more of them can be expected, including crowdsourced JND datasets.

The new dataset from our study, KonFiG-IQA, stands out from the rest of the FR-IQA datasets in the following aspects:

(1) The number of distortion levels is larger and designed by perceptual consideration, namely 12, resp. 30, distortion levels, perceptually equally spaced over a range of 3 JND.

(2) The number of ratings, resp. triplet comparisons, averaged per distorted image, is much larger, with 97 per image in Part A and up to 875 in Part B. This allowed for an extensive analysis of the reliability and convergence of the resulting scale values.

(3) The scale values are derived from the probabilistic Thurstonian MLE process and converted to give perceptually linear quality scale values in meaningful JND units.

C. BOOSTING FOR VISUAL QUALITY ASSESSMENT: MOTIVATION

When designing visual quality datasets, the creators usually try to cover the complete range of visual quality with only a few samples taken from a very large pool of images or videos. In the case of FR-IQA or FR-VQA, for each source stimulus, a large range of values for the distortion parameters of the chosen distortion types may be applied, yielding stimuli that are very close to the original as well as others with severe distortions. It may be hard to reliably assess the resulting small and large quality differences in subjective quality assessment experiments. Let us illustrate this, assuming the Thurstonian model for visual quality impairment.

Consider a sequence of M + 1 images I0, …, IM, where I0 is a pristine source image and I1, …, IM are increasingly distorted versions of the source. Let their image qualities on the impairment scale be modelled by random variables with Gaussian distributions with means µ0, …, µM and standard deviations equal to

σ = √0.5 / Φ⁻¹(0.75) ≈ 1.0484,

where Φ is the normal cumulative distribution function (CDF). This particular choice of the variance scales the quality values to be expressed in convenient JND units, as pointed out in Subsection I-A.

When comparing two such stimuli that are close to each other in their mean values, the corresponding effect size Θ determines the difficulty of assessing their difference. A common way to define the effect size is the standardized mean difference. In our case,

Θ = |µi − µj| / σ ≈ 0.9539 Δµi,j.

Smaller effect sizes indicate the necessity of larger sample sizes. Effect sizes Θ ∈ [0.2, 0.5), [0.5, 0.8), [0.8, 1.3), and Θ ≥ 1.3 are called small, medium, large, and very large, respectively. For example, at Δµ = 1 JND, we have Θ = 0.9539 ∈ [0.8, 1.3), and thus a large effect size. This is in line with the detection rate of 50% at 1 JND quality difference. However, for differences Δµ < 0.2097 JND we have Θ < 0.2 and therefore a very small effect size.

Typically, image quality datasets have hundreds of images with perceptual qualities ranging over just a few JND. If one were to compare these images to each other, such small effect sizes would become relevant. Therefore, quality assessment techniques that enlarge the effect size would be beneficial, allowing one to distinguish image qualities with small differences using a smaller number of samples. For this purpose, we propose and study our boosting methods in this contribution.
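As a quick numerical check of the constants above (a sketch, not part of the original study):

import numpy as np
from scipy.stats import norm

sigma = np.sqrt(0.5) / norm.ppf(0.75)  # JND-unit standard deviation
print(sigma)                           # approx. 1.0484
print(1.0 / sigma)                     # approx. 0.9539, effect size per JND
print(0.2 * sigma)                     # approx. 0.2097 JND, 'very small' threshold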

It is quite natural that small visual differences are difficult to assess. It has been observed, in addition, that large quality differences are hard to quantify: stimulus differences larger than about 1.5 JND cannot be reliably assessed by the human visual system, presumably due to a kind of saturation effect by overwhelming noise [37].

In addition to the problem of subjectively quantifying large distortions reliably, there is the numerical problem of reconstructing such large quality differences from paired comparisons with the subsequent Thurstonian scale reconstruction. In the Thurstonian model, a quality difference is given by a normal random variable with unit variance and the mean equal to the quality difference on the perceptual quality scale. Thus, a fraction p ∈ (0, 1) of observations that correctly identify the better quality image in the given pair gives rise to the reconstructed quality difference Φ⁻¹(p), where Φ again denotes the normal CDF. However, when the quality difference is large, most observers (k out of n) will agree on which image is of better quality. Therefore, the fraction p = k/n is close to 1, and Φ⁻¹(p) depends very sensitively on p when p is near 1 or 0. For k = n, we even have p = 1 and Φ⁻¹(1) = ∞. So, if just one observer changed his/her response, the reconstructed quality difference between the stimuli would change drastically.

We analyse this effect by simulating subjective PC using the probabilistic model to compute Erms, the root of the expected square error of the reconstruction, as follows. We assume a quality difference of Δµ and collect n responses for the corresponding pair comparison. Then we make use of the binomial distribution with probability Φ(Δµ) to get the result:¹

E_{\mathrm{rms}}(\Delta\mu, n) = \left[ \sum_{k=0}^{n} \binom{n}{k} \Phi(\Delta\mu)^k \left(1 - \Phi(\Delta\mu)\right)^{n-k} \left( \Phi^{-1}\!\left(\tfrac{k}{n}\right) - \Delta\mu \right)^2 \right]^{1/2}

Figure 1 illustrates this root-mean-square error (RMSE) as a function of Δµ for n = 5, 10, 20, 40. For increasing quality differences, we see that Erms is stable and nearly constant until about 2 or 3 JND. From then on, the error increases approximately linearly. This gives another reason to restrict paired comparisons (resp. triplet comparisons) to cases where image quality differences are not too large, i.e., up to about 2 or 3 JND.

With our boosting methods implemented to enlarge the distortions applied to source images, this effect of decreased psychovisual sensitivity and increased reconstruction noise at large distortion levels can be partially overcome, as our experiments will show. Moreover, by boosting image differences also between distorted images, we will show that fine-grained quality scaling can be achieved reliably even for large distortions.

¹The straightforward implementation of this formula will lead to the so-called zero-frequency problem when k = 0 or k = n. In these cases, Φ⁻¹(k/n) will be ±∞, rendering a reconstruction infeasible. To avoid this problem, it is common practice to install a 'prior' by adding half a vote to either option, thus computing Φ⁻¹((k + 0.5)/(n + 1)), which is what we have done here as well.
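The simulation behind Figure 1 can be sketched in a few lines (our reimplementation of the formula above, using the half-vote prior from the footnote; not the authors' original code):

import numpy as np
from scipy.stats import binom, norm

def erms(dmu, n):
    # E_rms(dmu, n): RMSE of the reconstructed quality difference,
    # with the half-vote 'prior' applied to avoid +-infinity at k = 0, n
    k = np.arange(n + 1)
    w = binom.pmf(k, n, norm.cdf(dmu))   # binomial distribution of votes
    rec = norm.ppf((k + 0.5) / (n + 1))  # reconstructed difference per outcome
    return np.sqrt(np.sum(w * (rec - dmu) ** 2))

# qualitative behaviour shown in Figure 1
for n in (5, 10, 20, 40):
    print(n, [round(erms(d, n), 3) for d in (0.5, 1.0, 2.0, 3.0, 4.0)])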


TABLE 1. IQA/VQA/JND datasets with artificial distortions.

D. BOOSTING BY ARTEFACT AMPLIFICATION AND ZOOMING

The approach of boosting for visual quality assessment is to enlarge the differences between stimuli artificially. In the basic setting, using baseline triplets for comparison, we are asking subjects to identify the distorted image of the presented pair that most closely resembles the reference image, which is equal to the source image for generating the distorted versions. Here it is the distortions in each derived image that need to be enlarged for the boosting effect. Distortions typically occur in two main aspects, in terms of greyscale intensity, respectively color, or spatially. Therefore, the first two boosting techniques are as follows:

(1) Artefact amplification (A): In a triplet comparison, the similarity of two distorted images with respect to the pivot image has to be judged. Artefact amplification scales the pixel-wise differences of each distorted image in the three color channels linearly.

(2) Spatial zooming (Z): A linear scaling of the size of the image enlarges the visual representation of image differences. Due to limitations on the available screen size, spatial zooming may make it necessary to crop the distorted and zoomed image.

In Figure 2 we show an example of how this boosting reveals otherwise invisible or hard-to-detect distortions caused by JPEG 2000 compression.

FIGURE 1. In this simulation, we consider the reconstruction of a quality difference from a set of n pair comparisons of two stimuli with a given difference of mean observed qualities, shown on the horizontal axis. On the vertical axis, the resulting RMSE of the reconstruction of the difference is shown. This simulation confirms that it is not advisable to apply paired comparisons when the difference to be assessed is so strong that nearly all observers agree on which of the two stimuli should be chosen as the better (or stronger) one.

E. BOOSTING BY THE IMAGE FLICKER METHOD

In PC, the image stimuli are usually displayed on a screen side by side. To detect a small detail that differs between two images, the observer must search over both images and memorize the last examined detail of one image when the eye fixation point moves to the corresponding location in the other image. This task can be difficult. It is at the core of the popular fun game for kids, where differences between two seemingly identical comic drawings are to be found.

The change detection task can be simplified by displaying the two images in the same screen space, one in the foreground and the other one invisible in the background. The observer can toggle the view between the two images using keystrokes or mouse clicks. In this setting, the eye needs to scan only half of the screen space, and the saccadic eye movements between the two images are replaced by key clicks. Moreover, from an evolutionary perspective, it is plausible that the fundamental ability of human perception to detect a change in the visual environment has developed to a high level. Thus, small changes in the visual field should be easier to detect than the differences between two objects (or images) next to each other.

The visual sensitivity to contrast change has been researched for a long time [38]. For simple test scenes, contrast is defined as the ratio of the target intensity to the background intensity. When the contrast changes periodically, e.g., in a sinusoidal fashion, the change becomes visible when its amplitude surpasses a certain threshold, called the contrast threshold. Its inverse is the contrast sensitivity, and its variation as a function of temporal frequency can be described by the temporal contrast sensitivity function (TCSF). For sufficiently high luminances, the contrast sensitivity reaches a maximum of about 200 near a frequency of 8 Hz.

FIGURE 2. This figure shows an example of boosting by artefact amplification and zooming. The top left image is an original, undistorted source image. On the right, a zoomed crop of it is pictured. The source image was compressed by JPEG 2000 with a compression ratio of 1:47.34, resulting in the lower-left image. The distortions due to JPEG 2000 compression are barely visible. However, on the lower right, the same compressed image is shown after artefact amplification and zooming. Now, by this boosting technique, the distortions in the flower petals and stigma became emphasized and clearly visible.

In 2014, the above concepts were applied for the first time in an image flicker viewing method for subjective assessment of barely visible image compression artefacts [39]. Observers were presented with an original reference image, temporally interleaved with a test image, which was reconstructed from the compressed reference. The flicker frequency was chosen as 7.5 Hz, close to the maximum of the TCSF. This method was expected to make even subtle artefacts visible that would be undetectable in a side-by-side comparison. The paper did not compare the performance with that achievable by the side-by-side display, however.

In our work, we provide such studies by including the flicker viewing method as our third option of boosting techniques:

(3) Image flicker (F): Two images to be compared are displayed temporally, interleaved at a frequency of 8 Hz.

The application of the flicker viewing technique in a triplet comparison requires adapting the visual appearance of the displayed scene. The triplet (i, j, k) has the pivot image Ij, and we are asking the observer to determine whether the perceived differences in the left image pair (Ii, Ij) are larger or smaller than those in the right pair, (Ik, Ij). Therefore, with the flicker viewing technique, we show two flickering images side by side: on the left, image Ii alternates with the pivot, and on the right, it is Ik that alternates with the pivot.

F. CONTRIBUTIONS

We expect (and we will show) that the boosting methods outlined in the previous subsections increase the measurement sensitivity on the visual impairment scale, enabling the detection of more subtle artefacts. In our experiments, we investigated the performances by measuring sensitivities with respect to the magnitude of the applied distortion. We selected ten source images from the MCL-JCI dataset [29], each of which was distorted by seven types of distortion: color diffusion, jitter, high sharpen, JPEG 2000 compression, lens blur, motion blur, and multiplicative noise. Figure 3 shows the flowchart of our boosted triplet comparison method for subjective IQA.

Along with the boosting of sensitivity, however, we have to accept that the absolute values of impairment, given in JND units, will be different and typically larger than those obtained using plain pair or triplet comparison or by using the DCR method. For example, if a particular distortion produces an impairment of 1.5 JND, measured by plain comparison, we may obtain a much larger impairment of perhaps as much as 3 JND when using one of the boosting methods. Therefore, we have also devised a method to adaptively transform the boosted impairment quality values back so that they approximately match the range of impairment scales as measured by plain comparison, however, without sacrificing the better discrimination ability.

To summarize, in this paper, we present the first study on the potential of perceptual boosting techniques in the context of subjective image quality assessment. The main contributions are the following:

1) We propose three boosting strategies (artefact amplification, zooming, and flicker) that can enlarge the sensitivity of pair and triplet comparisons, as well as increase the accuracy of visual quality assessment.

2) We propose a method based on Thurstone's model and MLE to reconstruct the perceptual qualities of images from triplet comparisons.

3) We generate an IQA dataset of 1140 images with distortions from 10 source images. We provide the responses to all triplet comparisons from a large subjective crowdsourcing campaign together with the reconstructed quality values from plain triplet comparison and seven types of boosted triplet comparison. This dataset will be made available after publication.

4) We provide an extensive performance analysis of boosted triplet comparisons for image quality assessment, including measures of the true positive responses, detection rates, sensitivity, effect size, convergence, correlation, and time complexity.

G. GLOSSARY

Sequence. A sequence is a list of images (I0, I1, …, IK) where I0 is a pristine reference image and the others are increasingly distorted versions of it. The index k of Ik is called the distortion level of Ik.

Triplet. A triplet is a list (i, j, k) of three non-negative indices referring to three stimuli (Ii, Ij, Ik) of an image sequence. In a triplet comparison, the observer is asked to determine whether the left or the right image (Ii or Ik) is perceptually closer to the so-called pivot image Ij in the middle.

Baseline triplet. A triplet of the form (i, 0, k), where the pivot is the pristine reference image I0.

General triplet. A triplet of the form (i, j, k), where the pivot can be any image (the pristine reference image or a distorted one).

Pivot. The stimulus that is placed in the middle of a triplet.

JND. The natural unit on the perceptual scale. The perceptual difference between the two compared images is 1 JND (just noticeable difference) when the difference is perceived by a random observer of the population of observers with a probability of 0.5.

HIT. A human intelligence task is a self-contained, virtual task that a worker can work on. A HIT may contain several questions to be answered.

Assignment. A completed HIT that is submitted by a unique worker.

Response. Equals vote or answer for a pair or triplet comparison.

Rating. Selected image quality or degree of impairment in an ACR or DCR assessment.

Plain. A plain triplet comparison is a conventional one, where the images to be compared are not processed (boosted).

A, Z, F. Abbreviations for triplet comparisons with artefact amplification, zooming, and flicker, respectively.

AZ, AF, ZF, AZF. Abbreviations for triplet comparisons with combinations of A and Z, A and F, Z and F, and A, Z, and F.

FIGURE 3. Flowchart of the proposed boosted triplet comparison method for subjective IQA. Each pristine source image is distorted by seven types of distortion at N levels, respectively. For a given distortion type, samples of three distorted images are drawn, and for each triplet, subjective assessments are made as to whether the left or the right image is perceptually closer to the center, pivot image. Three boosting strategies (amplification, zoom, and flicker) are deployed individually or in combination to enhance the perceptual sensitivity of comparisons between distorted images. With the flicker option turned on, only two (flickering) images are shown side by side. The yielded triplet comparison results are to be used to estimate image quality impairment scales using Thurstonian reconstruction.

II. RELATED WORK

A. BOOSTING BY ARTEFACT AMPLIFICATION AND ZOOMING

Amplification and zooming are common techniques applied in image and video processing, for the purpose of highlighting details in an image or for overall image enhancement. For instance, histogram equalization enhances contrast by enlarging small intensity differences. Image sharpening enhances the appearance of edges, e.g., by adding to the input image a signal that is a scaled high-pass filtered version of the original image. Motion and color magnification reveals subtle variations in video sequences that cannot be perceived by human senses, such as heartbeats showing in faces [40], and can even reconstruct sounds from videos of objects subtly vibrating in response to those sounds [41]. Exaggeration is also one of Disney's twelve basic principles of animation [42], applied in particular to motion. The cartoon characters were designed to maintain the illusion that they follow the laws of physics, however, in a wilder, more extreme form.

In spite of the widespread use of amplification and zooming in multimedia applications, we are not aware of any previous work on adapting, applying, and validating these techniques for the purpose of subjective visual quality assessment.

The only exception is our earlier work [43], in which we applied image distortion amplification and zooming to properly cropped frames to compare interpolated video frames with the corresponding ground-truth frames. However, this technique was a side issue in that contribution, and there was no systematic analysis of the performance and potential of amplification and zooming.

B. BOOSTING BY THE IMAGE FLICKER METHOD

In [39], flicker tests were proposed for the task of determining the JND of distorted images. In order to test for noticeable distortions in compressed images, two images were placed side by side. One of them was a still reference image. The other was an animation in which a reference image alternated with the distorted image at the same spatial position. The display frame rate was 30 fps, and the alternation occurred every four frames, yielding a flicker frequency of 7.5 Hz. Observers were asked to identify which of the two images was the still (or non-flickering) one (two-alternative forced choice).

Shortly afterwards, the ISO/IEC standard 29170-2:2015(E) was set up with recommendations for the subjective quality evaluation of single frames from JPEG XS encoded videos [44]. The focus of the standard was on visually lossless, low-latency, and lightweight video compression schemes. Therefore, subjective tests were prescribed for image JND assessment rather than conventional MOS from ACR or DCR procedures. The standard directly follows the ideas in [39], with one notable exception: the flicker rate was recommended to be 10 Hz instead of 7.5 Hz.

The first study based on the new standard was published in 2017. In a large lab experiment with 120 subjects, the flicker test was used for video frames produced by the Video Electronics Standards Association (VESA) Display Stream Compression, which is a lightweight codec designed for visually lossless compression [45]. An in-depth discussion of this application of the flicker test for the criterion of perceptually lossless compression as prescribed in the ISO/IEC standard was presented in [46]. The authors' conclusion was that "if the goal is to conservatively evaluate the possibility that a compression artefact might be visible under any situation, then the flicker paradigm is a viable approach as it highlights differences between images regardless of whether they are noticeable in the absence of a reference." However, a quantitative comparison of the sensitivity of the flicker test protocol versus the traditional side-by-side presentation was not included.

In 2016, a JPEG XS Call for Proposals with subjective quality evaluations based on the flicker test of the ISO/IEC standard was issued [47]. The results of the submissions to the call were summarized in [48]. Different from the ISO/IEC standard recommendations, a flicker frequency of 8 Hz was applied, and subjects were given a third option for their response, namely to cast a no-decision vote. This was deemed to alleviate subject fatigue.

In other recent work, the flicker test was applied to a palette of different image modalities: high dynamic range (HDR) images [49], foveated images in head-mounted displays (HMD) [50], and stereoscopic imagery [51].

In our pre-study [52], presented at the ICME 2020 Workshop on Data-driven Just Noticeable Difference for Multimedia Communication, we provided the first experiment to compare the performance of the flicker test with conventional side-by-side comparisons. The purpose of the tests was to assess the JND for JPEG image compression. As a result, we reported that the flicker test was about twice as sensitive as the classical side-by-side comparisons with forced choice. However, this experimental study was small, and the focus was rather on a new adjustment method for JND detection using a slider-based design. Moreover, the flicker tests were done in a lab situation, while the classical 2AFC pair comparison ran on a crowdsourcing platform. So the results of the comparison regarding sensitivity are only preliminary. Our contribution here provides a much more elaborate study targeted specifically at the validation of the performance of several boosting techniques, with flicker being one of them.

The previous works on the image flicker method mentioned above have applied flicker in a single stimulus or in a double stimulus method where one of the two stimuli was a still image. We extend these procedures by comparing two flickering stimuli in the context of a triplet comparison. Moreover, in past approaches, the flickering was between the undistorted reference image and a test image. We will show the advantages of considering flicker images in comparisons where the flicker is between two test images.

C. RECONSTRUCTION OF SCALE VALUES FROM TRIPLET COMPARISONS

Direct quality assessment proceeds by collecting and averaging quality ratings from a sufficiently large set of observers. Absolute category rating is the most common technique in visual quality assessment. Scale value reconstruction is an indirect procedure, deriving the scales of latent variables from the pair comparisons of the perceptual quality or from the comparison of quality differences in triplets or quadruplets. Other approaches are possible, like reconstruction from the rankings of images in subsets of stimuli. For the application of boosting methods in subjective visual quality assessment, indirect methods seem more appropriate because boosting enhances the perception of differences between a test stimulus and its corresponding reference.

One of the first indirect approaches, based on the scaling of perceived distances of stimuli attained from triplet comparisons, was proposed in 1952 by Torgerson [53] and named the method of triads. Although the goal was multi-dimensional scaling, it is clear that the method can also be used to derive the scalar values of a latent variable. In a nutshell, for the 1D case, the reconstruction is based on a model of random variables Xi, i = 0, …, M, for the latent stimuli qualities, with the assumption that their pairwise distances

Di,j = |Xi − Xj|

are random variables with a normal distribution of unit variance. Then scale values of these distances can be reconstructed from the pairwise comparison of distances arising from the triplet comparisons. This step is analogous to the Thurstonian scale reconstruction from pair comparison of stimulus values, where instead of scales for the random variables Xi, scales for the distances Di,j are considered.

However, since the triplet comparisons (i, j, k) give information only about differences in distances (namely, whether Di,j < Dk,j), the reconstruction of the distances can be determined only up to an additive constant. In [53], the least squares solution to the problem of the additive constant was proposed.

In the end, a square matrix of distances (di,j), i, j = 0, …, M, is yielded, from which a one-dimensional embedding can be generated. One can solve the optimization problem where estimates for the latent variables are found as a minimizer of a cost function, for example,

(\hat{\mu}_0, \ldots, \hat{\mu}_M) = \arg\min_{\mu_0, \ldots, \mu_M} \sum_{i<j} \left( |\mu_i - \mu_j| - d_{i,j} \right)^2.
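A minimal sketch of this embedding step (our illustration; it assumes a distance matrix d has already been estimated and uses a general-purpose optimizer in place of whatever solver one prefers):

import numpy as np
from scipy.optimize import minimize

def embed_1d(d):
    # least-squares 1D embedding of a symmetric distance matrix d:
    # minimize sum_{i<j} (|mu_i - mu_j| - d[i, j])^2 with mu_0 fixed to 0
    M = d.shape[0]
    iu = np.triu_indices(M, k=1)
    def cost(mu):
        return np.sum((np.abs(mu[iu[0]] - mu[iu[1]]) - d[iu]) ** 2)
    # initialize with cumulative adjacent distances d[i, i+1]
    mu0 = np.concatenate(([0.0], np.cumsum(np.diag(d, k=1))))
    res = minimize(lambda m: cost(np.concatenate(([0.0], m))),
                   x0=mu0[1:], method="Nelder-Mead")
    return np.concatenate(([0.0], res.x))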

The method of triads has been criticized as being ad hoc [54], as it does not directly follow the basic setup of Thurstonian models, where the latent variables of the stimuli themselves are normally distributed. Moreover, distances are non-negative and cannot be modelled accurately by normal distributions.

In our contribution, we propose a complete solution for the reconstruction of scale values from triplet questions that strictly adheres to the considerations of Thurstonian models. Let us assume such a model of a set of normally distributed random variables Xi, i = 0, …, M, of equal variance, for the visual qualities of a corresponding set of stimuli. We will make use of a formula for the probabilities Pr(Di,j < Dk,j) for the outcome of a given triplet comparison (i, j, k) that was derived by Ennis et al. [55] in 1988.

Let Qi,j,k denote the empirical estimate of Pr(Di,j < Dk,j), given by the fraction of responses to the comparison for the triplet (i, j, k) indicating that the left stimulus Ii is closer to the pivot Ij than the right one, Ik. The task to be solved is then to reconstruct the mean values of the model such that the model predictions for the TC outcomes match the empirical data:

Pr(Di,j < Dk,j) ≈ Qi,j,k.

For this purpose, in [55] the method of least squares was proposed,

\min_{\mu_0, \ldots, \mu_M} \sum_{i<k,\; j \neq i,k} \left( Q_{i,j,k} - \Pr(D_{i,j} < D_{k,j}) \right)^2.

This is equivalent to the MLE in which the prediction errors Pr(Di,j < Dk,j) − Qi,j,k are modelled as independent normal random variables with equal variance [56]. This assumption generally cannot hold since for small or large probabilities Pr(Di,j < Dk,j) near 0, resp. 1, the error distribution necessarily must be skewed. Therefore we favor the general MLE method, i.e., to maximize the model likelihood of the set of observations Qi,j,k.

While this choice follows the common approach taken in psychometrics [57], the most widely used one-dimensional scale reconstruction method for vision science applications is probably maximum likelihood difference scaling (MLDS). It solves the difference scaling problem of quadruplet questions (i, j, k, l), where the perceptual distance of the first pair of stimuli, (Ii, Ij), is compared to that of the second pair, (Ik, Il), in a 2AFC setting. In MLDS, the decision variable employed by an observer of such quadruplet questions is modelled as

Z = |xj − xi| − |xl − xk| + Nσ,

where xi, i = 0, …, M, are the (crisp) unknown qualities and Nσ is a zero-mean Gaussian noise term with variance σ². The unknown variance characterizes the difficulty of the particular set of quadruplet questions together with the uncertainty of the subjects.

It is worth noting that MLDS presents an approach of a fundamentally different type than Thurstonian models. In Thurstonian models, the decision variable is Z = |Xj − Xi| − |Xl − Xk| and deterministic, but the qualities on the perceptual scales are uncertain, given by normally distributed random variables Xi, i = 0, …, M. In MLDS, it is just the opposite. The quality values are crisp, while there is uncertainty in the decision variable.

The number of free parameters for the M + 2 unknowns in MLDS is M, and one may set the range of scale values to [x0, xM] = [0, 1] and solve for the variables x1, …, xM−1 and σ, using MLE. Alternatively, one can set x0 = 0, fix the variance σ² to a given value, and then solve for the scales x1, …, xM.
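As an illustration of the second parameterization (a sketch of the MLDS likelihood, not the reference implementation of the psychometrics literature): quads is a hypothetical array of quadruplet index tuples (i, j, k, l), and resp[t] = 1 if the observer judged the first pair as more different in trial t.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def mlds(quads, resp, M, sigma=1.0):
    # crisp scale values x_0..x_M with x_0 = 0 and fixed noise sigma;
    # P(first pair judged more different) = Phi(Z_mean / sigma)
    q = np.asarray(quads)
    r = np.asarray(resp)
    def nll(x_free):
        x = np.concatenate(([0.0], x_free))
        delta = (np.abs(x[q[:, 1]] - x[q[:, 0]])
                 - np.abs(x[q[:, 3]] - x[q[:, 2]]))
        p = np.clip(norm.cdf(delta / sigma), 1e-9, 1 - 1e-9)
        return -np.sum(r * np.log(p) + (1 - r) * np.log(1 - p))
    res = minimize(nll, x0=np.linspace(0.1, 1.0, M), method="Nelder-Mead")
    return np.concatenate(([0.0], res.x))

Restricting quads to the form (i, j, j, k) turns this into the triplet special case mentioned below.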

In this paper, we contribute a method for the selection of the variance σ² of the noise term such that the resulting reconstruction yields scale values in approximate JND units.

A recent survey discusses the MLDS method, its variations, and a very large number of applications in different fields [58]. The two contributions most closely related to our work on visual quality assessment are [59] and [60], where quadruplet comparisons for image sequences with distortions due to compression were undertaken and analysed by MLDS.

Although MLDS was designed for scale reconstruction from quadruplet comparisons, it is clear that it can also be applied to triplet comparisons (i, j, k), simply by restricting to quadruplets of the form (i, j, j, k). Then the decision variable is Z = |xj − xi| − |xk − xj| + Nσ. In practice, it can be expected that its normal distribution is very similar to that of the decision variable Z = |Xj − Xi| − |Xk − Xj| which arises from the Thurstonian model (Xi, Xj, Xk normally distributed with variance 1/2). However, for our particular applications in visual quality assessment (FR-IQA), we prefer a reconstruction method that can produce scale values in JND units. MLDS was not designed for this purpose.

There are several other methods for scale reconstruction from triplet comparisons, some of which have recently originated from the machine learning community [16]. Usually, these methods are for multi-dimensional scaling. Some of them can be restricted to the one-dimensional case. In Section IV, we compare our results with those computed by MLDS and stochastic triplet embedding (STE) [61]. In terms of correlation with ground truth, all methods showed excellent performance. However, as with MLDS, STE cannot be expected to yield estimates on JND scales.

Let us finally remark that there is also an ISO standard that proposed triplet comparisons [37]. Observers rate each image in a triplet using a 5-point ACR scale, and from that, all three pair comparisons in the triplet are deduced. The selection of the triplets in an experiment is prescribed and rather restricted. This setting would not support the boosting strategies discussed in this paper.

III. BOOSTING STRATEGIES

We apply three methods to boost the perceptual sensitivity of comparisons between distorted images: 1) Artefact amplification amplifies the artefacts of distorted images relative to their references, 2) Zooming enlarges the visual representation of the images, and 3) Flicker increases the perceptual sensitivity to the distortions by rapidly alternating between distorted images and their corresponding reference images.

A. ARTEFACT AMPLIFICATION (A)

Many ways can be conceived to amplify artefacts due to distortions in images. For this study, we consider one of the simplest kinds, namely the linear pixel-wise scaling of RGB color differences between the distorted and the reference images. Let v, v̂ be the RGB pixel values of a pixel in the reference and a distorted image, respectively. Then we replace v̂ by v̂′ = v + α(v̂ − v).


FIGURE 4. Illustration of artefact amplification and zooming. The upper row shows an original source image and its distorted version when jitter is applied, at a level corresponding to 0.5 JND. The artefacts are amplified with factors α = 1.5, 2, and 3 in the three top right images. Differences between the reference and the distorted image are barely visible, but with increasing amplification, they become more noticeable. The bottom row presents a zoomed version of the upper row. The visibility of the distortions is further enhanced by zooming.

Algorithm 1 Pixel-Wise Artefact Amplification

1: α ← 2                               ▷ default amplification factor
2: v ← (vr, vg, vb)                    ▷ ground truth pixel
3: v̂ ← (v̂r, v̂g, v̂b)                    ▷ distorted pixel
4: for c ∈ {r, g, b} do
5:   if v̂c − vc > 0 then
6:     αc,max ← (255 − vc)/(v̂c − vc)
7:   else if v̂c − vc < 0 then
8:     αc,max ← −vc/(v̂c − vc)
9:   else
10:    αc,max ← α
11:  end if
12: end for
13: α ← min(α, αr,max, αg,max, αb,max)
14: v̂′ ← v + α(v̂ − v)                  ▷ amplified pixel v̂′ ∈ [0, 255]³
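A vectorized NumPy version of Algorithm 1 might look as follows (a sketch consistent with the pseudocode above, not the authors' released implementation):

import numpy as np

def amplify_artefacts(ref, dist, alpha=2.0):
    # ref, dist: uint8 RGB images of equal shape (H, W, 3); the factor
    # is reduced per pixel so amplified values stay within [0, 255]
    v = ref.astype(np.float64)
    diff = dist.astype(np.float64) - v
    a_max = np.full_like(diff, alpha)            # default: no constraint
    pos, neg = diff > 0, diff < 0
    a_max[pos] = (255.0 - v[pos]) / diff[pos]    # headroom towards 255
    a_max[neg] = -v[neg] / diff[neg]             # headroom towards 0
    # one common factor per pixel: minimum over channels and the default
    a = np.minimum(alpha, a_max.min(axis=2, keepdims=True))
    return np.clip(v + a * diff, 0, 255).astype(np.uint8)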

The multiplication by the factor α > 1 ensures consistency with Fechner's law [62], which states that the subjective sensation is proportional to the logarithm of the stimulus intensity. In our context, this means that equal relative increments of distortion, i.e., the same factor α applied in artefact amplification, should correspond to equal increments of perceived impairment in these images.

However, due to the finite range of RGB color components in digital images, the linear scaling is limited, and RGB pixel values exceeding the limit must be clamped. Thus, to restrict the RGB components of v̂′ to the range [0, 255] for 24-bit color images, we reduce α accordingly for those pixels where clamping is needed. Note that this may cause a local nonlinearity and saturation effect of the artefact amplification. See Algorithm 1 for details.

TABLE 2. Fraction of pixels clamped in artefact amplification of Figure 4.

An example of artefact amplification is shown in Figure 4. Comparing the distorted image with the reference image in the first row, the distortion is hardly visible. In contrast, some distortions are noticeable after artefact amplification and become more and more obvious with the increase of the amplification factor (top row, right).

Table 2 shows the fraction of the pixel color components and the overall number of pixels that are clamped in the amplification process, for the example shown in Figure 4. These fractions are monotonically increasing with the amplification factor α. In order to avoid a widespread nonlinearity and the saturation effect due to clamping too many pixels, α should be chosen appropriately. Note that, depending on the application, more or less strong amplification can be used. In triplet comparisons, we applied the amplification relative to the pivot stimulus displayed in the center. Therefore, for baseline triplets of the sort (i, 0, k), the differences that are amplified are typically larger than for most general triplets (i, j, k) with i < j < k or i > j > k. Thus, a more conservative (i.e., smaller) amplification factor α should be used for baseline triplets. In this paper, we set α = 2 as the artefact amplification factor.

FIGURE 5. Overview of plain triplet comparison and seven types of boosted triplet comparison. In plain triplet comparison, three stimuli, selected from an image sequence with increasing distortion levels, are displayed side by side. The task is to judge which side is perceptually closer to the pivot in the center. The three boosting techniques, artefact amplification, zooming, and flicker, help to improve the accuracy, reliability, and speed of the subjective assessment. When boosting with flicker, the left and right images are displayed side by side, each alternating with the pivot eight times per second. In this case, observers judge which side has the stronger flicker effect.

B. ZOOMING (Z)

Apart from enlarging color differences by amplifying artefacts, artefacts also become more visible when enlarged spatially. In fact, participants in an IQA experiment may be tempted to enlarge the images displayed in their browser or to move closer to the screen to detect fine differences between images. However, to ensure a uniform and controlled quality assessment, participants are asked to refrain from such ad hoc zooming actions. Instead, we propose to deliver the displayed images already in a zoomed and cropped fashion.

Figure 4 (bottom row) shows an example. Images are cropped to half their linear size and zoomed by a factor of two. The cropped regions were manually selected, and bicubic interpolation was adopted for upscaling. Distortions in the zoomed images are more clearly visible, especially as the artefact amplification factor α increases.

Similarly to artefact amplification, larger zoom factors are better for the visual detection of artefacts, but at some point, undesirable side effects like pixelation set a limit on zooming. Due to the required cropping, only a part of the image content is retained in the zoomed images, possibly excluding image areas with more severe local distortions. In this paper, we chose a fixed zoom factor of two.

Regarding the area for cropping, we manually cropped the images based on their content, choosing a reasonable and coherent scene; for instance, cutting a face in half was avoided.

C. FLICKER (F)

The visibility of distortions can also be enhanced by making use of the flicker effect, as explained in Subsection I-E. In this scenario, a distorted image and its reference are displayed successively at a frequency of 8 Hz. As already mentioned, for a triplet comparison (i, j, k) using the flicker technique, we show two flickering images side by side: on the left, image Ii alternates with the pivot Ij, and on the right, Ik alternates with the pivot. Observers are asked to select the side that has the stronger flickering effect.

D. COMBINATIONS OF BOOSTING TYPES

With the three boosting options on hand, we can apply them individually or in combination, for example, zooming together with artefact amplification. This gives rise to seven cases that we abbreviate with the letters A, Z, and F, assigned to the boosting methods artefact amplification, zooming, and flicker, respectively. The combinations are A, Z, F, AZ, AF, ZF, and AZF. Without boosting, we obtain plain results, which serve as a reference when assessing the performance of the boosting methods. Figure 5 shows an overview of the different kinds of boosting options as applied for triplet comparison.

IV. THURSTONIAN RECONSTRUCTION FROM TRIPLET COMPARISONS

Let us consider a set of M + 1 stimuli. In our experiments, these are a source image I0 together with derived distorted images I1, . . . , IM. The magnitudes of the perceived stimulus qualities are taken to be unknown latent variables, modelled by normally distributed real-valued random variables X0, . . . , XM with variance equal to 1/2. It is the purpose of reconstruction to estimate their means µ = (µ0, . . . , µM) ∈ ℝ^(M+1) from a collection of responses to subjective triplet comparisons of stimuli. This setup corresponds to the assumptions Thurstone established as Case V in his analysis for pair comparisons [7].

In Subsection IV-A, we present formulas for the computation of the probabilities of the responses for the triplet comparisons, followed by MLE of the means in Subsection IV-B. In Subsection IV-C, we compare the reconstruction performances of MLDS, STE, and our method by means of a simulation with available ground truth data. We also compare the probabilistic model for the decision random variable in MLDS with the uncertainty of the means in the Thurstonian model. This gives rise to a choice of the unspecified variance of the MLDS decision variable such that the reconstructions of the means of the stimuli values on the perceptual scale are given in approximate JND units.

A. FORMULAS FOR THE PROBABILITIES OF THE RESPONSES

We define that for a triplet t = (i, j, k) with i, j, k ∈ {0, . . . , M}, a subjective comparison yields a response Rijk = 1 if the observer judges the left stimulus, numbered i, to be closer to the pivot stimulus j than the right one, k. Otherwise, the response is Rijk = 0.

Algorithm 2 Probability of a Response Rijk ∈ {0, 1} to a Triplet Comparison (i, j, k)
1: µ = (µ0, . . . , µM)            ▷ stimulus means in the model
2: u0 ← µk − µi
3: v0 ← (µk + µi − 2µj)/√3
4: p ← 1 − Φ(u0) − Φ(v0) + 2Φ(u0)Φ(v0)
5: if Rijk = 1 then                ▷ stimulus i closer to j than k
6:   return p
7: else                            ▷ stimulus k closer to j than i
8:   return 1 − p
9: end if

From the Thurstonian probabilistic model, it follows that observers act according to the sign of the decision variable

Zijk = |Xk − Xj| − |Xi − Xj|,   (1)

such that

Rijk = 1 if Zijk > 0, and Rijk = 0 if Zijk ≤ 0.   (2)

Next, we give an expression, based on a result of [55], for the probability that the decision variable is positive. It is a function of the unknown means µ = (µ0, . . . , µM), so we write it as the conditional probability Pr(Zijk > 0 | µ). Given a triplet comparison t = (i, j, k), the probabilities for the response Rijk = 1 (left stimulus i is closer to the pivot stimulus j than stimulus k) and the opposite, Rijk = 0, are

Pr(Zijk > 0 | µ) = 1 − Φ(µk − µi) − Φ((µk + µi − 2µj)/√3) + 2Φ(µk − µi)Φ((µk + µi − 2µj)/√3),
Pr(Zijk ≤ 0 | µ) = 1 − Pr(Zijk > 0 | µ).   (3)

Algorithm 2 summarizes the computation. Other probabilistic models differ from the above by specifying a different probability for the triplet responses. In MLDS [63], we have

Zijk = |µk − µj| − |µi − µj| + Nσ,

where Nσ denotes zero-mean Gaussian noise with standard deviation σ.

In this case,

Pr(Zijk > 0 | µ) = Φ((|µk − µj| − |µi − µj|)/σ).   (4)

The default for the parameter is σ = 1. In stochastic triplet embedding (STE) [61], the probability for a positive response is given directly as

Pr(Zijk > 0 | µ) = exp(−α(µi − µj)²) / (exp(−α(µi − µj)²) + exp(−α(µk − µj)²)).   (5)

The parameter α > 0 is not contained in the original method; thus, its default value is α = 1. Here, we have introduced it for the purpose of model calibration.

In the special case of baseline triplets of the form (i, 0, k), the response Ri0k = 1, i.e., that the left stimulus, numbered i, is closer to the pivot 0 than the right stimulus k, may also be interpreted as the judgement that the impairment in stimulus k is greater than the impairment in stimulus i. In effect, this amounts to a response to a regular pair comparison, and the probabilistic model for the decision random variable simply becomes

Pr(Zi0k > 0 | µ) = Φ(µk − µi).   (6)

The difference from the normal interpretation of a triplet comparison is that here the impairment of the pivot is fixed to be equal to 0. In the general triplet comparison, however, all stimuli are modelled as random variables.
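As an illustration, Equations (3) and (6) translate directly into code. The following sketch, ours, assumes SciPy's normal CDF for Φ; the function names are hypothetical.

```python
from scipy.stats import norm

def prob_left_closer(mu_i, mu_j, mu_k):
    """Pr(Rijk = 1 | mu): the left stimulus i is judged closer to the
    pivot j than the right stimulus k, Thurstonian model, Equation (3)."""
    u0 = mu_k - mu_i
    v0 = (mu_k + mu_i - 2.0 * mu_j) / 3.0 ** 0.5
    return 1.0 - norm.cdf(u0) - norm.cdf(v0) + 2.0 * norm.cdf(u0) * norm.cdf(v0)

def prob_baseline(mu_i, mu_k):
    """Special case (i, 0, k) with the pivot fixed at the reference,
    Equation (6): an ordinary pair comparison."""
    return norm.cdf(mu_k - mu_i)
```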

B. MAXIMUM LIKELIHOOD ESTIMATION OF THE MEANS

For the actual reconstruction by the maximum likelihood method, we take as input a finite multiset T of annotated triplets (i, j, k, Rijk), where Rijk ∈ {0, 1} is the response to the triplet comparison (i, j, k) as above. In subjective quality assessments with triplet comparisons, each triplet may be presented multiple times, collecting a response each time. Thus, T may contain multiple copies of both (i, j, k, 0) and (i, j, k, 1). To keep the notation simple, we trust that it is clear from the context what Rijk refers to in each case. Assuming that the responses are independent, the negative log-likelihood of this data under the model assumptions is given by

L(µ) = − Σ_{(i,j,k,Rijk) ∈ T} [Rijk log p + (1 − Rijk) log(1 − p)],  p = Pr(Zijk > 0 | µ).   (7)

The MLE estimate of the latent means then is given by

µ̂ = argmin_{µ = (µ0,...,µM)} L(µ).

In our experiments, we allowed a third option for triplet question responses, namely the answer ''not sure'' (see Subsection V-B). To account for such undecided responses, we simply assign the value Rijk = 1/2 to the corresponding triplets (i, j, k) and use Equation (7) as given. There are many algorithms for such nonlinear optimization problems, and generally, there is no guarantee that the global optimum will be attained. In our computations, we used the ''fmincon'' function in MATLAB, a nonlinear programming solver that finds the minimum of a constrained nonlinear multivariable function.

Solutions to this optimization are unique up to an additive constant. This constant can be chosen arbitrarily, and we have used this freedom to align all reconstructions such that the reconstructed scale value for the undistorted reference stimulus I0 is µ0 = 0. These reconstructions of impairment scales are not yet in JND units because the probabilistic model assumes Gaussian distributions with a variance of 0.5. Since one JND unit corresponds to a value of Φ⁻¹(0.75) ≈ 0.6745, we divide the results by this value to obtain impairment scales in JND units.
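Our computations used MATLAB's ''fmincon''. Purely as an illustration, the following sketch performs the same minimization of the negative log-likelihood of Equation (7) with SciPy instead, reusing prob_left_closer from the sketch in Subsection IV-A; function and variable names are ours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def reconstruct_scales(triplets, M):
    """MLE reconstruction, Equation (7). triplets: iterable of (i, j, k, r)
    with r in {0, 0.5, 1}; 0.5 encodes a ''not sure'' response.
    Returns the means in JND units, aligned such that mu_0 = 0."""
    idx = np.array([(i, j, k) for (i, j, k, _) in triplets])
    r = np.array([t[3] for t in triplets], dtype=float)

    def nll(mu):
        p = prob_left_closer(mu[idx[:, 0]], mu[idx[:, 1]], mu[idx[:, 2]])
        p = np.clip(p, 1e-12, 1.0 - 1e-12)   # guard against log(0)
        return -np.sum(r * np.log(p) + (1.0 - r) * np.log(1.0 - p))

    res = minimize(nll, x0=np.zeros(M + 1), method="L-BFGS-B")
    mu = res.x - res.x[0]                    # fix the additive constant
    return mu / norm.ppf(0.75)               # rescale to JND units (~0.6745)
```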

FIGURE 6. Reconstructions from a sample of 20 000 simulated random triplet question responses for 31 stimuli, spread over a range of 3 JND units. All three reconstruction methods yielded excellent correlations of 0.99, see Table 3, but only our reconstruction also reproduced the correct range of the means.

C. COMPARISON OF TRIPLET RECONSTRUCTION ALGORITHMS

Given that there are a number of available methods for the reconstruction of latent scale values from triplet comparisons, the question arises of which of them is the most suitable one to process subjective responses to triplet comparisons for visual quality assessment. For this purpose, we consider in this subsection simulated responses to triplet comparisons based on the Thurstonian model, i.e., the impairment scale of a distorted image is given by a normally distributed random variable with a corresponding mean and variance equal to 1/2.

For the reconstruction by MLE, we compute the likelihoods from one of the equations (3), (4), and (5), corresponding to our proposed reconstruction, MLDS, and STE, respectively. We expect that our proposed method gives the most accurate approximations because Equation (3) is directly derived from the Thurstonian model and the others are not. However, in terms of time complexity, each evaluation of the decision probability Pr(Zijk > 0 | µ) requires two evaluations of the normal CDF, while MLDS needs only one, and STE none.

For baseline triplets of the form (i, 0, k), where the pivot is given by the undistorted source image I0 as a reference, we may interpret the response also as a response to a traditional pair comparison (i, k). In this case, we can also apply the usual Thurstonian reconstruction method for pair comparison.

For our simulation, we firstly considered an artificial sequence of 31 stimuli as the ground truth, with impairments on the perceptual scale ranging from µ0 = 0 to µ30 = 2.0235, which corresponds to 2.0235/Φ⁻¹(0.75) ≈ 3 JND.




TABLE 3. Simulation for 31 stimuli over a range of 3 JND. Correlation with ground truth, and range of reconstructed means, averaged over 1000 repetitions.

We randomly sampled from the uniform distribution on the interval [0, µ30] to obtain the remaining 29 stimulus means. The triplets (i, j, k) were randomly sampled with the constraint i ≠ j, j ≠ k, k ≠ i. The baseline triplets (i, 0, k) were chosen randomly with i, k ≠ 0 and i ≠ k. In five rounds, we drew 1000 to 20 000 triplets of each kind and generated one response per triplet according to the Thurstonian probabilistic model. We then applied all triplet reconstruction methods to the responses to triplets of the general type (i, j, k). For the baseline triplets (i, 0, k), we carried out the reconstruction according to pair comparison and our proposed reconstruction. We repeated the procedure 1000 times.
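For concreteness, here is a minimal sketch (ours) of how such responses are generated under the Thurstonian model; sorting the sampled means so that indices follow increasing distortion is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth as in the simulation: mu_0 = 0, mu_30 = 2.0235, and the
# remaining 29 means drawn uniformly from [0, 2.0235].
mu = np.sort(np.concatenate(([0.0], rng.uniform(0.0, 2.0235, 29), [2.0235])))

def simulate_response(mu, i, j, k):
    """One simulated answer Rijk: sample X ~ N(mu, 1/2) per stimulus and
    apply the decision rule Zijk = |Xk - Xj| - |Xi - Xj| > 0."""
    xi, xj, xk = rng.normal([mu[i], mu[j], mu[k]], np.sqrt(0.5))
    return int(abs(xk - xj) - abs(xi - xj) > 0)
```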

The results of our simulation are presented in Figure 6 and Table 3. As expected, it is confirmed that our reconstruction does faithfully reconstruct the ground truth means from the triplet comparisons that were generated by the same Thurstonian probability model that underlies our reconstruction method. For baseline triplet comparisons, the reconstruction for pair comparisons from 20 000 ratings gave very similar results in terms of correlation as the reconstruction for our triplet comparisons, an SROCC of 0.996 (not shown in the table).

Even more notable is the finding that the other algorithms, MLDS and STE, also produced excellent results in terms of both the Pearson linear and the Spearman rank-order correlation. However, their reconstruction ranges are around 2 JND and thus well below the correct 3 JND.

To calibrate the STE and MLDS methods to give results in JND units, one could tune their parameters α and σ such that the range of reconstructed impairment values is equal to 3. For our simulation, we applied the bisection method to determine these parameters and obtained α = 0.5316 and σ = 1.6594. The resulting correlations are excellent again (see Table 3). The RMSE over all 30 reconstructed impairments is 0.1089 for MLDS and 0.0646 for STE, while for our method (without tuning a parameter), we obtained an RMSE of 0.0520.

Of course, the above procedure is not feasible in general because the ground truth range is not known as in our simulation here. One would have to resort to an estimation of a suitable parameter α for STE or σ for MLDS. To this end, we propose two approaches.

1) Estimate the range of the expected scale values. In our simulation, it was 3 JND, for example. Then proceed as in our simulation: randomly choose a sequence of scale values in the selected range, and then tune the parameter α, respectively σ, to achieve a reconstruction by MLDS, resp. STE, that matches the selected range.

2) Minimize the mean square error of the MLDS probabilities for a response Rijk = 1 by selecting σ̂,

σ̂ = argmin_{σ>0} ∫_{−1}^{1} ∫_{−1}^{1} |e(r, s | σ)|² dr ds,

where the error e(r, s | σ) is given by

e(r, s | σ) = Φ((|s| − |r|)/σ) − 1 + Φ(s − r) + Φ((r + s)/√3) − 2Φ(s − r)Φ((r + s)/√3).

Here, r = µi − µj and s = µk − µj denote the left and right differences of impairments in the triplet (i, j, k), ranging over the square domain [−1, 1] × [−1, 1]. Similarly, one can do the same for STE. In Figure 7, we visualize the probability functions according to the Thurstonian model along with their approximations in the MLDS and STE methods and the corresponding errors e(r, s | σ), resp. e(r, s | α). Inspecting this figure, it becomes apparent that a globally optimal parameter σ will be difficult to obtain, as it varies strongly locally.

In summary, all three methods produced excellent results. For baseline triplet comparisons, reconstruction by the traditional Thurstonian approach for pair comparisons (with MLE) was as good as our method for triplet reconstruction. If results on the perceptual scale are needed in JND units, then our proposed reconstruction should be applied.

V. EXPERIMENTAL SETUP: MATERIALS AND PROCEDURES

The purpose of our experimental studies was to investigate the potential and limitations of the proposed boosting strategies in the application of subjective full-reference image quality assessment. In current FR-IQA datasets, the main approaches have been DCR and PC with the reference image shown additionally, i.e., a case of baseline triplet comparison (Table 1). For both of these, boosting of the underlying image distortions can be applied. Thus, we carried out three main experiments, starting with baseline triplets, which we then extended to general triplets, finally followed by a smaller study for DCR. In the following, we refer to these as Experiments I, II, and III.




FIGURE 7. The top row shows plots of isolines of the probabilities Pr(Zijk > 0 | µ) for the Thurstonian (left), STE (middle), and MLDS (right) models. The parameters α = 0.5316 for STE and σ = 1.6594 for MLDS were obtained by ensuring that the range of the corresponding reconstructed scales is equal to 3 JND. The bottom row (center and right) shows the difference between the resulting probabilities of STE resp. MLDS and those of the Thurstonian model. The model fit for STE is much closer to the Thurstonian reference than that of MLDS. The bottom right plot shows the parameter σ for MLDS that locally yields equality with the Thurstonian probabilities.

In order to evaluate aspects like accuracy, reliability, and convergence, a large number of comparisons is beneficial. Therefore, our subjective IQA experiments were conducted via crowdsourcing. For the study, a set of original pristine source images, each distorted by various types of distortions, was selected. For each source image and each distortion type, a sequence of increasingly distorted images was generated. By means of a pilot study, we took care to calibrate our dataset such that each such image sequence uniformly spans a perceptual quality range of approximately 3 JND. In the following subsections, we briefly describe our setup and the procedure to achieve these goals.

A. SUBJECTIVE CROWDSOURCED IQA STUDY

In terms of experimental methodology, lab studies are well established and considered reliable because the experimental environment can be controlled, and the whole procedure can be monitored. On the other hand, the number of images that can be assessed is limited due to the time requirements as well as the cost. Alternatively, crowdsourcing studies are more economical, more efficient, and more scalable, and can have sufficient reliability if the setup, with a quality control mechanism included, is appropriate [64] and a suitable outlier removal strategy is employed [65]. We have installed several measures of control to ensure the validity of the results from our crowdsourcing campaigns, described in the following.

The experiments were carried out on the Amazon Mechanical Turk [66] platform, on which requesters create and submit their human intelligence tasks (HITs) for workers who carry out the subjective quality assessment. Workers receive a monetary reward for completing a HIT. Requesters specify the number of assignments for each HIT to control how many workers can submit work for that HIT.

In our experiments with triplet comparisons, a HIT consisted of 20 questions that gave rise to 20 (ternary) responses or answers from each crowd worker who completed an assignment for that HIT. For the experiment with degradation category ratings, HITs also had 20 questions each, and workers provided the corresponding ratings on a 5-point DCR quality scale.

B. INTERFACE

At the beginning of the experiment, detailed instructions were shown to the crowd workers, after which they were allowed to start doing assignments.




FIGURE 8. The interface of the triplet comparison experiments without a flickering effect. Crowd workers were asked to select the image that they think looks more similar to the pivot. They could choose ''not sure'' if they could not distinguish the differences. For the experiments with a flickering effect, the interface was the same as the one shown here, except that the three still images were replaced by two flickering images, and the question was changed to ''Which image has a stronger flickering effect?''.

• In the parts of Experiments I and II that used TC without a flickering effect (Plain, A-, Z-, and AZ-boosted TC), three images were displayed in a row, see Figure 8. Crowd workers selected the image that looked more similar to the pivot image in the middle by clicking ''left'' or ''right''. If they could not decide, a third choice, ''not sure'', was available. This option was introduced in subjective evaluations of JPEG XS image compression [67] and had been found useful to reduce subject stress and fatigue. In an earlier study on the unforced-choice paradigm in applications in audiology, it was also concluded that the efficiency of pair comparison might be compromised when participants are forced to choose between stimuli [68].

• In the other parts of Experiments I and II that used TC with a flickering effect (F-, AF-, ZF-, and AZF-boosted TC), two flickering images were displayed side by side. Crowd workers selected the image with the stronger perceived flickering effect by clicking ''left'' or ''right'', or used the ''not sure'' option.

• For Experiment III using DCR, two images were displayed side by side, with the reference image on the left and the test image on the right, see Figure 9. Crowd workers rated the distortion of the test image on a 5-point category scale ranging from 0 (imperceptible) to 4 (very annoying).

For each of the 20 questions in an assignment, crowd workers had eight seconds to enter their response. The images were shown only during the first five seconds. If no answer was given within the eight seconds, the response was labelled as ''skipped''. Thus, the total time for an assignment was 2 min 40 s.

FIGURE 9. DCR interface. Two images were placed side by side, with the reference image on the left. Crowd workers were asked to rate the distortion of the right image w.r.t. its reference on the left, using one of five categories: 0 (imperceptible), 1 (perceptible but not annoying), 2 (slightly annoying), 3 (annoying), 4 (very annoying).

C. HIT-LEVEL QUALITY CONTROL

The subjective assessment of image quality through crowdsourcing may pose some challenges due to the lack of control over the experimental environment, the lack of knowledge about the background of the workers, and the limited reliability of the experimental results. Therefore, we need to detect and filter out low-quality responses. Unreliable responses may be caused by technical problems with the workers' screens and devices, a misunderstanding of the subjective task, e.g., limited English proficiency of some international workers, or a lack of attention. In addition, some workers may try to answer quickly to maximize their payment per unit of time, resulting in responses of insufficient quality.

We ensured the quality of the workers' answers in the crowdsourcing studies by monitoring the number of questions that the workers skipped in each assignment and by including one hidden test question in each assignment. For example, to select suitable test questions for Experiment I, we proceeded as follows. Based on our pilot study (see Subsection V-H2 below), we first chose baseline triplets with distorted images on the left and right for which the perceptual difference in distortion was relatively large, about 1.75 JND. By a subsequent visual inspection, we then discarded triplet comparisons that did not seem to suggest a straightforward correct response. For Experiments II and III, we proceeded similarly.

If a worker skipped (did not answer) more than three questions in an assignment, or if the hidden test question was answered incorrectly, the assignment was rejected, and all of its responses were discarded. We did not pay the workers for their rejected assignments.

The responses from rejected assignments were re-collected by making the HITs available again for new workers. This rejection and re-collection procedure was carried out for three rounds. We did not reject any assignments in the last, smallest re-collection round, regardless of the performance on test questions and the number of provided responses. By this procedure, we filtered out low-quality responses at the assignment level and ensured that the desired number of assignments was achieved.

Accepted assignments might still contain outlier responses. In the next subsection, we describe the procedure for removing such outliers. The statistics of the rejected assignments and outliers for all experiments are provided in Table 5.

D. ROBUST HIT-LEVEL OUTLIER REMOVAL

During data collection in each experiment, unreliable assignments for HITs were discarded and penalized as described above in Section V-C. However, there may still have been undetected unreliable data left that should be identified as outliers and removed before the reconstruction of quality scales. It seems inappropriate to classify individual responses to triplet questions as outliers since the answers are not on an interval scale but ternary (''left'', ''right'', and ''not sure''). Therefore, we considered outlier removal at the HIT assignment level, which requires a multivariate method. After the quality control during each experiment, each assignment carried 16–19 answers to the 19 triplet comparisons, not counting the single test question and allowing for skipping up to three questions.

To this end, we aim at a robust multivariate outlier detection method that flags a prescribed percentage of assignments that markedly differ from the consensus given by the remaining majority of assignments. A robust approach deviates from conventional ones in that the statistics used to identify outliers do not suffer from the influence of the outliers themselves.

The most common and recommended multivariate outlier detection method in this spirit is a fast version of the minimum covariance determinant (Fast-MCD) approach [69]. However, the MCD method operates with Mahalanobis distances, which are not suitable in our case: the multivariate data are vectors of ternary decisions rather than elements of a real Euclidean vector space, and moreover, they have variable dimension (16–19).

A similar approach was given by the k-means– algorithm [70]. This modification of the classical k-means clustering algorithm takes into account a desired number of outliers, namely those data points farthest from the cluster centers. Convergence to local optima was proven. Cluster centers are given by the means of the cluster data points. For our application, the method would have to be run for a single cluster (k = 1). However, this does not work since data from HIT assignments are not simply vectors of some vector space, and these data cannot be averaged.

To remove HIT assignments that are outliers, we propose an adaptation of the above two robust detection methods.

Firstly, we define the consensus of a subset of assignments for a HIT as the reconstructed impairment scale values for all stimuli involved in the corresponding experimental study. This consensus replaces the cluster means as used in Fast-MCD and k-means–.

Secondly, we need an algorithm to compute the distance of each assignment to the consensus given by the impairment scales reconstructed from the corresponding majority of assignments. A small distance should indicate that the responses collected in a HIT assignment agree well with the reconstructed impairments. Large distances suggest strong disagreement with the consensus, and the corresponding HIT assignments may be regarded as outliers.

1) DISTANCE TO CONSENSUS FOR AN ASSIGNMENT OF TRIPLET COMPARISONS

For the case of triplet comparisons, we define this distance for an assignment as the complementary weighted true positive rate of the corresponding N responses with respect to the given consensus, as follows. Considering the n-th response, to a triplet question of type (i, j, k) for a particular reference image I0, a given distortion type, and corresponding distorted images Ii, Ij, Ik that make up the triplet question, we compare it with the corresponding impairment scale values µ̂i, µ̂j, µ̂k from the current consensus. If the answer ''left'' (resp. ''right'') for this n-th triplet question is in accordance with the consensus, we assign a score of vn = 1 to it. The answer ''not sure'' earns a score of vn = 0.5. The following table completes the definition of the score vn for the response to the n-th triplet question (i, j, k).

Response   | |µ̂k − µ̂j| ≥ |µ̂i − µ̂j| | |µ̂k − µ̂j| < |µ̂i − µ̂j|
left       |           1           |           0
right      |           0           |           1
not sure   |          0.5          |          0.5

The difficulty of triplet questions varies according to the difference between the left and right differences of impairment scales, Dl = |µ̂i − µ̂j| and Dr = |µ̂k − µ̂j|. If Dr ≈ Dl, the decision which one is perceptually smaller is hard. In this case, an answer that disagrees with the consensus should not be penalized severely. On the other hand, if |Dr − Dl| is large, a wrong decision should be penalized more strongly. Therefore, we introduce the weight

wn = |Dr − Dl| = | |µ̂k − µ̂j| − |µ̂i − µ̂j| |

for the n-th response and define the distance of the assignment w.r.t. the consensus as

d = 1 − (Σ_{n=1}^{N} wn vn) / (Σ_{n=1}^{N} wn).   (8)

We have that 0 ≤ d ≤ 1, and the maximal distance of d = 1 implies that all responses in that assignment were against the consensus of the majority.
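The score table and Equation (8) combine into a few lines. A minimal sketch (ours), assuming the consensus scale values mu_hat and the list of responses of one assignment:

```python
def distance_to_consensus(responses, mu_hat):
    """responses: list of (i, j, k, answer) with answer in
    {'left', 'right', 'not sure'}. Implements the complementary
    weighted true positive rate of Equation (8)."""
    num = den = 0.0
    for i, j, k, answer in responses:
        d_l = abs(mu_hat[i] - mu_hat[j])     # left impairment difference
        d_r = abs(mu_hat[k] - mu_hat[j])     # right impairment difference
        consensus_left = d_r >= d_l          # consensus: i closer to the pivot
        if answer == 'not sure':
            v = 0.5
        else:
            v = 1.0 if (answer == 'left') == consensus_left else 0.0
        w = abs(d_r - d_l)                   # weight: difficulty of the question
        num += w * v
        den += w
    return 1.0 - num / den if den > 0 else 0.0
```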

2) DISTANCE TO CONSENSUS FOR AN ASSIGNMENT OF DCR QUESTIONS

In the case of an assignment of DCR questions, the consensus produced by a subset of HIT assignments is given by the corresponding DMOS values for all test images involved in the experiment. In a HIT assignment with ratings rn for pairs (I^n_0, I^n_k(n)) and corresponding DMOS values µ̂(I^n_k(n)), n = 1, . . . , N, from the consensus, we define the distance of the given ratings to the consensus as the mean of the absolute differences to the consensus,

d = (1/N) Σ_{n=1}^{N} |rn − µ̂(I^n_k(n))|.   (9)

FIGURE 10. The (cropped) source images for our experiment. From upper left to lower right: source images with a resolution of 512 × 384, cropped from the images in the MCL-JCI dataset with the following IDs: SRC01, SRC03, SRC06, SRC07, SRC09, SRC17, SRC28, SRC31, SRC45, and SRC50, respectively. The green inset rectangles (256 × 192 pixels) indicate the regions used for the boosting methods with zooming.

With these definitions made, we now give the iterative, robust outlier removal algorithm.
1) Input: M HIT assignments with responses, and a target number L < M of assignments to be kept.
2) Start with the subsample of all M HIT assignments.
3) Compute the consensus of the subsample: the reconstruction of all impairment scales.
4) Compute the distances of all M assignments from the consensus with Equation (8), resp. (9).
5) Choose the L smallest distances and create a new subset.
6) Repeat steps 3 to 5 until convergence (the new subset is the same as the old one) or a timeout.
In our experiments, we removed a fraction of 5% of the HIT assignments as outliers and observed convergence in just 4–7 iterations.
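A sketch of this iteration (ours): reconstruct_scales stands in for the scale reconstruction of Section IV, and dist for the distance of Equation (8), resp. (9).

```python
def remove_outliers(assignments, keep, reconstruct_scales, dist, max_iter=50):
    """Robust outlier removal: alternate consensus reconstruction and
    re-selection of the `keep` assignments closest to the consensus."""
    subset = list(range(len(assignments)))        # step 2: start with all
    for _ in range(max_iter):                     # timeout guard
        consensus = reconstruct_scales([assignments[n] for n in subset])  # step 3
        d = [dist(a, consensus) for a in assignments]                     # step 4
        new_subset = sorted(range(len(assignments)), key=lambda n: d[n])[:keep]
        if sorted(new_subset) == sorted(subset):  # step 6: converged
            break
        subset = new_subset                       # step 5: keep the L closest
    return subset                                 # indices of retained assignments
```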

E. SOURCE IMAGES

Ten source images were selected from the MCL-JCI dataset [29], whose original resolution is 1080 × 1920. In our subjective study, this original resolution is too large to display on the screens of crowd workers. To ensure that a triplet can be displayed without image re-scaling, we manually cropped each image to 512 × 384 pixels. We chose to crop portrait-mode subimages because triplets of such images better utilize screen space. We further cropped the images to 256 × 192 pixels for the experiments with boosting by zooming and subsequently upscaled them back to 512 × 384 pixels for display. Figure 10 shows the ten (cropped) source images and the parts used for zooming.

F. DISTORTION TYPES

For our validation experiments, the source images were degraded by seven distortions, selected from [21], [71], [72]. All distortions were implemented in MATLAB, with the source code made available by the authors.

• Color Diffusion converts an image from RGB to CIELAB color space, where a 2D Gaussian smoothing kernel is used to blur the a and b channels. Its distortion magnitude is determined by the standard deviation of the kernel.

• High Sharpen applies the unsharp masking method to sharpen an image. An image is sharpened by subtracting a blurred (unsharp) version of the image from itself. Its distortion magnitude is determined by the parameter for the strength of the sharpening effect.

• Jitter warps an image according to two matrices describing random local shifts of each pixel in the horizontal and vertical directions. The amount of distortion is determined by the magnitude of the shifts.

• JPEG2000 is an image compression standard with the distortion magnitude determined by the compression ratio.

• Lens Blur performs spatial 2D filtering on each color channel of an image with a circular kernel, whose distortion magnitude is determined by the radius of the kernel.

• Motion Blur performs spatial 2D filtering on each color channel of an image with a kernel oriented at 45 degrees. The distortion magnitude is determined by the kernel size in the direction of motion.

• Multiplicative Noise adds speckle noise to an image I, obtaining I + n ⊙ I, where ⊙ denotes pixel-wise multiplication and the n(x, y) are i.i.d. random variables, uniformly distributed with zero mean and an adjustable variance. The magnitude of the distortion is determined by the variance.

Figure 11 shows one of our source images together with distorted images for all seven distortion types.²

FIGURE 11. One of the source images (upper left) is distorted by each of the seven considered types of distortion. The degree of distortion is the largest used in this study. It corresponds to the third JND w.r.t. the source image.

G. DISTORTION LEVELS: DESIGN

For each combination of a source image and a distortion type, a sequence of increasingly distorted images is to be defined.

² In a pilot study prior to this work (unpublished), we used JPEG compression, Gaussian noise, and Gaussian blur. The results showed that the boosting strategies (A-, Z-, and AZ-boosting) work well also for these types of distortions.

In previous FR-IQA datasets, only 4–6 levels of distortion were considered (Table 1). The boosting techniques are expected to help differentiate between distorted images that look the same at first glance. Thus, for our first main experiment, we strove to generate image sequences such that the perceptual difference of any two successive images is fixed at only 0.25 JND. This is a very small difference: according to the Thurstonian probabilistic model, in a 2AFC experiment, the fraction of observations that correctly identify the stimulus with better quality is Φ(0.25 Φ⁻¹(0.75)) ≈ 0.5670. The corresponding detection rate is only 2 · 0.5670 − 1 ≈ 0.1339. Our second main experiment shrunk the spacing even more, to 0.1 JND, corresponding to an expected detection rate of merely 0.0538. However, in this case, we considered only one particular type of distortion to keep the overall cost within our budget for the crowdsourcing.
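These numbers follow directly from the model and can be checked in a few lines (our snippet, using SciPy):

```python
from scipy.stats import norm

p = norm.cdf(0.25 * norm.ppf(0.75))            # correct 2AFC answers, 0.25 JND
print(p, 2 * p - 1)                            # -> 0.5670..., 0.1339...
print(2 * norm.cdf(0.10 * norm.ppf(0.75)) - 1) # -> 0.0538..., 0.1 JND spacing
```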

Having defined the psychovisual spacing of distortion in each image sequence, the question remained over what range of distortions the sequences should span. In a recent study on just noticeable differences in video sequences, it was found that the perceptual quality at the third JND is between fair and poor on the 5-point ACR scale [32]. The authors concluded that distortions stronger than 3 JND are not acceptable to today's viewers. Thus, we set the range of impairment for each image sequence to 3 JND. For content providers of high-quality media, we believe the first third of this range, from lossless compression up to 1 JND, is the most relevant.




Therefore, in our first main experiment, an image sequence consisted of the pristine original image together with 12 distorted versions at impairments of 0.25k JND, with k = 1, . . . , 12. In the second experiment, we even had 30 distorted images uniformly ranging over 3 JND. In many of our figures, we show results over these 12, respectively 30, distortion levels. Since, for all source images and all distortion types, test images at the same distortion level correspond to nearly the same perceived magnitude of distortion, we can average results over these image sequences for each distortion level.

H. TEST IMAGE GENERATION

In order to generate sequences of distorted images with the desired equal spacing on the perceptual scale, we carried out a pilot study. The overall procedure for each of the 70 image sequences (10 source images, 7 distortion types) was as follows:
1) Generate a sequence of 12 distorted images by increasing the corresponding scalar distortion parameter λ. The parameters are chosen using the method of bisection to yield an image sequence that is equally spaced according to the structural similarity index SSIM [73].
2) Run a crowdsourcing study using pair comparison with the additional display of the corresponding source image, which is equivalent to the baseline triplet comparison strategy.
3) Reconstruct the impairment scales from the comparison data in JND units. Calibrate the scales such that the impairment for the pristine source image is equal to 0. The result is a sequence of impairments, parametrized over the corresponding distortion parameter λ for the given distortion type.
4) Truncate this sequence such that only the last remaining impairment value is outside of the range from 0 to 3 JND. Fit a straight line to these data points, constrained to pass through the point for the source image. Without loss of generality, we may assume that this is the origin, (0, 0). Let the slope of the line fit be s > 0. Then the expected impairment at parameter λ according to the line fit is sλ.
5) Define the sequence of distortion parameters as λk = k/(4s), k = 0, . . . , 12 (see the sketch following this list).
6) Generate the sequence of distorted images according to these parameter settings. As a result, the impairment scale of the k-th image will be approximately 0.25k JND, and the distortions span a total range of about 3 JND.
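Steps 4 and 5 amount to a least-squares line fit through the origin. A minimal sketch (ours), assuming arrays lam of physical distortion parameters and imp of the reconstructed impairments in JND, ordered by increasing distortion; the truncation rule keeps all values up to 3 JND plus the first one beyond.

```python
import numpy as np

def distortion_parameters(lam, imp, n_levels=12, step=0.25):
    """Fit imp ~ s * lam through the origin (step 4) and return the
    parameters lambda_k = k * step / s, k = 0, ..., n_levels (step 5)."""
    over = np.flatnonzero(imp > 3.0)               # truncate: keep <= 3 JND ...
    end = over[0] + 1 if over.size else imp.size   # ... plus one value beyond
    l, m = lam[:end], imp[:end]
    s = np.dot(l, m) / np.dot(l, l)                # least-squares slope through origin
    return np.arange(n_levels + 1) * step / s
```

With step = 0.25 this yields λk = k/(4s) as in Step 5; with step = 0.1 and n_levels = 30 it yields the spacing used in the second experiment.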

Figure 12 shows an example of impairment reconstructions for one image sequence together with the line fit and the resulting choices of 12 physical distortion parameters, which we can expect to correspond to distorted images uniformly spaced 0.25 JND apart from each other.

For the second experiment, the intended spacing in impairment is 0.1 JND, and the procedure is the same as above, except for Step 5, where we set λk = k/(10s), k = 0, . . . , 30. This was done for 10 image sequences from the 10 source images, but with only one distortion type, motion blur.

In the rest of this subsection, we briefly describe the details of the selection of the baseline triplet comparisons in the pilot study, the number of collected responses, and the quality control.

1) SAMPLING STRATEGY

Our sampling strategy proceeded in three rounds of data collection. The first round was for initialization. For each of the 70 image sequences, we randomly sampled pairs of the 13 test images (including the source image) by choosing the edges of a random sparse graph with a vertex degree of six and nodes corresponding to the images. Thus, each image was randomly compared to six other images of the same sequence. For all sequences together, this resulted in 39 × 70 = 2730 triplets of images. We used baseline triplet comparisons, so the pivot image in the center of the triplets was fixed as the corresponding source image.

In the second and third rounds, we applied an active sampling strategy. A minimal spanning tree connecting the 13 nodes (i.e., test images) was produced for each sequence. The 12 edges then yielded the test images for the triplet questions and were chosen based on maximizing the expected information gain, following [74]. This resulted in 12 × 70 = 840 triplets of images in each round.

Each triplet question was presented to 30 crowd workers. Hence, overall (2730 + 2 × 840) × 30 = 132 300 responses were collected.

2) QUALITY CONTROL

The quality of the experiment was controlled as described in Section V-C and by a simplified outlier removal process. There were 1035 assignments that were rejected and recollected because the test question was answered incorrectly or more than three questions were skipped.

The outlier detection in the pilot study was chosen differently from that in the main experiment. This is because the pilot study's purpose was merely to help generate test image sequences with prescribed impairment scales. This required accurate reconstructions of these scales for the test images used in the pilot study. To identify HIT assignments with unreliable responses, we could therefore rely on the ground truth given by the ordering of the test stimuli on the physical distortion scale.

Consider the n-th baseline triplet comparison (i, 0, k) of a HIT assignment, where i and k denote the indices of the test images in a given sequence. Similar to Subsection V-D1, we gave a score un ∈ {0, 0.5, 1} to its response, as specified in the following table.

Response   | i < k | k < i
left       |   1   |   0
right      |   0   |   1
not sure   |  0.5  |  0.5




FIGURE 12. Example of the linear fitting procedure in the pilot study for the image sequence with source image SRC03 and distortion type ''High Sharpen''. On the horizontal axis, the physical distortion parameter for the selected distortion type increases linearly. 13 impairment values (blue crosses) were reconstructed, the last 4 of them being greater than 3 JND. The last 3 of these are disregarded and not shown. From the remaining 10 points, the line fit was generated. Using this result, 13 equally spaced physical distortion parameters, as shown, were obtained, which span the range of 3 JND on the perceptual impairment scale. These parameters define the twelve distortion levels for the image sequence used in Experiments I and III. For this example, a difference of one distortion level corresponds to a difference of 0.322 in the physical distortion parameter of the type ''High Sharpen''.

Note that, by construction, i ≠ k. The normalized sum of the scores in an assignment, (1/N) Σ_{n=1}^{N} un, can be called the true positive rate. The distance of an assignment w.r.t. the ground truth ordering is defined as

d = 1 − (1/N) Σ_{n=1}^{N} un.   (10)

After computing these distances to the ground truth ordering for all assignments, we sorted the assignments by distance and removed the 10% with the largest distances as outliers.

I. COMPARISON OF RESOLUTIONS OF FR-IQA DATASETS

Our dataset is the first one designed based on perceptual criteria that were assessed in a pilot study. For each combination of a source image and a distortion type, the goal was to generate a sequence of distorted images with equal increments of perceived impairment. In Part A of our dataset, this increment is 0.25 JND, and in Part B, it is 0.1 JND. In other datasets, the distortion parameters were either selected manually (e.g., LIVE, VCL@FER, CID:IQ, MDID, and KADID-10k) or according to a plan w.r.t. increments of PSNR or bitrate (e.g., TID2008 and TID2013).

For an image sequence in a fine-grained FR-IQA dataset, we expect a large number of distortion levels spread over the respective range of distortion, in our case 3 JND. To quantify and compare different FR-IQA datasets in this regard, we introduce the notion of dataset resolution. As a reference scale, we adopt the PSNR in dB. The resolution of an FR-IQA dataset then is a function of the PSNR, which indicates how close the PSNRs of consecutive images from image sequences are in the neighbourhood of the given PSNR. We chose to define resolution locally this way because of the nonlinear relationship between PSNR and perceived quality, which causes the resolution function to vary significantly, particularly for our perceptually guided FR-IQA datasets.

FIGURE 13. The resolution of FR-IQA datasets, defined by the number of distortion levels per dB on the PSNR scale, generally varies between 0.2 and 0.5 levels/dB for conventional datasets. Our new fine-grained datasets, KonFiG-IQA, Parts A and B, have much greater resolution over a large portion of the entire PSNR range. The data points (black crosses) are samples of the resolution function, computed for KonFiG-IQA, Part A. The curves in the figure were obtained by a Gaussian averaging filter of width 2 dB.

TID2013 adopted a spacing of 3 dB PSNR between consecutive images in a sequence, so its dataset resolution is 1/3 level per dB PSNR, in this case uniformly over the entire range of distortion. A resolution of 0.4 levels/dB at 28 dB PSNR means that the average length of the PSNR intervals of consecutive images in a sequence that contain 28 dB is 2.5 dB, the inverse of the resolution.

We computed the dataset resolution functions for our datasets, KonFiG-IQA Parts A and B, and six other datasets, namely LIVE, CSIQ IQA, VCL@FER, TID2013, CID:IQ, and KADID-10k. For each of these, the procedure was as follows. First, we scanned consecutive images in each image sequence of the FR-IQA dataset and collected the corresponding PSNR intervals. Secondly, we sampled the PSNR scale uniformly with a step size of 0.2 dB, and for each sample value, we averaged the lengths of all those intervals that contain the PSNR sample. The inverse of this average length is the value of the resolution function at the given PSNR sample value. Due to the discrete nature, the resolution function is only piecewise constant and appears noisy, as there are some intervals of very small size. For visualisation, we therefore show a smooth approximation obtained using a Gaussian averaging filter with a width of 2 dB PSNR.

FIGURE 14. Average reconstructed impairment scales over 70 sequences for 8 types of TC with baseline triplets. Scales from AZF-boosted TC have the largest range, almost 3 times as large as the range of plain TC.
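A sketch of this computation (ours; psnr_sequences holds one array of PSNR values per image sequence, ordered by distortion level):

```python
import numpy as np

def resolution_function(psnr_sequences, step=0.2):
    """Returns PSNR sample points and the dataset resolution in levels/dB,
    i.e., the inverse mean length of the PSNR intervals of consecutive
    images that contain each sample point."""
    intervals = []
    for p in psnr_sequences:
        for a, b in zip(p[:-1], p[1:]):
            lo, hi = min(a, b), max(a, b)
            if hi > lo:
                intervals.append((lo, hi))
    grid = np.arange(min(lo for lo, _ in intervals),
                     max(hi for _, hi in intervals), step)
    res = np.full(grid.size, np.nan)
    for n, x in enumerate(grid):
        lengths = [hi - lo for lo, hi in intervals if lo <= x <= hi]
        if lengths:
            res[n] = 1.0 / np.mean(lengths)   # inverse mean interval length
    return grid, res
```

The Gaussian smoothing used for visualisation can then be applied to the sampled values, e.g., with scipy.ndimage.gaussian_filter1d.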

Figure 13 shows the resulting resolution functions for the selected datasets and also the samples collected for KonFiG-IQA (Part A). The figure confirms that our new fine-grained datasets have much greater resolution over a large portion of the total PSNR range.

We also note that for TID2013, the design goal of 3 dB PSNR between consecutive images in each sequence (0.33 levels/dB) was not strictly followed. Its dataset resolution mainly varies between 0.2 and 0.4 levels/dB and exceeds 1 level/dB at the low end of the PSNR scale.

VI. EXPERIMENT I: BOOSTING FOR BASELINE TRIPLET COMPARISON

The purpose of our first experiment is to apply the proposed boosting methods to our FR-IQA dataset in a crowdsourcing campaign and to analyse their performance w.r.t. that achieved without boosting, as traditionally done. Here we use baseline triplet comparisons, i.e., we present two different distorted test images, Ii and Ik, together with the undistorted reference image I0 in the middle of the triplet, which is denoted by (i, 0, k). This interface corresponds to that used in TID2008, TID2013, and MDID.

We recall from the previous section that we have 10 source images, each distorted by 7 types of distortion, giving 70 image sequences, each consisting of a reference (source) image I0 and 12 increasingly distorted versions of the reference image, I1, . . . , I12. By design, the distortions span 3 JND units on the perceptual scale, so the perceptual difference between two consecutive images is 0.25 JND. For each of the 70 sequences, we applied 8 types of baseline triplet comparisons, namely the plain TC without boosting and A-, Z-, AZ-, F-, AF-, ZF-, and AZF-boosted baseline TC.

In this setup, for each sequence, there are C(13, 2) = 78 possible triplet comparisons. Since there is less information gain in responses for triplets for which the correct answer is obvious, we only considered the 68 triplets (i, 0, k) with |k − i| ≤ 8 and k ≠ i. Thus, triplets (i, 0, k) with a perceptual distance between Ik and Ii greater than 2 JND were omitted. In this way, we generated two groups of 68 × 70 × 4 = 19 040 triplets each. The first group contained those triplets that were used for comparisons without boosting by flicker. For the other group of triplets, to be used with F-boosting, a different interface had to be applied (see Section V-B), so they were collected in separate HITs in the crowdsourcing. The triplets were randomly oriented (either as (i, 0, k) or (k, 0, i)) and shuffled in each group. Finally, they were split up and distributed into crowdsourcing HITs. Each HIT consisted of 19 triplet comparisons and 1 additional triplet comparison from our pool of test questions.

We spawned 20 assignments per HIT, i.e., we collected 20 responses for each triplet comparison. We controlled the quality of the experiment and removed outliers as described in Sections V-C and V-D. In the end, 37 206 assignments of HITs with 706 914 TC responses remained for the reconstruction of impairment scales for the 70 image sequences and the analysis. For details, see Tables 4 and 5. Using the method proposed in Section IV, the perceived impairments of the images were reconstructed from the set of TC responses. We evaluated and compared the performances of the eight types of TC by several criteria, as follows.

TABLE 4. Overview of all subjective studies.

TABLE 5. Number of HIT assignments in all experiments.

A. RESULTS: RECONSTRUCTED IMPAIRMENT SCALES

Figure 14 summarizes the reconstructed impairment scales for stimuli at the 12 distortion levels. Each curve represents the average taken over all 70 image sequences of the seven distortion types. The main findings from this global view over the range of 3 JND are as follows:

• The result from plain triplet comparisons is given by the black dashed curve. It shows a linearly increasing impairment, reaching 3.1 JND at distortion level 12. Thus, Experiment I confirms our expectations from the pilot study quite well: a perceptually linearly increasing impairment over 3 JND.

• The three colored dashed curves are for the results without boosting by flicker. They are well above the baseline given by plain TC and extend the range of perceived distortion from 3 to about 4 JND, with the red curve for combined AZ-boosting yielding the largest increase. Thus, for boosting with artefact amplification, zooming, or their combination, we have gained 1 JND over the range of the first 3 JND.

• The four solid curves are for the results with boosting by flicker. These provide an additional large increase in performance. Boosting by flicker alone already extends the range of perceived distortion to 5 JND. The combinations with either zooming or artefact amplification provide about 6.5 JND in place of only 3 JND, given by traditional plain comparisons. However, the top-performing boosting is the combination of all three methods, AZF-boosting, giving close to 9 JND and providing an overall increase by a factor of almost 3 over plain comparison.

These findings hold for the averages taken over all distortion types and source images. The impairment curves for the different distortion types, averaged over the 10 source images of the sequences, show more detailed views of the results and can be found in Figure 15.

FIGURE 15. Average reconstructed JND scales as in Figure 14 for each type of distortion. Each point on any of these graphs corresponds to the mean impairment scale in JND units of the 10 source images, distorted with the respective type and at the given level.

B. RESULTS: SENSITIVITY GAIN

The strongest effect of the boosting strategies can be observed for smaller distortion levels, up to 1 JND. To quantify the performance of the boosting strategies also locally at different distortion levels, we introduce the concepts of sensitivity and sensitivity gain. In general, a sensitivity analysis determines how changes in an independent variable affect a particular dependent variable. If a differentiable function gives their functional dependence, we can use its derivative as a measure of sensitivity.

In our case, we think of the distortion level as the independent variable and the corresponding reconstructed impairment scales from the eight types of TC as dependent variables. Although the distortion levels are discrete, they correspond to equally spaced, physical, real-valued distortion parameters. It may be assumed that the resulting impairments of image quality can be modelled by a continuous and differentiable function. Given our data, we applied the 5-parameter logistic fitting [17] to the curves in Figure 14 using

β1 (1/2 − 1/(1 + exp(β2(x − β3)))) + β4 x + β5,   (11)

where x denotes the distortion level, and β1 to β5 are the parameters for the fit. In the numerical optimization procedure for the curve fitting, local optima will be obtained, depending on the choice of initial parameters. Therefore, we ran the optimization multiple times with different initial conditions and visually checked the fitting quality.

The derivatives of the fitted functions then model the observed sensitivities of the plain and boosting techniques to assess the impairment. The sensitivity is also a function of the physical distortion magnitude, respectively the distortion level, allowing a local analysis.

We define the sensitivity gain provided by a particular boosting method as the quotient of the sensitivity of that boosting method and the sensitivity of the method using plain triplet comparisons. A gain larger than 1 indicates an increase in the sensitivity by that factor due to boosting. Figure 16 (left) illustrates the procedure for the case of AZ-boosting. Firstly, it clearly shows that the curve fitting yielded visually convincing smooth functional approximations of the empirical data. Secondly, we see that AZ-boosting yielded a sensitivity gain greater than 1 for distortion levels up to 5 (corresponding to about 1.25 JND), but the method using plain baseline TC was more sensitive for larger levels than the one with AZ-boosted TC.
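For illustration, the fit of Equation (11) and the gain computation can be sketched with SciPy's curve_fit (our choice of tooling; the starting values are arbitrary):

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic5(x, b1, b2, b3, b4, b5):
    """5-parameter logistic of Equation (11)."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

def sensitivity_gain(levels, imp_boosted, imp_plain):
    """Ratio of the derivatives of the fitted impairment curves."""
    p0 = [1.0, 1.0, 6.0, 0.2, 0.0]
    pb, _ = curve_fit(logistic5, levels, imp_boosted, p0=p0, maxfev=20000)
    pp, _ = curve_fit(logistic5, levels, imp_plain, p0=p0, maxfev=20000)
    x, h = np.asarray(levels, dtype=float), 1e-4
    db = (logistic5(x + h, *pb) - logistic5(x - h, *pb)) / (2 * h)   # numerical
    dp = (logistic5(x + h, *pp) - logistic5(x - h, *pp)) / (2 * h)   # derivatives
    return db / dp
```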

The right part of Figure 16 shows the sensitivity gain as a function of the distortion level for all seven types of boosting. The curves demonstrate that the boosting methods for baseline triplet comparison are most effective for smaller distortions up to about 1 JND, which corresponds to distortion level 4. Especially for the boosting with flicker, sensitivity gains larger than 2 were achieved.

For larger distortion levels near 2 JND, a kind of saturation effect can be noticed; the gain dropped below 2 but is still larger than 1, except for A- and AZ-boosting. This indicates that perceptually, boosting small distortions of a reference image makes their difference more apparent than boosting large distortions. This effect is particularly strong for artefact amplification among the basic A-, Z-, and F-boosting techniques: its sensitivity gain drops almost linearly from 2 at distortion level 0 to 0.8 at distortion level 12. This may, in part, be explained by the nonlinearity of the boosting due to pixel value clamping, see Table 2. Naturally, for small distortions in the test images, less clamping is to be expected than in test images with large distortions, and such a deficiency does not affect the other two basic boosting types.

C. RESULTS: TRUE POSITIVE RATE

Recall from Section V-H2 that the true positive rate (TPR) for a set of baseline triplet responses is the ratio of correct answers w.r.t. the ground truth given by the ordering of the stimuli of an image sequence. In this regard, a response of type ''not sure'' scores 1/2 point. Therefore, the TPR can also be considered a criterion for validating the performance improvement by the boosting methods. A TPR of 0.5 can be achieved by guessing alone. The TPR for boosted TC is expected to be larger than for plain TC, as boosting should enable observers to make more correct decisions regarding image quality differences.

TABLE 6. True positive rates and average response times for triplet comparisons.

Table 6 shows the overall average TPRs in Experiment I (in increasing order) and confirms our expectation. The TPRs show a similar ordering as observed in the average reconstructed impairment scales at larger distortion levels, as shown in Figure 14: Plain, Z, A, AZ, F, AF, ZF, AZF.


FIGURE 16. Left: The sensitivity gain of AZ-boosted baseline triplet comparison is illustrated. The gain is the ratio of the derivatives of the fitted impairment scale functions. Thus, the gain is the factor by which an increase of perceived distortion is multiplied when AZ-boosting is applied. Here, the sensitivity gain is greater than 1 for distortion levels up to 5 but less than 1 at larger levels. This indicates that AZ-boosting is more sensitive than plain baseline TC for small distortions up to around 1.25 JND and less sensitive for larger levels. Right: Sensitivity gain for all seven types of boosted baseline triplet comparisons. It shows that boosting in baseline triplet comparisons is most effective for smaller distortions.

For a more detailed perspective, we consider comparisons as typically done to assess the JND threshold in image sequences with increasing distortion. This amounts to comparisons of distorted images with the corresponding source reference images. In terms of triplet comparisons, we therefore look at triplets (0, 0, k) (or (i, 0, 0)). We expect that the TPR increases with the distortion level k (resp. i) and that boosted TCs give rise to larger TPRs.

Figure 17 shows the TPRs for this analysis, averaged over the 70 sequences in our dataset. Corresponding detection rates increase linearly from 0% for a TPR of 0.5 to 100% for a TPR of 1, and the JND threshold on one of the curves is reached at a TPR of 75%. For the case of plain TC, we see that the JND threshold was reached at distortion level 3. The dataset was designed to have the threshold at distortion level 4, where the TPR is 0.76, still close to the expected 0.75.
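The mapping from TPR to detection rate, and the read-off of the JND threshold, can be written down directly (a sketch; the interpolation assumes a TPR curve that increases with the level):

```python
import numpy as np

def detection_rate(tpr):
    """Linear map from TPR (0.5 = guessing, 1 = perfect) to a
    detection rate between 0% and 100%."""
    return 100.0 * np.clip(2.0 * np.asarray(tpr) - 1.0, 0.0, 1.0)

def jnd_threshold(levels, tprs):
    """Distortion level at which an (increasing) TPR curve crosses 0.75."""
    return np.interp(0.75, tprs, levels)
```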

All seven types of boosting increase the TPR compared to plain TC. To give an example, consider test images having just one distortion level difference (0.25 JND). Plain comparison yielded a TPR of only 0.52, which is not much better than guessing.³ On the other hand, with combined AZF-boosting, the TPR is 0.88, a very strong improvement.

On the far end of the scale at levels 6 to 8, corresponding to distortions of 1.5 to 2 JND, the gains in TPR achieved by boosting are smaller. This is due to the saturation effect described earlier, i.e., consistent with our findings for the sensitivity

³ In Section V-G we showed that the expected TPR of the corresponding 2AFC pair comparison for a perceptual difference of 0.25 JND is 0.567, which is a bit larger than the empirical TPR of 0.52 observed here. This may be due to the basic assumption in Thurstonian models that the distributions of the latent perceptual image quality scales are perfectly Gaussian. It is unknown to what extent this assumption holds true.

FIGURE 17. For assessment of the JND threshold, distorted images are compared with the source image. The figure shows the corresponding TPRs from our triplet comparisons, averaged over all 70 image sequences. Corresponding detection rates increase linearly from 0% for a TPR of 0.5 to 100% for a TPR of 1. Boosting increases the TPR and reduces the JND threshold, which can be read off the graphs at the TPR of 0.75.

gain by boosting. These gains are much more pronounced for small distortion levels.

D. RESULTS: RESPONSE TIME
Boosting not only helps observers to find the correct answers to comparisons but also reduces their response time. Table 6


shows the response times in seconds for all methods with and without boosting, averaged over all baseline triplet comparisons and with corresponding standard deviations. A response for a plain TC required 2.3 seconds on average. All types of boosted TC were faster, with AZF-boosting requiring slightly less than 2 seconds on average. The two-sample t-test revealed that these time savings are statistically significant, with very small p-values less than 10⁻¹¹.
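Such a significance test is a one-liner with SciPy; in this sketch, the synthetic samples only stand in for the recorded per-response times:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
times_plain = rng.normal(2.3, 0.9, 5000)  # hypothetical plain-TC times (s)
times_azf = rng.normal(1.95, 0.8, 5000)   # hypothetical AZF-boosted times (s)

t_stat, p_value = stats.ttest_ind(times_plain, times_azf)
print(f"t = {t_stat:.1f}, p = {p_value:.1e}")  # tiny p: significant saving
```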

VII. EXPERIMENT II: BOOSTING FOR GENERAL TRIPLET COMPARISON
In the previous section, we noticed that there is a saturation effect for larger distortion levels, and the reason could be that boosting small distortions of two test images makes their difference stand out better than boosting large distortions with the same difference in distortion levels.

To ameliorate this drop in effectiveness of boosting for larger distortion levels, we consider general triplet comparisons (i, j, k), where the pivot image Ij is not fixed to be the undistorted reference image I0 of a sequence. Instead, we may allow arbitrary triplets with pairwise different stimuli and select the stimulus with the median distortion magnitude as the pivot.

The artefact amplification then linearly increases the differences with respect to the pivot as before, and not the distortions with respect to the pristine reference images. Typically, these image differences to be enlarged will be smaller than in Experiment I with baseline triplets. Likewise, in the flicker technique, the flicker is between distortion levels that typically are closer together than with baseline triplets. In Experiment I, we have seen larger sensitivity gains for such smaller quality differences. Thus, we expect to arrive at a sensitivity gain that is enlarged over the whole range of distortion levels, thereby reducing or even eliminating the saturation effect observed for baseline triplet comparisons.
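For illustration, pivot-based artefact amplification could be sketched as follows (the factor a is again only illustrative; this mirrors the reference-based variant shown earlier, with the pivot Ij in place of I0):

```python
import numpy as np

def boost_to_pivot(pivot, img, a=2.0):
    """Amplify the pixel differences of img w.r.t. the pivot image I_j
    (rather than the pristine reference I_0), then clamp."""
    out = pivot.astype(np.float64) + a * (img.astype(np.float64) - pivot)
    return np.clip(out, 0, 255).astype(np.uint8)
```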

The second motivation to conduct Experiment II is to assess and compare the performance of boosted versus plain TC in terms of the convergence as the number of TCs increases. On the one hand, the reconstructed impairment scales converge, and for the analysis, we consider their confidence intervals from samples of TCs as estimates for the precision of the computed impairment scales. On the other hand, for a given set of distorted images derived from a fixed reference image, the ordering of the reconstructed impairment values should converge, and the Spearman rank-order correlation (SROCC) is the appropriate measure for this analysis.

To study such convergence aspects, a very large pool of TC responses and a challenging set of image sequences are desirable. Therefore, we increased the number of distortion levels in each image sequence from 12 to 30, so that the spacing between consecutive test images is only 0.1 JND. The number of TCs per image sequence was increased from 1360 to 9585 and 29 070 for boosted and plain TC, respectively. To limit the cost of such an enlarged study, we chose to restrict Experiment II to only one type of distortion,

FIGURE 18. Experiment II (general triplet comparisons, motion blur distortion): The reconstructions of impairment scales from plain and AZF-boosted TC are shown by the curves with square and circular markers, respectively. On average, the red curve increases with a slope 7 times as large as that of the blue curve. Thus, the overall gain in sensitivity by AZF-boosting is given by a factor of about 7. The sensitivity gain function for AZF-boosting is also shown (solid red curve). It is globally larger than 5, while the corresponding sensitivity gain function from Experiment I (baseline triplet comparisons, motion blur distortion), shown by the solid green curve, is everywhere below 5.

motion blur, and to the most promising type of boosting, AZF-boosting. All 10 source images were used, resulting in 10 sequences of increasing motion blur distortion.

For the 31 images I0, I1, . . . , I30 in each of the 10 image sequences, we considered triplets (i, j, k) with i < j < k. We further limited the span of these triplets. The span of a triplet (i, j, k) is S = max(i, j, k) − min(i, j, k), which in this case is S = k − i. For boosted TC, we set the maximal span to 10, and for plain TC to 20, corresponding to 1 and 2 JND, respectively. We anticipated that small differences were harder to detect with plain TC, so larger spans for plain TC than for boosted TC would be advisable for a fair comparison. The number of triplets for each image sequence thus was $\sum_{n=2}^{S}(31 - n)(n - 1)$, which is 3230 for plain and 1065 for boosted TC.
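These counts can be verified by direct enumeration (a minimal sketch):

```python
def count_triplets(n_levels=31, max_span=20):
    """Number of triplets (i, j, k) with 0 <= i < j < k < n_levels
    and span k - i <= max_span."""
    return sum(1
               for i in range(n_levels)
               for j in range(i + 1, n_levels)
               for k in range(j + 1, min(i + max_span, n_levels - 1) + 1))

print(count_triplets(max_span=20))  # 3230 triplets for plain TC
print(count_triplets(max_span=10))  # 1065 triplets for boosted TC
```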

For each of the triplet comparisons, we collected 9 responses from the crowd workers. The presentation of the triplets was randomized in sequence and in orientation, showing either (i, j, k) or (k, j, i). The quality control and outlier removal were carried out as in Experiment I. For details regarding the numbers of HITs that were rejected or classified as outliers, see Table 5.

A. RESULTS: RECONSTRUCTION AND SENSITIVITY GAIN
We reconstructed the impairment scales for the 10 sequences from the collected responses to the plain and the AZF-boosted triplet comparisons. The results, averaged over


FIGURE 19. Average TPR for all triplets (i, j, k) and distance D = ||i − j| − |k − j|| for all 10 sources. The advantage of boosting by artefact amplification, zooming, and flickering is obvious: the easiest plain triplet comparisons are those for D = 8, and yet they are harder than the hardest AZF-boosted triplet comparisons at D = 1.

the 10 sequences, are shown in Figure 18. For plain TC, we obtained a very slowly increasing impairment, reaching 4.0 JND at distortion level 30. For AZF-boosted TC, however, we obtained a range of 30 JND. Thus, over the range of 3 JND of motion blur distortion, we recorded an average sensitivity gain of 7.5 for AZF-boosting. The same boosting for baseline triplets gave an average gain of only 1.8.

Figure 18 also displays the achieved local sensitivity gain of AZF-boosted TC over plain TC as a function of the distortion level. The gain function is shown for baseline triplets used in Experiment I (green curve) and for the general triplets used here (red curve). Over the whole range of distortion levels, the gain achieved in Experiment II is larger than 5. Moreover, the gain decreases up until level 12 and then increases again, reaching 9.6 at level 30. In contrast, for the baseline TC of Experiment I, the gain stays below 5 and slowly decreases down to 1.9.

Altogether, our first conjecture, namely that using general triplets in place of baseline triplets improves the sensitivity gain, was clearly confirmed, with an impressive tripling of the sensitivity.

B. RESULTS: TRUE POSITIVE RATE
The true positive rate is an indicator of the ease of the corresponding set of triplet comparisons. As for Experiment I, we computed the true positive rate for plain and boosted triplet comparison, in this case averaged over all triplets (i, j, k) with i < j < k and span S = k − i ≤ 10. The average TPR for the 10 sequences of motion-blurred images is 0.5583 for plain TC and 0.8810 for AZF-boosted TC.

Compared to Experiment I, we see that AZF-boosting brings about an even larger increase in overall TPR. We obtained an improvement of 0.3227 using AZF-boosted TC over plain TC with general triplets. With baseline triplets, the corresponding improvement was smaller: 0.1313 for the case of motion blur, and 0.0924 for all types of distortion on average.

Figure 19 shows a more detailed view, breaking up the averages into eight parts, based on the absolute differences D = ||i − j| − |k − j|| = 1, . . . , 8. Triplet comparisons with larger values of D can be expected to be easier to judge correctly, and this is reflected in the monotonic increase of the TPR. It is remarkable that the TPR for AZF-boosting, even for the smallest difference of just 1 level between the perceptual distances of the left and right images to the pivot image, is larger than any of the detailed TPRs for the plain triplet comparison.

C. RESULTS: CONVERGENCE IN PRECISION
Assuming that the ratios of responses ‘‘left’’, ‘‘right’’, and ‘‘not sure’’ for each TC (i, j, k) converge as the number of responses tends to infinity, we may expect that the reconstructed impairment scales for the corresponding image sequence also converge. In this subsection, we aim to assess the precision of the reconstructions for given budgets of TCs. For this purpose, we consider the 95% confidence intervals (CI) for all of the reconstructed impairment scales.

For each of the 10 image sequences with motion blur, we had collected approximately 8900 responses for plain and also for AZF-boosted TC with a span of at most 10 distortion levels. For a given budget of responses, one may create a sample of that size from each of these pools of 8900 responses for triplets, using random resampling with replacement. From each such sample, a reconstruction can follow. For each image sequence and each budget of responses, we carried out this procedure, computed the 95% confidence interval for the resulting impairment scale of each image, and recorded their lengths. We used budgets from 50 up to 10 000 responses per sequence and collected 500 resamplings each time.
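The resampling procedure can be sketched as follows (the function reconstruct is a hypothetical stand-in for the Thurstonian scale reconstruction):

```python
import numpy as np

def ci_lengths(responses, reconstruct, budget, n_resamples=500, seed=0):
    """Percentile-bootstrap lengths of the 95% CIs for the reconstructed
    impairment scales of one image sequence. `responses` is the pool of
    collected TC responses; `reconstruct` maps a response sample to one
    scale value per image."""
    rng = np.random.default_rng(seed)
    scales = np.array([reconstruct(rng.choice(responses, size=budget,
                                              replace=True))
                       for _ in range(n_resamples)])
    lo, hi = np.percentile(scales, [2.5, 97.5], axis=0)
    return hi - lo  # one CI length per distortion level
```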

For a fair comparison of the lengths of the confidence intervals derived from plain and boosted TCs, we must take the sensitivity gain of boosted TC into account. The impairment scales reconstructed from boosted TC are approximately 7.5 times larger (Section VII-A), and therefore the corresponding confidence intervals should be scaled by 1/7.5 ≈ 0.13.

Figure 20 shows the resulting (scaled) lengths of the CIs for the images at distortion levels 5, 15, and 25, averaged over the 10 sources, on a doubly logarithmic grid. For both methods, plain and boosted TC, the sizes of the confidence intervals shrink by about two orders of magnitude when the budget of responses is increased from 50 to 10 000. The precision given by reconstructions from 10 000 responses for plain TC can be achieved by only 300–400 responses with the AZF-boosted TCs. In other words, in this experiment, a single response for a boosted TC gave as much benefit in terms of resulting precision as 25 to 33 responses for plain TCs.


FIGURE 20. Comparative study on the precision of reconstructions from TCs. Impairment scales were reconstructed for each of the 10 image sequences from sets of 500 random samples of a variable number of responses for plain and AZF-boosted TCs. The figure shows the average lengths of the corresponding 10 confidence intervals for the stimuli at distortion levels 5, 15, and 25 as indicators of precision. The best achievable performance for plain TC, obtained from 10 000 responses per sequence, was surpassed by reconstructions derived from as few as 400 responses to AZF-boosted TCs.

TABLE 7. Convergence in SROCC for AZF-boosted and plain TC.

D. RESULTS: CONVERGENCE IN ORDERING
To compare the convergence of the quality assessment using boosted TCs with that for plain TCs, we followed the same approach as in the previous Section VII-C. Using resampled data for given budgets of TC responses per sequence of 31 images, we computed the impairment scale reconstructions and their corresponding SROCC w.r.t. the ground truth determined by the distortion levels of the images. For each budget size, we produced 500 resamplings and recorded the median SROCC and the corresponding 95% CI (using the percentile method). Finally, we averaged the median SROCC and the lengths of the CIs over the 10 image sequences.
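A sketch of this bootstrap, again with a hypothetical reconstruct stand-in, could read:

```python
import numpy as np
from scipy.stats import spearmanr

def srocc_bootstrap(responses, reconstruct, budget, n_resamples=500, seed=0):
    """Median SROCC (with a 95% percentile CI) between bootstrap
    reconstructions and the ground-truth ordering 0, 1, ..., N given
    by the distortion levels of one image sequence."""
    rng = np.random.default_rng(seed)
    sroccs = []
    for _ in range(n_resamples):
        sample = rng.choice(responses, size=budget, replace=True)
        scales = reconstruct(sample)  # one scale value per level
        sroccs.append(spearmanr(scales, np.arange(len(scales))).correlation)
    lo, hi = np.percentile(sroccs, [2.5, 97.5])
    return np.median(sroccs), (lo, hi)
```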

These results are listed in Table 7 and visualised in Figure 21. The SROCC for plain TC is smaller than 0.9 for all sample sizes up to 2000 responses per sequence, while for AZF-boosted TC, the SROCC is above 0.9 even for samples of only 50 responses per sequence. The SROCC of 0.9412 for plain TC using 10 000 responses is surpassed by the SROCC for boosted TC with as few as 100 responses.

We demonstrate the advantage of boosted TC over plain TC by means of an example. We pick a source image and its 30 distorted versions, labelled by motion blur distortion levels 0 to 30. Then, for each of plain and boosted TC, we choose samples of 100, 1000, and 10 000 randomly

FIGURE 21. This figure illustrates the data from Table 7. Impairment scales were reconstructed from random samples of a variable number of responses for plain and AZF-boosted TCs. We show their (median) rank-order correlation (SROCC) with the ground truth ordering, averaged over image sequences from 10 sources. The upper and lower bounds of the CI (95%) were averaged over the 10 sequences. The best achievable performance for plain TC required 10 000 responses but was beaten by reconstructions derived from as few as 100 responses to AZF-boosted TCs.

selected responses (with replacement), followed by reconstruction of the 31 impairment scales. Sorting the image labels for each reconstruction according to increasing impairment scales yields permutations of (0, 1, . . . , 30). See Table 8 for the results. The table also shows the corresponding SROCC and the number of inversions in the permutations, which expresses the quality of the orderings (lower is better).


FIGURE 22. The DMOS of the DCR study of Experiment III show a sensitivity gain when assessing the perceptual image quality with small distortions up to about 0.75 JND (corresponding to level 3). The DMOS values are averaged over 70 sequences. 95% confidence intervals were computed for the DMOS of each distorted image. For each of the distortion levels, the average length of the corresponding 70 CIs was computed. The figure shows the centered CIs with these average lengths. The solid red curve is the corresponding sensitivity gain function.

The number of inversions is equal to the number of swaps required to sort a permutation with the Bubble Sort algorithm.
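For completeness, a minimal sketch of the inversion count (the quadratic enumeration is fully adequate for permutations of length 31):

```python
def count_inversions(perm):
    """Number of pairs (a, b), a < b, with perm[a] > perm[b]; this equals
    the number of adjacent swaps Bubble Sort needs to sort perm."""
    return sum(perm[a] > perm[b]
               for a in range(len(perm))
               for b in range(a + 1, len(perm)))

print(count_inversions([0, 2, 1, 3]))     # 1: a single adjacent swap
print(count_inversions(list(range(31))))  # 0: a perfect ordering
```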

The best result for plain TC is from a sample of 10 000 TC responses and has 40 inversions and an SROCC of 0.9331. It has about the same quality as a result obtained from a sample of only 100 AZF-boosted TC responses, which has 42 inversions and an SROCC of 0.9484. With 1000 responses for boosted TCs, the ordering of the reconstruction is almost perfect, with only a single inversion left.

To summarise, in this experiment, each of the first 100 responses for AZF-boosted TC was worth more than 100 responses for plain TCs in terms of the resulting SROCC of the reconstruction. To obtain an SROCC of 0.95, our boosting method was 100 times as efficient as plain TC.

VIII. EXPERIMENT III: BOOSTING FOR DEGRADATION CATEGORY RATING
In our third and last experiment, we briefly tested the potential of boosting together with degradation category rating (DCR), which is one of the standard methods for subjective FR-IQA.

Since in the conventional DCR approach the distorted and reference images are displayed side by side, it is not appropriate to show a flickering image in such a scenario. Hence, only still images can be considered. In Experiment I, AZ-boosted TC provided the largest sensitivity gain for TC among the non-flickering boosting techniques. Therefore, here we repeated Experiment I, investigating the performance of plain and AZ-boosted comparisons applied

in the DCR setting, denoted as Plain-DCR and AZ-DCR, respectively.

The experimental setup, the quality control, and the outlier removal for this study were as described earlier in Section V and shown in Figure 9. As in Experiment I, we used all 70 image sequences from 10 sources and 7 distortion types, each containing a source reference image and 12 increasingly distorted images. The set of all Plain-DCR and AZ-DCR questions was shuffled and distributed into a sufficient number of HITs, each one also containing one test question for quality control. For each image, we collected 50 ratings. The statistics of the collected data are detailed in Tables 5 and 4.

A. RESULTS: RECONSTRUCTION AND SENSITIVITY GAIN
Figure 22 shows the DMOS for the two types of DCR, averaged over the 70 sequences. Both methods worked well. The DMOS curve for AZ-DCR is above that for Plain-DCR and spans a larger interval. Thus, on average, AZ-boosting provided an increased sensitivity, as anticipated. However, the gain in sensitivity is restricted to small distortions, up to level 3 (0.75 JND). For larger distortions, the sensitivity gain is less than 1, showing a saturation effect similar to A- and AZ-boosted baseline triplet comparison; compare Figure 16.

Note that the DMOS should ideally be equal to 0 at distortion level 0, since when the reference image is compared to itself, there is no difference between the two, and the distortion should be rated ‘‘imperceptible’’ with a score of 0. However, as seen in Figure 22, the DMOS at level 0 is not equal to 0 but even larger than 1 for both of the DCR tests. We think the reason for this outcome is that in this experiment, observers were instructed to expect distorted images on the right side, and during the work on the assignments, this expectation was almost always fulfilled. Therefore, the participants in the study may have been hesitant to declare that they could not detect any difference, so many rated the distortion for a displayed test image as ‘‘perceptible, but not annoying’’ or even worse, although it actually was identical to the reference image.

B. RESULTS: PRECISION
We evaluated the precision of the acquired DMOS results by computing their 95% confidence intervals. Figure 22 shows the results for Plain- and AZ-DCR, where we have averaged the full width of the confidence intervals over the 70 image sequences. These average confidence intervals, based on up to 50 collected ratings per test image, range from ±0.220 to ±0.295 on the 5-point DCR impairment scale. The confidence intervals for boosted AZ-DCR are slightly smaller than for Plain-DCR.

IX. RESCALING BOOSTED IMPAIRMENTS BY HYBRID TRIPLET COMPARISONS
Our boosting techniques of artefact amplification, zooming, and flickering were designed to perceptually magnify differences between compared stimuli so that human observers are enabled to distinguish fine-grained distortion levels better.


TABLE 8. Orderings from the reconstructions for source image SRC07 and motion blur distortion. Last column: Number of inversions.

The experiments of the previous sections have confirmed this intended effect.

However, boosting amounts to a nonlinear scaling of perceptual distortion, as already apparent from Figure 14. Moreover, this nonlinearity may depend on the distortion type and the content of the source images. For example, using boosting by zooming and flicker, the impairment range of 3 JND units for plain TC was stretched to 7 JND for the jitter distortion, but only to 5.5 JND for color diffusion; see Figure 15.

This nonlinear scaling of perceptual distortion could be

disregarded when building FR-IQA datasets. After all, it is a commonly accepted practice to use DCR or reconstructions from pair comparisons for subjective quality assessment, although the DMOS scale is also not perceptually linear and also not proportional to the reconstruction from pair comparisons. Moreover, as shown in Table 1, the creators of FR-IQA and FR-VQA datasets have applied various other methods for the assessment of impairment scales, but there is no agreed upon standard that provides any particular scale as a common ground that other scales can be related to. Only recently, procedures were proposed to merge impairment scale values from different methodologies (pair comparison and category rating) and from different datasets, see [75]–[77].

In this section, we propose a similar approach to transform impairment scale values from boosted triplet comparisons back to scales obtained in the traditional way without boosting perceptual discrimination power. For this purpose, we construct a (nonlinear) monotonic transformation of the boosted scales for each image sequence. Thereby, we preserve the discrimination of fine-grained distortion differences achieved by the boosting approach while simultaneously ensuring that the ranges of the transformed absolute impairment scales match those obtained when comparing distorted images without boosting. In other words, the transformed impairment values reflect the perceived qualities of the original distorted images rather than the qualities of the images with perceptually boosted distortions.

Towards this end, we propose a hybrid method by allowing for a fraction of triplet comparisons without boosting. The scale reconstruction from these plain comparisons provides a rough estimate of the desired impairment scales. We then fit a smooth scalar transformation for each sequence of distorted images that maps the boosted scales to the target scales in the least-squares sense. This transformation should be smooth and monotonic so that the high-quality relative differences of close scale values and the overall ordering of the image sequences obtained from boosted comparisons

FIGURE 23. Example of the impairment scale recalibration by the hybrid method with parameters K = 400 and α = 0.5. The data stems from the image sequence for source image SRC06 and distortion type jitter, assessed by plain and AZF-boosted triplet comparisons in Experiment I. The hybrid method fits the reconstruction from (1 − α)K = 200 boosted TCs (blue line with circles) to that from αK = 200 plain TCs (green dashed line), with the result shown by the black line with crosses. For comparison, the reconstruction from all 1360 plain TC responses is also shown (red dashed line).

are maintained. In this way, we recalibrate the scales from boosted comparisons to follow those from standard comparisons without boosting. By construction, this recalibration is adaptive w.r.t. the source image contents and the type of distortion.

In Algorithm 3, we outline the hybrid method in the form of pseudocode.⁴ Then we provide the results of applying it to the comparisons that we had collected in Experiment I.

For our implementation of the hybrid method, we chose the 5-parameter logistic function of Equation (11) as fγ. We forced it to be monotonically increasing by the constraint β1, β2, β4 ≥ 0.

Figure 23 shows the results for the example of one image sequence. For each of the 70 image sequences of Experiment I, we sampled K = 400 random triplet

⁴ In lines 8 and 9, the functions fγ, resp. fγ̂, are applied to each component of their arguments.


Algorithm 3 Hybrid Method: Re-Calibration of Scales From Boosted Triplet Comparisons
1: Input: I_0, . . . , I_N   ▷ sequence with N distorted images
2: Parameters: K, α   ▷ budget of comparisons, 0 < α < 1
3: Do (1 − α)K boosted triplet comparisons
4: Result: µ_0^boost, . . . , µ_N^boost   ▷ reconstructed scales
5: Do αK plain triplet comparisons
6: Result: µ_0^plain, . . . , µ_N^plain   ▷ reconstructed scales
7: f_γ : ℝ → ℝ   ▷ select family of monotonic functions
8: γ̂ = argmin_γ ‖f_γ(µ^boost) − µ^plain‖²   ▷ function fitting
9: Output: f_γ̂(µ^boost)   ▷ transformed impairment scales
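Steps 7–9 of Algorithm 3 might look as follows in practice. This is a sketch under assumptions: the 5-parameter logistic below is a form commonly used in IQA fitting and may differ in detail from Equation (11), which is not reproduced here.

```python
import numpy as np
from scipy.optimize import minimize

def logistic5(x, b1, b2, b3, b4, b5):
    # Assumed 5-parameter logistic; monotonically increasing
    # whenever b1, b2, b4 >= 0, matching the paper's constraint.
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

def recalibrate(mu_boost, mu_plain):
    """Fit f_gamma mapping boosted scales onto plain scales in the
    least-squares sense and return the transformed scales."""
    def loss(beta):
        return np.sum((logistic5(mu_boost, *beta) - mu_plain) ** 2)
    bounds = [(0, None), (0, None), (None, None), (0, None), (None, None)]
    beta0 = [np.ptp(mu_plain), 1.0, np.median(mu_boost), 0.0, mu_plain[0]]
    result = minimize(loss, beta0, bounds=bounds, method="L-BFGS-B")
    return logistic5(mu_boost, *result.x)
```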

FIGURE 24. Recalibration of the scale reconstruction from boosted triplet comparison of Experiment I by the hybrid method. The right point cloud (blue circles) is the scatter plot of the reconstructions from the AZF-boosted triplet comparisons versus those from plain triplet comparisons. The recalibration by the hybrid method produced the left part of the scatter plot along the diagonal, as intended for the recalibration. Note that for visual clarity, the figure truncates the reconstructions from boosted comparison to a maximum of 12 JND. There are additional points (blue circles) further to the right, which are not shown. See the main text for more details.

comparisons, half of them as plain triplet comparisons in Step 5 (α = 0.5). Since the result of the hybrid method depends on the chosen samples for the TCs, we repeated the hybrid procedure (random sampling and reconstruction, followed by recalibration) 100 times and kept only the median result w.r.t. the mean-square difference between the recalibrated reconstructed scales from boosted TC and the scales from plain TC.

Figure 24 illustrates the resulting rescaled impairment scales for all distorted images from Part A of our dataset KonFiG-IQA. In Experiment I, we had obtained 1360 responses for each type of triplet comparison and each of the 70 sequences with 12 distorted images. The figure shows two scatter plots together. The first one (blue circles) is for the scales reconstructed from a random sample of only 200 AZF-boosted TCs (K = 400, 1 − α = 0.5) versus those reconstructed from all 1360 responses per sequence from plain TCs. Due to the boosting, the resulting impairment scales are much larger than those from plain comparison.

These raw scales from (1 − α)K = 200 boosted TCs were adaptively recalibrated by the hybrid method, using an additional random sample of αK = 200 responses

TABLE 9. Recalibration for 200 AZF-boosted TCs per sequence.

(per sequence) from plain TC. As in the previous figure, we kept only the (mean-square difference) median result from recalibrated reconstructions of 101 random samples of the budget of K = 400 TC responses. The corresponding scatter plot for the recalibrated results is also included in the graph (black crosses). The recalibration translates each blue circle horizontally to the corresponding black cross. Thus, the recalibration from boosted TCs conforms to the range of scales obtained for plain TCs, i.e., represents the perceptual qualities corresponding to the original distorted images without any boosting. Table 9 gives the corresponding numerical results for the fitting procedure in the hybrid method. RMSE denotes the root-mean-square difference; MAE stands for the mean absolute difference.

We also examined the extent to which the performance of the recalibration by the hybrid method depends on the type of distortion and the choice of the source image. For this purpose, we computed the RMSE between the corresponding recalibrated scales and those derived from all 1360 responses per sequence, using plain comparison. The results are shown in Table 10. The medians of these mean-square differences indeed do not vary strongly between source images or distortion types. However, with only 200 boosted and plain TC responses per sequence of 13 images, the variation between different random samples is larger, as shown by the standard deviations listed in the table.

To judge the usefulness of the hybrid method, it is important to keep in mind that the purpose of the recalibration is not to achieve a ‘‘perfect’’ result of zero RMSE or an SROCC equal to 1, but only to match the range of the scales reconstructed from plain comparisons: due to the increased sensitivity of the boosted image quality assessment, we expect the details of the reconstructed scales from boosted TCs to be more accurate than those reconstructed from plain TCs without boosting.

X. DISCUSSION, LIMITATIONS, AND FUTURE WORK
A. OTHER OPTIONS FOR BOOSTING IN FR-IQA
To exploit the full potential of boosting in FR-IQA, other options for boosting can be investigated.

1) TYPE OF ARTEFACT AMPLIFICATION
For the artefact amplification, we have worked with the RGB color space. The RGB color space corresponds well to how images are technically displayed on devices and how colors are processed in the human visual system. However, the RGB space is not a perceptually uniform color space and is not well suited for human interaction. The HSV color space (hue, saturation, value) corresponds better to how people experience color [78]. It separates the chromatic (hue, saturation) from the achromatic (value) color components. In the context


TABLE 10. Performance of the recalibration in the hybrid method per source image and distortion type: median RMSE (and STDDEV) w.r.t. scales from 1360 plain TC per sequence.

of artefact amplification, this implies that the clamping of the value component in HSV space does not affect the color appearance, while when working in RGB space, the clamping of an R, G, or B component does. Another option is to extend the technique to a context-dependent one, which takes into account the local JND (e.g., [79]) when amplifying the image distortion at each pixel.

2) AMPLIFICATION AND ZOOM FACTORS
The benefits of our boosting strategies depend on the choices for the amplification and zoom factors together with the zoom regions. We had made the choices heuristically, based on some manual exploration. However, other amplification and zoom factors may provide improved results, depending on the content, distortion type, and test environment.

3) FLICKER FREQUENCY
In our flickering study, the displayed image buffer was swapped between the reference and the distorted image 8 times per second. In other words, the frequency of the visual signal was 4 Hz. This is different from [38], in which the temporal TCSF suggests a contrast threshold at a flickering frequency of 8 Hz. However, it should be noted that images are different from the test stimuli used in the experiments regarding the TCSF. An interesting future work, therefore, is to characterize the TCSF for our IQA application. Furthermore, buffer swap rates different from 8 times per second may increase the sensitivity for subjective IQA even more.

B. LIMITATION OF GENERAL TRIPLET COMPARISON
In Experiment II, we introduced general triplet comparisons that provided an improved sensitivity gain compared to baseline triplet comparisons, especially for larger distortions (2 to 3 JND). However, there is an important limitation of this method. Such triplet comparisons aim at capturing relations between perceptual distances of stimuli. The reconstruction of the corresponding quality scales w.r.t. a reference then relies on the assumption that the given sequence of stimuli can be modelled as a subset of a one-dimensional Euclidean space such that distances properly add up. Thus, if Ia, Ib, Ic denote three stimuli with increasing impairment scales, then we expect for the perceptual distances that d(Ia, Ib) + d(Ib, Ic) = d(Ia, Ic) holds. This assumption appears natural for each

image sequence derived for a single type of distortion, and therefore general triplet comparisons, with or without boosting, may be expected to provide meaningful results.

However, general triplet comparisons are no longer applicable if we mix several distortion types in one image sequence, as was done in MDID [13], for example. In this case, consider two images with equal impairment scale, derived from the same reference image, but with two different types of distortion, such as JPEG compression and color diffusion. Then, clearly, these two distorted images are perceptually noticeably different from each other, yet their difference in impairment is equal to 0. In [76] it was therefore suggested to rename the JND measurement units of impairment scales to ‘‘just objectionable differences’’.

The reason for this discrepancy is that a set of distorted images, derived from different types of distortions or with mixed distortions together, cannot be expected to lie on a one-dimensional continuum in image space. In our future work, we will study to what extent multidimensional scaling (MDS) methods can facilitate the application of general triplet comparisons to impairment scale reconstruction also for image sets derived from a reference image by multiple distortions. For example, one can consider maximum likelihood difference scaling (MLDS, [63]) or stochastic triplet embedding (STE, [61]) for MDS, select a suitable embedding dimension, and finally derive impairment scales from the distances in the multidimensional embedding space. The results can be compared and validated with the corresponding DMOS values.

C. OPTIMAL ALLOCATION OF TRIPLET COMPARISONS IN THE HYBRID METHOD
In Section IX we introduced a hybrid method that combined boosted and plain triplet comparisons in order to recalibrate reconstructed impairment scales to the traditional impairment ranges achieved without boosting. A fraction α of a fixed budget of K comparisons was devoted to plain comparisons. For our empirical analysis, we have used α = 0.5. The more responses we collect from boosted TCs, the better the accuracy and precision of the resulting reconstruction. But decreasing α reduces the number of auxiliary plain TCs and worsens the alignment with the scale range valid for impairment without boosting.


To investigate the tradeoff between accuracy, respectively precision, on the one side and the alignment with the ground truth result achievable without boosting on the other side, we will carry out suitable simulations and experiments. As a first step, we will rerun the computations shown in Figures 23 and 24 and Table 10 with variable parameters K and α. After defining a suitable target functional that combines accuracy, respectively precision, with alignment quality, we can estimate an optimal fraction α of plain TCs for any given budget of K TCs.

D. ADAPTIVE SAMPLING STRATEGIES
The confidence intervals of the reconstructed impairment scales generally shrink as more and more responses to triplet comparisons are collected. In our experiments, we used TCs (i, j, k) with an equal number of target responses per TC. The triplets were moderately restricted, e.g., to spans of 10 or 20 distortion levels in Experiment II. However, an adaptive sampling strategy based on information-theoretic considerations may reduce the number of responses required to achieve the same quality of the result. In such a procedure, one considers the expected information gain that can be derived from the next response for a TC or from the responses for a batch of several TCs. This expected information gain may be maximized, thereby determining the next TCs to be posted to observers.

Such adaptive sampling approaches have already proven useful for pair comparisons; see, e.g., [75] and [74]. A similar general adaptive framework for the assessment of psychometric functions is QUEST+ [80]. We propose to tailor such adaptive sampling strategies to the case of triplet comparisons, which would further improve the performance of impairment scale assessment using boosted triplet comparisons.

E. APPLICATION SCENARIO
In general, boosting strategies help to evaluate more conservative JND thresholds and increase the discrimination power of subjective image quality assessment. In the existing FR-IQA datasets, the distorted images were typically generated by applying only a few sparse distortion levels to a reference stimulus. In such cases, it may be expected that subjective FR-IQA by comparisons without boosting already provides reliable results. However, in many applications, assessing slight quality differences is desirable. For example, in video compression, fine-grained quality assessment for small impairment scales up to 1 JND would be desirable for content providers that strive to satisfy most of their consumer clients. Our boosting strategies enable faster and less expensive reconstruction for such small distortion scales. If needed, the reconstructed scores of the stimuli with boosted distortions can subsequently be mapped back to the JND scale for the stimuli without boosted distortions, as shown in Section IX. We are currently applying the flicker technique for subjective assessment of the JND and the satisfied user ratio (SUR) in image and video compression methods.

Another application of boosting strategies is robust watermarking. Recently, several methods using JND assessment have been proposed for robust image watermarking [81]–[84]. Our boosting strategies would increase the visibility of the distortions caused by a watermarking algorithm. Therefore, the watermarking algorithm can be optimized so that the watermark distortions would not be visible even when the distortion of the stimulus is boosted. Thus, this procedure would result in more robust watermarking.

A further future research direction is the use of boosting strategies for the subjective evaluation of other imaging modalities, such as stereoscopic imaging and screen content.

F. KonFiG-IQA: THE KONSTANZ FINE-GRAINED IQA DATASET
Our Konstanz Fine-Grained IQA dataset (KonFiG-IQA, Parts A and B) is publicly available online in our dataset repository http://database.mmsp-kn.de. Part A contains the 10 source images and corresponding distorted versions (7 distortion types at 12 levels, spaced at 0.25 JND), resulting in 840 distorted images in total. We supply the (MATLAB) code to boost the distortions with respect to a reference image using artefact amplification, zooming, and flicker. Part B provides the distorted images for motion blur only, however with 30 levels of distortion, spaced at 0.1 JND.

KonFiG-IQA also includes a large number of subjective responses to triplet comparisons and DCR ratings. Only those responses and ratings that remained after the data cleaning process and outlier removal are provided. For each TC response, we give a record

[source_id, distortion_type, (i,j,k), response, time_stamp, time_used, worker_id],

where response denotes the response left, not sure, or right, and worker_id is an anonymous identifier of the corresponding observer. In total, there are 706 914 and 360 544 responses to triplet comparisons for Parts A and B, respectively.

For Part A we also conducted a DCR study, yielding 360 554 ratings. For each one we provide a record

[source_id, distortion_type, distortion_level, rating, time_stamp, time_used, worker_id].

The last part of the dataset consists of the impairment scales, reconstructed from the TC responses and DCR ratings.

The source code for generating the distorted images and for reconstructing impairment scales from triplet responses will be provided on GitHub.

XI. CONCLUSION
For many applications of full-reference image quality assessment, reliable and precise subjective image quality


measurement for small distortions near and also below the just noticeable difference on the perceptual quality scale is important. In this contribution, we showed that the sensitivity of conventional assessment methods w.r.t. small changes in perceptual quality can be strongly increased by several boosting methods. For this purpose, we proposed artefact amplification, zooming, the flicker test, and their combinations.

We proposed triplet comparisons for the subjective quality assessment and provided the details required to reconstruct impairment scales for distorted images. This reconstruction is obtained by maximum likelihood estimation, based on the Thurstonian probabilistic model of image quality, and is thus compatible with the state-of-the-art method applied for classical pair comparison.

To assess the potential of the approach, we have created KonFiG-IQA, the first FR-IQA dataset designed based on perceptual criteria. Sequences of distorted images were generated with a fine-grained, perceptually equidistant spacing of distortion levels, only 0.25 JND (or 0.1 JND) apart from each other.

In a large crowdsourced field study, we collected over 1.7 million responses to triplet comparison questions. We gave a detailed analysis of this data in terms of scale reconstructions, their accuracy, convergence, detection rates, the sensitivity gain achieved by the different boosting methods, and more. The boosting methods proved to increase the discriminatory power for the fine-grained dataset, allowing to reduce the number of subjective quality comparisons while improving the accuracy of the resulting relative image impairment scales in terms of SROCC w.r.t. the available ground truth.

Increasing the perceptual sensitivity by the boosting strategies necessarily implies that the obtained impairment scales are larger than for conventional methods such as degradation category rating. However, these larger ranges also depend on the respective image contents and the type of distortion. In order to map the high-precision boosted impairment scales back to the ranges corresponding to the original distorted images without boosting, we proposed a hybrid method that applies a monotonic numerical transformation of scale values based on a few auxiliary triplet comparisons without boosting.

Our boosting techniques pave the way to fine-grained image quality datasets, allowing for an increased number of distortion levels, yet with high-quality subjective ground truth annotations facilitated by an amplified perceptual discrimination power.

REFERENCES
[1] Methodologies for the Subjective Assessment of the Quality of Television Images, document ITU-R Recommendation BT.500, 2019. [Online]. Available: https://www.itu.int/rec/R-REC-BT.500
[2] Methods for Subjective Determination of Transmission Quality, document ITU-T Recommendation P.800, 1996. [Online]. Available: https://www.itu.int/rec/T-REC-P.800-199608-I
[3] Methods for the Subjective Assessment of Video Quality, Audio Quality and Audiovisual Quality of Internet Video and Distribution Quality Television in Any Environment, document ITU-T Recommendation P.913, 2008. [Online]. Available: https://www.itu.int/rec/T-REC-P.913-201603-I
[4] B. L. Jones and P. R. McManus, ‘‘Graphic scaling of qualitative terms,’’ SMPTE J., vol. 95, no. 11, pp. 1166–1171, Nov. 1986.
[5] K.-T. Chen, C.-C. Wu, Y.-C. Chang, and C.-L. Lei, ‘‘A crowdsourceable QoE evaluation framework for multimedia content,’’ in Proc. 17th ACM Int. Conf. Multimedia (MM), 2009, pp. 491–500.
[6] M. Seufert, ‘‘Statistical methods and models based on quality of experience distributions,’’ Qual. User Exper., vol. 6, no. 1, pp. 1–27, Dec. 2021.
[7] L. L. Thurstone, ‘‘A law of comparative judgment,’’ Psychol. Rev., vol. 34, no. 4, p. 273, 1927.
[8] M. Perez-Ortiz and R. K. Mantiuk, ‘‘A practical guide and software for analysing pairwise comparison experiments,’’ 2017, arXiv:1712.03686. [Online]. Available: http://arxiv.org/abs/1712.03686
[9] L. L. Thurstone, ‘‘Equally often noticed differences,’’ J. Educ. Psychol., vol. 18, no. 5, p. 289, 1927.
[10] R. K. Mantiuk, A. Tomaszewska, and R. Mantiuk, ‘‘Comparison of four subjective methods for image quality assessment,’’ Comput. Graph. Forum, vol. 31, no. 8, pp. 2478–2491, 2012.
[11] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti, ‘‘TID2008—A database for evaluation of full-reference visual quality assessment metrics,’’ Adv. Modern Radioelectron., vol. 10, no. 4, pp. 30–45, 2009.
[12] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, and F. Battisti, ‘‘Image database TID2013: Peculiarities, results and perspectives,’’ Signal Process., Image Commun., vol. 30, pp. 57–77, Jan. 2015.
[13] W. Sun, F. Zhou, and Q. Liao, ‘‘MDID: A multiply distorted image database for image quality assessment,’’ Pattern Recognit., vol. 61, pp. 153–168, Jan. 2017.
[14] H. Men, V. Hosu, H. Lin, A. Bruhn, and D. Saupe, ‘‘Subjective annotation for a frame interpolation benchmark using artefact amplification,’’ Qual. User Exper., vol. 5, no. 1, pp. 1–18, Dec. 2020.
[15] W. S. Torgerson, Theory and Methods of Scaling. New York, NY, USA: Wiley, 1958.
[16] S. Haghiri, F. A. Wichmann, and U. von Luxburg, ‘‘Estimation of perceptual scales using ordinal embedding,’’ J. Vis., vol. 20, no. 9, p. 14, Sep. 2020.
[17] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, ‘‘A statistical evaluation of recent full reference image quality assessment algorithms,’’ IEEE Trans. Image Process., vol. 15, no. 11, pp. 3440–3451, Nov. 2006.
[18] D. M. Chandler, ‘‘Most apparent distortion: Full-reference image quality assessment and the role of strategy,’’ J. Electron. Imag., vol. 19, no. 1, Jan. 2010, Art. no. 011006.
[19] A. Zarić, N. Tatalović, N. Brajković, H. Hlevnjak, M. Lončarić, E. Dumić, and S. Grgić, ‘‘VCL@FER image quality assessment database,’’ Automatika, vol. 53, no. 4, pp. 344–354, 2012.
[20] X. Liu, M. Pedersen, and J. Y. Hardeberg, ‘‘CID:IQ—A new image quality database,’’ in Proc. Int. Conf. Image Signal Process. (ICSIP). Cham, Switzerland: Springer, 2014, pp. 193–202.
[21] H. Lin, V. Hosu, and D. Saupe, ‘‘KADID-10k: A large-scale artificially distorted IQA database,’’ in Proc. 11th Int. Conf. Qual. Multimedia Exper. (QoMEX), Jun. 2019, pp. 1–3.
[22] F. De Simone, M. Tagliasacchi, M. Naccari, S. Tubaro, and T. Ebrahimi, ‘‘A H.264/AVC video database for the evaluation of quality metrics,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Mar. 2010, pp. 2430–2433.
[23] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, ‘‘Study of subjective and objective quality assessment of video,’’ IEEE Trans. Image Process., vol. 19, no. 6, pp. 1427–1441, Jun. 2010.
[24] F. Zhang, S. Li, L. Ma, Y. C. Wong, and K. N. Ngan, IVP Subjective Quality Video Database. The Chinese Univ. Hong Kong, Hong Kong, 2011. [Online]. Available: http://ivp.ee.cuhk.edu.hk/research/database/subjective/
[25] P. V. Vu and D. M. Chandler, ‘‘ViS3: An algorithm for video quality assessment via analysis of spatial and spatiotemporal slices,’’ J. Electron. Imag., vol. 23, no. 1, Feb. 2014, Art. no. 013016.
[26] J. Y. Lin, R. Song, T. Liu, H. Wang, and C.-C. J. Kuo, ‘‘MCL-V: A streaming video quality assessment database,’’ J. Vis. Commun. Image Represent., vol. 30, pp. 1–9, Jul. 2015.
[27] Netflix. NFLX Public Dataset. Accessed: Oct. 7, 2021. [Online]. Available: https://github.com/Netflix/vmaf/blob/master/resource/doc/datasets.md
[28] N. T. Blog. (2016). Toward a Practical Perceptual Video Quality Metric. [Online]. Available: https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652


[29] L. Jin, J. Y. Lin, S. Hu, H. Wang, P. Wang, I. Katsavounidis, A. Aaron, and C.-C. J. Kuo, ‘‘Statistical study on perceived JPEG image quality via MCL-JCI dataset construction and analysis,’’ Electron. Imag., vol. 2016, no. 13, pp. 1–9, Feb. 2016.
[30] X. Shen, Z. Ni, W. Yang, X. Zhang, S. Wang, and S. Kwong, ‘‘A JND dataset based on VVC compressed images,’’ in Proc. IEEE Int. Conf. Multimedia Expo Workshops (ICMEW), Jul. 2020, pp. 1–6.
[31] H. Wang, W. Gan, S. Hu, J. Y. Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C.-C. J. Kuo, ‘‘MCL-JCV: A JND-based H.264/AVC video quality assessment dataset,’’ in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2016, pp. 1509–1513.
[32] H. Wang, I. Katsavounidis, J. Zhou, J. Park, S. Lei, X. Zhou, M.-O. Pun, X. Jin, R. Wang, X. Wang, Y. Zhang, J. Huang, S. Kwong, and C.-C. J. Kuo, ‘‘VideoSet: A large-scale compressed video quality dataset based on JND measurement,’’ J. Vis. Commun. Image Represent., vol. 46, pp. 292–302, Jul. 2017.
[33] C. Fan, Y. Zhang, R. Hamzaoui, and Q. Jiang, ‘‘Interactive subjective study on picture-level just noticeable difference of compressed stereoscopic images,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 8548–8552.
[34] X. Liu, Z. Chen, X. Wang, J. Jiang, and S. Kwong, ‘‘JND-Pano: Database for just noticeable difference of JPEG compressed panoramic images,’’ in Proc. Pacific Rim Conf. Multimedia (PCM). Cham, Switzerland: Springer, 2018, pp. 458–468.
[35] Q. Huang, H. Wang, S. C. Lim, H. Y. Kim, S. Y. Jeong, and C.-C. J. Kuo, ‘‘Measure and prediction of HEVC perceptually lossy/lossless boundary QP values,’’ in Proc. Data Compress. Conf. (DCC), Apr. 2017, pp. 42–51.
[36] J. Y. Lin, L. Jin, S. Hu, I. Katsavounidis, Z. Li, A. Aaron, and C.-C. J. Kuo, ‘‘Experimental design and analysis of JND test on coded image/video,’’ Proc. SPIE, vol. 9599, Sep. 2015, Art. no. 95990Z.
[37] B. W. Keelan and H. Urabe, ‘‘ISO 20462: A psychophysical image quality measurement standard,’’ in Proc. Image Qual. Syst. Perform., Dec. 2003, pp. 181–189.
[38] A. B. Watson, Handbook of Perception and Human Performance. New York, NY, USA: Wiley, 1986, ch. Temporal Sensitivity, pp. 6.1–6.43.
[39] D. M. Hoffman and D. Stolitzka, ‘‘A new standard method of subjective assessment of barely visible image artifacts and a new public database,’’ J. Soc. Inf. Display, vol. 22, no. 12, pp. 631–643, Dec. 2014.
[40] M. A. Elgharib, M. Hefeeda, F. Durand, and W. T. Freeman, ‘‘Video magnification in presence of large motions,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4119–4127.
[41] A. Davis, M. Rubinstein, N. Wadhwa, G. J. Mysore, F. Durand, and W. T. Freeman, ‘‘The visual microphone: Passive recovery of sound from video,’’ ACM Trans. Graph., vol. 33, no. 4, pp. 1–10, Jul. 2014.
[42] O. Johnston and F. Thomas, The Illusion of Life: Disney Animation. New York, NY, USA: Disney Editions, 1981.
[43] H. Men, H. Lin, V. Hosu, D. Maurer, A. Bruhn, and D. Saupe, ‘‘Visual quality assessment for motion compensated frame interpolation,’’ in Proc. 11th Int. Conf. Qual. Multimedia Exper. (QoMEX), Jun. 2019, pp. 1–6.
[44] Information Technology—Advanced Image Coding and Evaluation—Part 2: Evaluation Procedure for Visually Lossless Coding, document ISO/IEC 29170-2, 2015.
[45] R. S. Allison, L. M. Wilcox, W. Wang, D. M. Hoffman, Y. Hou, J. Goel, L. Deas, and D. Stolitzka, ‘‘75-2: Invited paper: Large scale subjective evaluation of display stream compression,’’ in SID Symp. Dig. Tech. Papers, vol. 48, no. 1, 2017, pp. 1101–1104.
[46] R. S. Allison, K. Brunnström, D. M. Chandler, H. R. Colett, P. J. Corriveau, S. Daly, J. Goel, J. Y. Long, L. M. Wilcox, Y. M. Yaacob, S.-N. Yang, and Y. Zhang, ‘‘Perspectives on the definition of visually lossless quality for mobile and large format displays,’’ J. Electron. Imag., vol. 27, no. 5, Oct. 2018, Art. no. 053035.
[47] Call for Proposals for a Low-Latency Lightweight Image Coding System, document ISO/IEC JTC1/SC29/WG1 (ITU-T SG16), JPEG, 2016.
[48] A. Willème, S. Mahmoudpour, I. Viola, K. Fliegel, J. Pospíšil, T. Ebrahimi, P. Schelkens, A. Descampe, and B. Macq, ‘‘Overview of the JPEG XS core coding system subjective evaluations,’’ Proc. SPIE, vol. 10752, Sep. 2018, Art. no. 107521M.
[49] A. Sudhama, M. D. Cutone, Y. Hou, J. Goel, D. Stolitzka, N. Jacobson, R. S. Allison, and L. M. Wilcox, ‘‘85-1: Visually lossless compression of high dynamic range images: A large-scale evaluation,’’ in SID Symp. Dig. Tech. Papers, vol. 49, no. 1, May 2018, pp. 1151–1154.
[50] V. Thirumalai, J. Ribera, J. Xiang, J. Zhang, M. Azimi, J. Kamali, and P. Nasiopoulos, ‘‘P-23: A subjective method for evaluating foveated image quality in HMDs,’’ in SID Symp. Dig. Tech. Papers, vol. 51, no. 1, 2020, pp. 1415–1418.
[51] S. S. Mohona, L. M. Wilcox, and R. S. Allison, ‘‘Subjective assessment of display stream compression for stereoscopic imagery,’’ J. Soc. Inf. Display, vol. 29, no. 8, pp. 592–607, Mar. 2021.
[52] H. Lin, M. Jenadeleh, G. Chen, U.-D. Reips, R. Hamzaoui, and D. Saupe, ‘‘Subjective assessment of global picture-wise just noticeable difference,’’ in Proc. IEEE Int. Conf. Multimedia Expo Workshops (ICMEW), Jul. 2020, pp. 1–6.
[53] W. S. Torgerson, ‘‘Multidimensional scaling: I. Theory and method,’’ Psychometrika, vol. 17, no. 4, pp. 401–419, Dec. 1952.
[54] R. D. Luce and E. Galanter, Handbook of Mathematical Psychology. New York, NY, USA: Wiley, 1963, ch. Psychophysical Scaling, pp. 245–307.
[55] D. M. Ennis, K. Mullen, and J. E. R. Frijters, ‘‘Variants of the method of triads: Unidimensional Thurstonian models,’’ Brit. J. Math. Stat. Psychol., vol. 41, no. 1, pp. 25–36, May 1988.
[56] W. H. Press, B. Flannery, S. Teukolsky, and W. Vetterling, ‘‘Least squares as a maximum likelihood estimator,’’ Numer. Recipes Fortran, Art Sci. Comput., vol. 2, pp. 651–655, 1992.
[57] F. A. Wichmann and N. J. Hill, ‘‘The psychometric function: I. Fitting, sampling, and goodness of fit,’’ Perception Psychophys., vol. 63, no. 8, pp. 1293–1313, Nov. 2001.
[58] L. T. Maloney and K. Knoblauch, ‘‘Measuring and modeling visual appearance,’’ Annu. Rev. Vis. Sci., vol. 6, no. 1, pp. 519–537, Sep. 2020.
[59] C. Charrier, L. T. Maloney, H. Cherifi, and K. Knoblauch, ‘‘Maximum likelihood difference scaling of image quality in compression-degraded images,’’ J. Opt. Soc. Amer. A, Opt. Image Sci., vol. 24, no. 11, pp. 3418–3426, 2007.
[60] C. Charrier, K. Knoblauch, L. T. Maloney, A. C. Bovik, and A. K. Moorthy, ‘‘Optimizing multiscale SSIM for compression via MLDS,’’ IEEE Trans. Image Process., vol. 21, no. 12, pp. 4682–4694, Dec. 2012.
[61] L. van der Maaten and K. Weinberger, ‘‘Stochastic triplet embedding,’’ in Proc. IEEE Int. Workshop Mach. Learn. Signal Process., Sep. 2012, pp. 1–6.
[62] G. T. Fechner, D. H. Howes, and E. G. Boring, Elements of Psychophysics, vol. 1. New York, NY, USA: Holt, Rinehart and Winston, 1966.
[63] L. T. Maloney and J. N. Yang, ‘‘Maximum likelihood difference scaling,’’ J. Vis., vol. 3, no. 8, p. 5, Oct. 2003.
[64] D. Saupe, F. Hahn, V. Hosu, I. Zingman, M. Rana, and S. Li, ‘‘Crowd workers proven useful: A comparative study of subjective video quality assessment,’’ in Proc. IEEE Int. Conf. Qual. Multimedia Exper. (QoMEX), 2016, pp. 1–2.
[65] F. Ribeiro, D. Florencio, and V. Nascimento, ‘‘Crowdsourcing subjective image quality evaluation,’’ in Proc. 18th IEEE Int. Conf. Image Process., Sep. 2011, pp. 3097–3100.
[66] Amazon Mechanical Turk (AMTurk). Accessed: Oct. 7, 2021. [Online]. Available: https://www.mturk.com/
[67] D. McNally, T. Bruylants, A. Willème, T. Ebrahimi, P. Schelkens, and B. Macq, ‘‘JPEG XS call for proposals subjective evaluations,’’ Proc. SPIE, vol. 10396, Sep. 2017, Art. no. 103960P.
[68] J. L. Punch, B. Rakerd, and A. M. Amlani, ‘‘Paired-comparison hearing aid preferences: Evaluation of an unforced-choice paradigm,’’ J. Amer. Acad. Audiol., vol. 12, no. 4, pp. 190–201, 2001.
[69] C. Leys, O. Klein, Y. Dominicy, and C. Ley, ‘‘Detecting multivariate outliers: Use a robust variant of the Mahalanobis distance,’’ J. Experim. Social Psychol., vol. 74, pp. 150–156, Jan. 2018.
[70] S. Chawla and A. Gionis, ‘‘K-means–: A unified approach to clustering and outlier detection,’’ in Proc. SIAM Int. Conf. Data Mining, May 2013, pp. 189–197.
[71] V. Hosu, H. Lin, and D. Saupe. (2018). IQA-Experts-300: A Professional Photographer Annotated IQA Database. [Online]. Available: http://database.mmsp-kn.de
[72] V. Hosu, H. Lin, and D. Saupe, ‘‘Expertise screening in crowdsourcing image quality,’’ in Proc. 10th Int. Conf. Qual. Multimedia Exper. (QoMEX), May 2018, pp. 1–6.
[73] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, ‘‘Image quality assessment: From error visibility to structural similarity,’’ IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[74] J. Li, R. Mantiuk, J. Wang, S. Ling, and P. Le Callet, ‘‘Hybrid-MST: A hybrid active sampling strategy for pairwise preference aggregation,’’ in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2018, pp. 3475–3485.
[75] P. Ye, J. Kumar, L. Kang, and D. Doermann, ‘‘Unsupervised feature learning framework for no-reference image quality assessment,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1098–1105.

138974 VOLUME 9, 2021

Page 37: Subjective Image Quality Assessment With Boosted Triplet

H. Men et al.: Subjective IQA With Boosted TC

[76] M. Perez-Ortiz, A. Mikhailiuk, E. Zerman, V. Hulusic, G. Valenzise, andR. K. Mantiuk, ‘‘From pairwise comparisons and rating to a unified qualityscale,’’ IEEE Trans. Image Process., vol. 29, pp. 1139–1151, 2020.

[77] A. Kaipio, M. Ponomarenko, and K. Egiazarian, ‘‘Merging of MOS oflarge image databases for no-reference image visual quality assessment,’’in Proc. IEEE 22nd Int. Workshop Multimedia Signal Process. (MMSP),Sep. 2020, pp. 1–6.

[78] K. N. Plataniotis and A. N. Venetsanopoulos, Color Image Processing andApplications. Springer, 2013.

[79] J. Wu, L. Li, W. Dong, G. Shi, W. Lin, and C.-C. J. Kuo, ‘‘Enhanced justnoticeable difference model for images with pattern complexity,’’ IEEETrans. Image Process., vol. 26, no. 6, pp. 2682–2693, Jun. 2017.

[80] A. B. Watson, ‘‘QUEST+: A general multidimensional Bayesian adaptivepsychometric method,’’ J. Vis., vol. 17, no. 3, p. 10, Apr. 2017.

[81] W. Wan, K. Zhou, K. Zhang, Y. Zhan, and J. Li, ‘‘JND-guided perceptu-ally color image watermarking in spatial domain,’’ IEEE Access, vol. 8,pp. 164504–164520, 2020.

[82] J. Wang andW.Wan, ‘‘A novel attention-guided JNDmodel for improvingrobust image watermarking,’’Multimedia Tools Appl., vol. 79, nos. 33–34,pp. 24057–24073, Sep. 2020.

[83] K. Zhou, Y. Zhang, J. Li, Y. Zhan, and W. Wan, ‘‘Spatial-perceptualembedding with robust just noticeable difference model for color imagewatermarking,’’Mathematics, vol. 8, no. 9, p. 1506, Sep. 2020.

[84] L. Qin, X. Li, and Y. Zhao, ‘‘A new JPEG image watermarking methodexploiting spatial JND model,’’ in Proc. Int. Workshop Digit. Watermark-ing (IWDW). Cham, Switzerland: Springer, 2019, pp. 161–170.

HUI MEN received the master's degree from the Department of Computer Science, University of Saarland, Germany, in 2016, and the Ph.D. degree from the Department of Computer and Information Science, University of Konstanz, Germany, in 2021. She is currently a Postdoctoral Researcher with the Department of Psychology, University of Marburg, Germany. Her research interests include visual quality assessment, crowdsourcing, frame interpolation, and visual perception.

HANHE LIN received the Ph.D. degree from the Department of Information Science, University of Otago, New Zealand, in 2016. He is currently a Postdoctoral Researcher with the Department of Computer and Information Science, University of Konstanz, Germany. His research interests include machine learning and deep learning-based applications, visual quality assessment, and crowdsourcing.

MOHSEN JENADELEH (Member, IEEE) received the Dr. rer. nat. degree from the Department of Computer and Information Science, University of Konstanz, Germany, in 2019. He is currently a Postdoctoral Researcher with the University of Konstanz. His research interests include image processing, visual perception, machine learning, and crowdsourcing.

DIETMAR SAUPE was born in Bremen, Germany, in 1954. He received the Dr. rer. nat. degree in mathematics from the University of Bremen, Germany, in 1982.

From 1985 to 1993, he was an Assistant Professor with the Departments of Mathematics, first at the University of California at Santa Cruz, Santa Cruz, USA, and then at the University of Bremen, where he completed his habilitation. From 1993 to 1998, he was a Professor of computer science at the University of Freiburg, Germany, then at the University of Leipzig, Germany, until 2002, and since then at the University of Konstanz, Germany. He is the coauthor of the book Chaos and Fractals (Springer-Verlag, 1992), which won the Association of American Publishers Award for Best Mathematics Book of the Year, the book The Science of Fractal Images (Springer-Verlag, 1988), and well over 100 research articles. His research interests include image and video processing, computer graphics, scientific visualization, dynamical systems, and sport informatics.
