MAXENT ROC 1
Running head: MINIMAL SDT
Bits of the ROC: Signal Detection as Information Transmission
Peter R. Killeen & Thomas J. Taylor
Arizona State University
Correspond with:
Peter Killeen
Department of Psychology
Box 1104
Arizona State University
Tempe, AZ 85287-1104
email: [email protected]
Fax: (480) 965-8544
Voice: (480) 965-2555
Abstract
The framework for detection and discrimination called Signal Detection Theory (SDT) is reanalyzed from an information-theoretic perspective. Receiver-operating characteristics (ROCs) for the iso-informative processor describe arcs in the unit square, and lie close to those described by constancy of A'. Necessary and sufficient conditions for performance to fall on these arcs are that the payoff matrix be honest, and that as bias shifts, changes in the information expected from yes responses are complemented by those from no responses. Asymmetric ROCs require further characterization of the underlying distributions. The simplest maximum-entropy distributions on the evidence axis are exponential, and yield power-law ROCs that describe many data. The success of ROCs constructed from confidence ratings shows that more information is available from the signal than the experimenter's binary classification lets pass. Such ratings comprise category scaling of signal strength. Power operating characteristics are consistent with Weber's law. Where Weber's law holds, channel capacity on a dimension equals the logarithm of its Weber fraction.
Bits of the ROC:
Signal Detection as Information Transmission
Signal Detection Theory (SDT) and Information Theory (IT) were the jewels in the crown of 20th-century experimental psychology. As an avatar of statistical decision theory, SDT provided a technique for reducing a 2x2 table of relations between stimulus and response into measures of detectability and bias, of sensitivity and selectivity, thereby untangling a 100-year-old confound. It has been well popularized by Swets (e.g., Swets, Dawes & Monahan, 2000) and thoroughly analyzed by Macmillan and Creelman (1991). Information theory, the brilliant invention of Claude Shannon (Shannon, 1949), provided algorithms for quantifying the amount of information transmitted by a response. Although both theories were formulated at the same time (the late 1940s), and both concern similar phenomena (quantifying the accuracy of imperfect discriminations), there has been very little use of one theory to reinforce and complement the other. At the end of the 20th century, SDT remains an important theory while IT is rarely mentioned.
Classic SDT (here, CSDT) is an application of Thurstone scaling (see, e.g., Juslin & Olsson, 1997; Lee, 1969; Luce, 1977; Luce, 1994). Whereas many stimuli are susceptible to measurement in physical units, such as decibels of intensity, some are not. These latter are often of most central concern to society, involving measures of complex stimuli such as beauty, quality of life, or impact of punishment. Thurstone suggested a metric for the distance between stimuli that would embrace both the simple and complex: The unit of distance between stimuli would be the standard deviation of the percept associated with the stimulus. Thurstone called the distribution of perceptions issuing from a stimulus a "discriminal process", and its standard deviation σ the "discriminal dispersion". Two such processes are shown in Figure 1, issuing from two stimuli.
++ Figure 1 (discrim disp) and Table 1 (calc) about here ++
Tables 1 and 2 give the data from which such machinery is inferred. Table 1 shows the joint probabilities of signals and responses. In Table 2, these are divided by the row marginals to give the conditional probabilities of responding "A" or "B" given the stimulus value.
Although here viewed as a symmetric discrimination, the origin of CSDT was in detection tasks, where Sa was the background, or noise, stimulus. This gave rise to the terminology of Correct Rejection (CR) for responses in the RaSa cell, and Misses (M) for responses in the RaSb cell.
++ Tables 2 & 3 about here ++
Table 2 describes the performance from the perspective of an experimenter, one who knows the stimulus condition and assays the probability of a response. It is useful for characterizing the behavior of the detector. In the applications of SDT, we are usually given the response--the verdicts of radiologists, juries, and children who cry "wolf"--and wish to know the probability that a signal was in fact present. This requires dividing the cells in Table 1 by their column marginals, yielding Table 3. One may go between Tables 2 & 3 by using Bayes' theorem to convert the arguments of a conditional probability. Table 3 is more user-friendly, in that consumers of SDT analyses are seldom given the state of nature, and wish to evaluate that state, not characterize the detector. The impact of these different conditionalizations can be quite different: A child who always cries "wolf" when the prevalence of wolves is 1% will be assigned a hit rate of 100% by the conventional Table 2, but only 1% by Table 3. No matter how Table 1 is conditionalized, it has two degrees of freedom, and no evaluation of a discrimination is complete without reporting both the accuracy of affirmatives and the accuracy of negatives. Table 3 is often more convenient for information-theoretic analyses.
CSDT invokes normal discriminal processes to translate the probabilities in Table 2 into the two measures of theoretical interest (d' and C). Other distributions are reviewed by Egan (1975). The data are often consistent with these assumptions. However, the discriminal processes are never observed, and carry more degrees of freedom than the data they represent. There are five degrees of freedom available in constructing Figure 1: the means and standard deviations of the signal and noise distributions, and the location of the criterion. These over-endowed distributions are then slimmed down by stipulating an origin and unit for the perceptual abscissae. The origin is set either at the mean of the first percept, or halfway between the means of the two percepts (as inferred from the data). The unit--the standard deviation--is set to 1.0. This leaves the scale value of the second stimulus, d', and the location of the criterion, C, as the recovered parameters that re-present the information found in the hit and false alarm rates. For the data in Table 1, d' = z(H) - z(F) = 0.385 - (-0.675) = 1.06. If the origin is placed at the mean of the two distributions, then C = -(z(H) + z(F))/2. Macmillan and Creelman (1991, Equation 2.2) include the factor 1/2 so that the range of the bias statistic is the same as that of d'. For the data in Table 2, C = -(0.385 - 0.675)/2 = 0.145. As shown in Figure 1, the criterion is slightly above the mean of the percepts, indicating a conservative criterion: The observer is more likely to reject a marginal perception as noise than to accept it as a signal.
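These computations are easy to sketch in code. The following minimal Python fragment is ours, not the authors'; the hit and false-alarm rates of .65 and .25 are the values implied by z(H) = 0.385 and z(F) = -0.675:

```python
from statistics import NormalDist

def dprime_and_c(hit_rate, fa_rate):
    """d' = z(H) - z(F); C = -(z(H) + z(F))/2 (Macmillan & Creelman's form)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate), -(z(hit_rate) + z(fa_rate)) / 2

# rates implied by z(H) = 0.385 and z(F) = -0.675
d, c = dprime_and_c(0.65, 0.25)
print(round(d, 2), round(c, 3))  # → 1.06 0.145
```

Note that c here includes the 1/2 factor discussed above, so its range matches that of d'.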
CSDT was a trail-breaking innovation. Now standing near the summit, a glance back shows that CSDT did not pick out the most direct route to the goal of representing discrimination performance. Too much that is unverifiable is assumed, only to be later nullified. Alternative nonparametric measures of sensitivity and bias have been developed out of a condign sense of parsimony. Macmillan and Creelman (1996) reviewed these alternatives in an article whose title hedged "nonparametric", because the measures reviewed either made subtle assumptions about underlying distributions or mechanisms--or were at least consistent with such distributions.
Subsumption Psychophysics
In this paper we make assumptions about mechanisms and distributions in incremental fashion, in the style of Brooks (1991), who coined the term subsumption architecture to describe such a bottom-up approach: Build until it breaks, then see what additional structure is necessary. The conceptual tool that permits this approach to be applied to signal detection theory is the maximum entropy formalism (MEF; Jaynes, 1986; Skilling, 1989; Tribus, 1969). The MEF stipulates that inference should be based on statements of everything known, with all other constraints maximally random (i.e., having maximum entropy). If they are not maximally random, then we are implicitly making additional inferences about their nature. It is the goal of MEF to make all such knowledge explicit, leaving nothing hidden in implicit assignments of parameters or distributions. We instantiate the MEF for detection by (a) describing a detection theory that makes no assumptions concerning underlying distributions; then (b) describing one that invokes underlying one-parameter distributions of signal strengths; and then (c) describing one that invokes underlying two-parameter distributions of signal strength.
Minimal SDT
Assume that observers attempt to maximize performance, given the constraint imposed by limits on the information available from the stimuli. This goal is equivocal until "maximal performance" is defined.

Value. Whenever an experimenter stipulates proper behavior for an observer (e.g., "Respond B only when you are sure you have observed the stimulus"), they are imposing an index of merit for their performance. Often this is vague, as in the example given. One of the many important contributions of CSDT was its emphasis on the explicit assignment of indices of merit to performance. An example is given in Table 4, where the entries indicate the values assigned to each of the outcomes. For instance, an experimenter may provide points, convertible into goods, for performance in the following manner: v(CR) = v(H) = +5; v(F) = -3; v(M) = -1. This would generate a slightly conservative bias in subjects attempting to maximize their expected payoff. The perceived utility of the goods is, however, often a nonlinear function of the points assigned (Kornbrot, Donnelly & Galanter, 1981). Some subjects, motivated by a sense of propriety that outweighs the payoff matrix, may attempt to maximize their percent correct. Because of the potential ambiguity of what subjects may be maximizing, point predictions are seldom made. Instead, what is predicted is the nature of the curve that describes the locus of points p(H), p(F) in the unit square, and the changes in parameters of that curve, or of the observer's location on it, with changes in the payoff matrix or the discriminability of the stimulus (see Figure 2).
++ Insert Figure 2 (ROC) and Table 4 (payoff) around here +++
Symmetry. Consider the case in which there is no reason to think that Sa is qualitatively different from Sb, so that it is arbitrary which is called A and which B, and thus arbitrary which conditional probability in Table 2 is called hit and which correct rejection. Switching those labels gives the open circle shown in Figure 2 as an equally valid locus for the data; it is where the data would be found if the only important distinction between the two stimuli were the labels the experimenter assigned to them, and those could be arbitrarily reassigned. What these two data points have in common is that they convey an equal amount of information from the stimulus through the observer to the experimenter. We now generalize this relation.
Information
The related concepts of randomness (entropy) and its reduction (information) have been given explicit formulation only in this century by Brillouin, Cox, Jaynes, Wiener, and most importantly, Shannon. Brief histories of these ideas by major contributors are Tribus (1979) and Jaynes (1979). In particular, Jaynes reformulated both statistical mechanics and inferential statistics using MEF and Bayes's Theorem. Because information is the central concept in this regrounding of SDT, it requires explication.

Entropy is a thermodynamic measure of disorder, which changes as a function of the energy added to a system relative to its temperature. It is intimately related to information, which is a measure of the reduction of entropy by some operation. Shannon's (1949) key insight was the development of entropy theory for the measurement of information transmission. "Shannon's paper ... takes its rightful place alongside the works of Newton, Carnot, Gibbs, Einstein, Mendeleev and the other giants of science on whose shoulders we all stand" (Tribus, 1979, p. 10).

Information is a relative concept; it is relative to context, and to the state of the receiver. A coded message may look completely random until we are given a key (context), which permits the extraction of useful information. The amount of information is not the same as its value. Small amounts of information may be of greater value than that derivable from encyclopedias: Lamps in a belfry may be inscrutable without the key "One if by land, two if by sea", in which context they provide approximately one bit of very important information. They would provide less than a bit if the route of invasion were already known, or known with some probability; they would provide more than a bit if the timing or color or brightness of the signal conveyed additional information, such as distance or magnitude of the force.
Because information is defined as change--either as a difference in discrete systems or as a differential in continuous systems--it is a process/behavioral construct, rather than a content/cognitive construct. Information does not reside in the source, nor in the message, nor in the communication channel, nor in the receiver. It resides nowhere. Information is the reduction in the uncertainty (entropy) of a response by use of a stimulus. Books do not contain information. Books may reduce the uncertainty--entropy--of the reader's response to questions such as "who killed the white whale". The book informs the response, but does not contain information. The book is a key that permits the student to decipher the correct answer to the question. The font of the text is uninformative, unless the question concerns typography. The number of chapters is uninformative, unless the game is trivial pursuit. No specification of the information value of a stimulus such as a book can be made without knowledge of the range of possible questions and answers (their entropy), and the degree to which the answers are less random than they would be without that stimulus. To say a book is informative means that it will permit the reader to respond to a wide range of relevant (to the reader) questions in a non-random manner. An observer in a psychophysical experiment does not so much have information, but rather transmits information from stimulus to the experimenter, who measures it by evaluating Table 1. Entropy tells us how much variability a system, or parts of it, contains. Information is a relation between the entropy of two or more parts of a system: it tells us how much of the variability in one part is correlated with the variability in another.
Measuring information. The Shannon-Wiener measure of the information transmitted from stimulus to response, the transinformation (T), is the amount by which the channel reduces the maximum entropy--or uncertainty (U)--in a stimulus-response matrix. If there were no correlation between stimuli and responses (i.e., no shared information), the cell entries in Table 1 could be predicted by multiplying the marginal probabilities, obtaining p(CR) = p(M) = .275; p(F) = p(H) = .225. This is the same tactic used to calculate the expected cell entries in calculating a chi-square statistic. To calculate the entropy of such an uninformative matrix, apply the classic information transformation to the cells:

1.  U(x) = -Σi pi log2(pi)

and then reapply it to the actual matrix. The difference between these is the information transmitted by the response. This may be concisely written as:

T = Σi Σj p(i,j) log2[p(i,j)/(pi pj)]
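As a sketch, the following Python fragment computes T directly from a joint table. The joint probabilities shown are a hypothetical reconstruction consistent with the marginals quoted above, since Table 1 itself is not reproduced here:

```python
import math

def transinformation(joint):
    """Mutual information (bits) between the rows and columns of a joint table."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    t = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                t += p * math.log2(p / (rows[i] * cols[j]))
    return t

# hypothetical joint probabilities consistent with the quoted marginals
joint = [[0.375, 0.125],   # stimulus Sa: correct rejections, false alarms
         [0.175, 0.325]]   # stimulus Sb: misses, hits
print(round(transinformation(joint), 2))  # → 0.12
```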
An alternate way of calculating information transmission provides a different perspective on its meaning. Transinformation (T) is the amount by which our uncertainty concerning a stimulus is reduced by a response:

2.  T = U(S) - U(S|r)

U(S) is the entropy of the stimulus as measured by Equation 1. For the data shown in Figure 2, the probability of a signal was constant at .5, giving a value of 1 bit for U(S). The variable U(S|r) tells us how uncertain we are about the presence of a stimulus once we know which response occurred; it is called the equivocation of transmission (see, e.g., Attneave, 1959). It is calculated by applying Equation 1 to the conditional probabilities in Table 3.1 The equivocation depends on the expected equivocation from both yes and no responses (Equation 3):

3.  T = U(S) - [p(y)U(S|y) + p(n)U(S|n)]
1 It is also possible to calculate T from Table 2, as the entropy of the response less the expected ambiguities (U(r|S)) of the stimuli. Because U(S) is often fixed for an experimental condition, Equation 3 is more interpretable.
Application to the data in Table 3 yields a value of T = 0.12 bits.
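The equivocation route of Equation 3 can be checked numerically. The conditional probabilities below are hypothetical values consistent with the example (U(S) = 1 bit; p(y) = .45 with p(S|y) = .325/.45; p(n) = .55 with p(S|n) = .175/.55):

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

U_S = 1.0  # equiprobable signal and noise
T = U_S - (0.45 * binary_entropy(0.325 / 0.45) + 0.55 * binary_entropy(0.175 / 0.55))
print(round(T, 2))  # → 0.12
```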
Conserving information. The transinformation that the observer communicates from stimulus to response is T. If the stimuli are indiscriminable, T = 0. If two equiprobable stimuli are perfectly discriminated, T = 1. The observer might choose to communicate less information if so motivated by the payoff schedule. Call a payoff matrix honest if, in Table 4, v(CR) > v(F) and v(H) > v(M). An honest payoff matrix is one that rewards the observer more for telling the truth than for lying. A dishonest payoff matrix could generate data that fell below the diagonal in Figure 2, as could confusion concerning the appropriate use of response categories. In such cases of disinformation, T does not go negative (as does d'). The performance of the observer remains informative; it merely requires that the experimenter understand that yes means no, and vice versa; examples abound.

Under what conditions will an observer operating at the filled circle in Figure 2 move to an operating point closer to the diagonal? Assume the payoff remains honest and effective, as does the discriminability of the stimuli. If the observer decreases the hit rate while holding the CR rate constant, payoff will decrease; if the observer increases the false alarm rate while holding the hit rate constant, payoff will decrease. There is never motivation to operate inside the observed (maximum) level of performance. Honest payoff matrices force the observers away from the bottom right corner of the graph to the limit of their ability. This limit is called the channel capacity.
Under honest payoff matrices, it is thus always to the subject's advantage to respond in a manner that maximizes information transmission. Figure 2 shows as dots the loci that do this for an observer transmitting 0.12 bits, and for another transmitting 0.50 bits.

Characteristics of T. The relation between the traditional measure d' and T is shown in Figure 3. For d's of less than 3, the information transmitted, T, is approximately 10% of the square of d'. Isoinformation contours are very similar to Pollack and Norman's (1964) nonparametric ROC: The smooth curves in Figure 2 describe the performance that maintains a constant A'.
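The 10% rule of thumb can be illustrated with a short computation for an unbiased, equal-variance normal observer (a sketch; the function name is ours):

```python
from statistics import NormalDist
import math

def T_unbiased(d):
    """Transinformation for an unbiased, equal-variance normal observer."""
    pH = NormalDist().cdf(d / 2)   # criterion midway between the means
    # by symmetry p(yes) = .5 and p(S|yes) = pH, so T = 1 - H(pH)
    return 1.0 - (-(pH * math.log2(pH) + (1 - pH) * math.log2(1 - pH)))

print(round(T_unbiased(1.0), 3))  # → 0.109, close to 0.1 * d'**2 for d' = 1
```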
Transmitted information T is approximately distributed as chi-square, knT ≈ χ², where n is the number of observations and k = 2 ln[2] ≈ 1.386 (McGill, 1954). If the data in Table 1 were based on 100 observations, then with T = 0.12, χ² ≈ 16.6, which for 2 degrees of freedom is significantly greater than zero (p < .01). The exact chi-square for this matrix is χ² = 16.2. Measured values of T are biased estimators of its true value. Miller (1955) has shown that T' = T - (r-1)(c-1)/(kn) provides an unbiased estimate when n is not too small; r is the number of rows, c the number of columns, and k is 2 ln[2].
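Both the approximation and the bias correction are one-liners. In this sketch Miller's correction is divided by kn, an assumption consistent with the statement that the bias matters only when n is small:

```python
import math

k = 2 * math.log(2)                     # ≈ 1.386 (McGill, 1954)
n, T = 100, 0.12                        # observations; measured transinformation
chi2 = k * n * T                        # approximately chi-square distributed
T_corrected = T - (2 - 1) * (2 - 1) / (k * n)   # Miller's correction, r = c = 2
print(round(chi2, 1), round(T_corrected, 3))  # → 16.6 0.113
```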
+++ Insert Figure 3 (d’ vs. H) around here +++
Conditions for constancy of T. What accomplishes the shift along the Information Operating Characteristic (IOC) in Figure 2? As payoff varies, observers can maximize their earnings by shifting the proportion of yes and no responses, which affects the informativeness of those responses.
For transmitted information (T) to remain constant as the probability of a yes response changes, any concomitant gains or losses in p(y)U(S|y) must balance the losses or gains in p(n)U(S|n). For the isoinformation loci shown as dots in Figure 2, these complements are plotted in Figure 4.
++ Insert Figure 4 around here +++
More generally we may state that, given constant signal probabilities, for T to remain constant over shifts in bias, the change in the expected equivocation of yes responses must complement any changes in the expected equivocation of no responses:

d[p(y)U(S|y)]/d[p(y)] = -d[p(n)U(S|n)]/d[p(y)]

This is a strong constraint: it requires symmetric, congruent densities, such as those shown in Figure 1.
Bias. When the payoff is increased for yes responses, the observer can maximize payoff by increasing the number of yes responses. But if the probability and discriminability of the stimuli don't change, so that U(S) and T remain constant, then Equation 3 shows that as p(y)U(S|y) increases, p(n)U(S|n) must decrease. An equilibrium will be found somewhere on the IOC that maximizes payoff. The difference between the average equivocations of yes and no responses provides a measure of bias:

4.  B = p(n)U(S|n) - p(y)U(S|y)
B is positive for conservative biases, zero on the negative diagonal of the ROC, and negative for liberal biases. It may be normalized to range between -1 and +1 by dividing it by the entropy of the stimulus minus the transinformation, U(S) - T:

4'.  B' = B/(U(S) - T)

For the exemplary data in Figure 2, B' = B/(1.0 - 0.12). Figure 4 shows that as bias shifts, the transfer of uncertainty from yes to no responses falls along a diagonal with a slope of -1, thus conserving transinformation--the distance between the locus of the points and the diagonal. Conversely, for information to be conserved, the change in information from a yes response as its probability is varied must equal the complement of that from a no response. Isoinformative operating characteristics effectively describe some experimental data, as shown in Figure 5. The loci of isoinformative points in this space closely approximate the arc of a circle. The function is not visually discriminable from that described by the nonparametric statistic A' (Grier, 1971; Pollack & Norman, 1964).
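Equations 4 and 4' can be evaluated for a running example; the conditional probabilities below are hypothetical values consistent with U(S) = 1 and T = 0.12:

```python
import math

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# hypothetical values: p(y) = .45, p(S|y) = .325/.45; p(n) = .55, p(S|n) = .175/.55
p_y, p_n = 0.45, 0.55
B = p_n * H(0.175 / 0.55) - p_y * H(0.325 / 0.45)   # Equation 4
B_norm = B / (1.0 - 0.12)                           # Equation 4', U(S) = 1, T = 0.12
print(round(B, 3), round(B_norm, 3))  # → 0.113 0.128
```

The positive sign marks a conservative bias, consistent with the conservative criterion in the earlier example.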
+++ Insert Figure 5 (G&S symmetric) around here +++
There are, however, many data not well described by isoinformation contours, such as some reported by Green and Swets in their Figure 4-5, and replotted in Figure 6. The deviations from isoinformation occur both as skew, shown there, and as performance that is more concave than permitted by isoinformative curves. To deal with these deviations it is necessary to impute structure to the information processor.
+++ Insert Figure 6 (G&S skewed) around here +++
Imputing Mechanism
The isoinformation contour shown in Figure 6 misses the data systematically. This indicates that the observer is not able to maintain constant information transmission while increasing the proportion of "yes" responses. The assumption of symmetry of stimuli has failed: there is some intrinsic order to the stimuli that makes it impossible to conserve information under a symmetric change in bias. What is the minimal structure that can be added to similarly hobble the imputed discrimination mechanism? The dependent variables in these experiments are a pair of probabilities; such probabilities are functions on random variables. It has sufficed to describe the stimuli as arbitrary binary variables. It no longer does.

Knowing nothing about the stimuli (the random variables) that gave rise to p(H) and p(F) other than that they assume a finite set of values, the most general assumption that one can make about their distribution is that it is uniform--if the variable can take k values, the probability of any one is 1/k. The distribution in ROC-space that minimizes deviation from this distribution is a straight line issuing from 0,0 to the data point, and then extending to 1,1. But once such an operating point is in hand, it provides an estimate of the mean of both distributions. Given this estimate, the most general, least constrained a posteriori distribution is the geometric (Kapur, 1989). The geometric distribution is also the maximum entropy (maxent) distribution for a random variable that can assume a countably infinite number of states, given specification of the mean. In the continuous limit, the geometric approaches the exponential, which is the maxent distribution on the positive continuum, given knowledge of the mean. In the case of a finite upper limit, the maxent distribution is a renormalized exponential. Figure 7 shows the imputed distributions of two stimuli about which all we know is their means.
+++ Insert Figure 7 around here +++
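The maxent claim can be spot-checked numerically: among distributions on the positive line with a fixed mean, the exponential has the larger differential entropy. A minimal comparison against a same-mean uniform distribution (our choice of comparison, with a hypothetical mean):

```python
import math

mu = 3.0                               # hypothetical common mean
h_exponential = 1 + math.log(mu)       # differential entropy (nats) of Exp(mean mu)
h_uniform = math.log(2 * mu)           # entropy of Uniform[0, 2*mu], same mean
assert h_exponential > h_uniform
```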
On each trial of the detection experiment an event--stimulus or noise--is presented to the observer, on whom then impinges some instance of the random variable drawn from one of these distributions. Because the observer does not know a priori which distribution was sampled, she confronts a distribution that is a mixture of these two, looking like the average of the ordinates of the two curves shown in Figure 7. If the observer employs the standard decision-theoretic strategy of responding yes whenever the event exceeds some criterion value (e.g., a value of C = 7 in Figure 7) and no otherwise, a simple version of CSDT results. The probability of a hit, p(y|S), is the integral of the Signal distribution to the right of C: exp(-C/µS), where µS is the mean of the signal distribution. The probability of a false alarm, p(y|N), is the integral of the noise distribution to the right of C: exp(-C/µN). Solving for C and substituting gives the equation for the ROC:
5.  p(y|S) = p(y|N)^(1/β), with β = µS/µN
The probability of a hit is thus a power function of the probability of a false alarm, with an exponent 1/β determined by the ratio of the mean of the signal distribution to that of the noise distribution. When these means are equal, the ROC lies along the major diagonal, as no discrimination is possible. When the signal-to-noise ratio is large, the ROC rises quickly toward the upper left corner before bending over toward 1,1. The dashed line through the data in Figure 6 is a power ROC.
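The derivation of Equation 5 can be verified directly from the two exponential integrals (a sketch; µN, µS, and the criterion values are arbitrary):

```python
import math

def power_roc(p_fa, beta):
    """Hit rate under exponential evidence distributions: p(H) = p(F)**(1/beta)."""
    return p_fa ** (1 / beta)

# sweep the criterion C for hypothetical means mu_N = 1, mu_S = 4 (beta = 4)
mu_N, mu_S = 1.0, 4.0
beta = mu_S / mu_N
for C in (0.5, 1.0, 2.0):
    p_f = math.exp(-C / mu_N)   # noise mass above the criterion
    p_h = math.exp(-C / mu_S)   # signal mass above the criterion
    assert abs(p_h - power_roc(p_f, beta)) < 1e-12
```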
Green and Swets (1966, p. 69) recognized the importance of the exponential as an underlying discriminal process:

an exponential distribution has, we feel, many advantages over the Gaussian assumption with unequal variance. First, the decision axis is monotonic with likelihood ratio. Second, this distribution arises in a natural way in many counting processes and may, therefore, generate interesting hypotheses about the sensory mechanisms. Third, and equally important, it is a one-parameter distribution, and thus the ROC curve can be summarized by one parameter rather than by two.

Why was the exponential not pursued? Their next sentence tells: "The parsimony gained, however, entails the risk that the new distribution may fit fewer data than the two-parameter models."
Possibly; but Figures 8 and 9 show that the exponential assumption is in fact robust.2,3
+++ Insert Figures 8 & 9 (Swets, Norman) around here +++
Characteristics of the power ROC. The parameter of the power function β is the ratio of the means of the inferred signal and noise distributions. As described below, its logarithm gives the entropy of the ROC. It conveys the same information as d', as it increases from 0 as discriminability improves. Using a logistic approximation to the normal, for unbiased observers and processes with equal dispersions, a measure equivalent to d' is ln[H/(1-H)] (Luce, 1963). Figure 10 shows the square of this index plotted as a function of the parameter of the power characteristic, for hit rates ranging from .5 to .95. An alternate index of merit is the area under the ROC, A. Green and Swets (1966) showed that under very general considerations, this area predicts the percent correct in a two-alternative forced choice task. Integrating the power function gives:

6.  A = µS/(µS + µN) = β/(1 + β)

Conversely, the parameter of the power function can be inferred from the percent correct as:

β = A/(1 - A),
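Equation 6 and its inverse are trivial to encode; the round trip between β and A serves as a consistency check (the function names are ours):

```python
def area_from_beta(beta):
    """Area under the power ROC: A = beta/(1 + beta) (Equation 6)."""
    return beta / (1 + beta)

def beta_from_area(A):
    """Invert Equation 6: beta = A/(1 - A)."""
    return A / (1 - A)

assert abs(beta_from_area(area_from_beta(4.0)) - 4.0) < 1e-9
print(area_from_beta(4.0))  # → 0.8
```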
3 The ROC curves shown in these figures minimize the sum of squares deviation between data and curve on each axis. This was accomplished by appending to the data files a replicate with ordinates and abscissae exchanged, and simultaneously minimizing the error variance around p(F) = p(H)^β for that appendix.
where A is estimated from the 2AFC task. Egan (1975) provides a thorough analysis of power ROCs. Equations 4 and 4' continue to be useful as measures of bias in the information transmitted. Information transmission is greater for conservative biases, where the ROC is the greatest distance from the diagonal. To maximize information transmission the observer cannot be unbiased.
+++ Insert Figure 10 (d’ vs. beta) around here +++
The entropy of the inferred signal-to-noise ratio, ln[β], is plotted in Figure 11 as a function of the relative amplitude of signal and noise for the data of two of Norman's (1963) observers. Notice that in the best case the entropy approaches 4 bits (β ≈ 16); but with observers restricted to a binary judgment, the most information that can be transmitted is 1 bit.
+++ Insert Figure 11 (Norman’s info vs. dB) around here +++
Weber's law. In studies of sensory discrimination it is often found that the dispersion of judgments is proportional to the mean of the stimuli: Over a large range, the coefficient of variation of judgments is constant. The Thurstone paradigm shown in Figure 1 assumes equal variance for these processes (this assumption of independent random variables with equal variance is known as Case V). Weber's law is inconsistent with this simplifying assumption. In CSDT studies the standard deviation of the inferred signal distribution is often found to be proportional to the mean signal strength (e.g., Green & Swets, 1966, p. 407; Jeffress, 1970; Markowitz & Swets, 1967), consistent with Weber's law. Symmetric densities such as the normal or logistic appear as straight lines on probability axes, with slopes of m = σN/σS and intercepts of I = (µS - µN)/σS. Under Weber's law, σ = wµ; it follows that m = 1 - wI: Slopes should be a linearly decreasing function of discriminability. This is also true for power ROCs, which are slightly concave up in logit coordinates, and have slopes of m = 1/β at the intercept, which flatten as the intercept increases, as m = ln[1 - e^-I]/ln[2].
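The relation m = 1 - wI follows by substitution; a numerical spot-check with an arbitrary (hypothetical) Weber fraction and means:

```python
w = 0.2                                 # hypothetical Weber fraction
mu_N, mu_S = 5.0, 8.0                   # hypothetical noise and signal means
sigma_N, sigma_S = w * mu_N, w * mu_S   # Weber's law: sigma = w * mu
m = sigma_N / sigma_S                   # slope on probability axes
I = (mu_S - mu_N) / sigma_S             # intercept
assert abs(m - (1 - w * I)) < 1e-12
```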
The unequal-variance normal distributions typically used as the inferred discriminal process have the inconvenient property that their densities cross at two points, entailing that performance must fall below the major diagonal for sufficiently liberal criteria. Such improper ROCs are seldom observed (cf. Blough, 1967). This is why Green and Swets admired the "monotonic decision axis" of the exponential, which both entails Weber's law and remains proper in modeling it. Thus, the standard implementation of CSDT reinforces Weber's law, which in turn undermines that standard implementation. For exponential processes, the standard deviation automatically increases with the mean, without forcing improper ROC curves. This is also true for the Rayleigh and Weibull distributions, which are discussed below.
Symmetry vs. simplicity. Power ROCs arise from the simplest of
assumptions--a maxent distribution on the positive line with known
means--but they cannot be symmetric. Although the current analysis
was introduced with the assumption of symmetry of signal types, such
symmetry makes a stronger demand on the state of nature: The signal
distribution must be the square root of a measure-preserving
transformation of the noise distribution. A translation composed with a
reflection is a simple example. If a distribution is symmetric, as is the
case for the Gaussian distributions in Figure 1, the reflection will go
unnoticed.
Form vs. substance. The present analysis is formal, asking what the
least constraining ROC curves are under a variety of assumptions. If the
stimuli are multidimensional, then the central limit theorem indicates
that the sum of values on each of the dimensions will approach the
normal. If decisions are based on the sums or extrema or convolutions
of multidimensional features, then the simple predictions for
unidimensional stimuli evolve toward more interesting ones. If, for
instance, an observer attends to the most distinctive feature on one of
several dimensions (or to the least distinctive feature), the discriminal
processes will be extreme value distributions (Killeen, 2001). The difference
of two extreme value processes has a logistic distribution, which is the
form assumed by Luce (1959; 1963) in his theory of signal detection. It is
the discriminal process that is predicted if the observer is comparing
two stimuli on the basis of most or least distinctive features, as they
might in a police lineup.
In the case of recognition memory, comparison of various theoretical
models is economically effected with ROC curves (Lockhart & Murdock, 1970).
Some models of recognition memory require constancy of slope, others
that it take a value of 1.0, and yet others that it be correlated with the
intercept. The empirical ROC curves support one or another theory as a
function of the methods used for manipulating detectability of old
versus new words. In some cases, conformance to the constraint for a
Weberian continuum (m = 1 – wI) is good. Key references include Clark
and Gronlund (1996), Glanzer, Kim, Hilford, and Adams (1999), Ratcliff,
McKoon, and Tindall (1994), and Shiffrin and Steyvers (1997).
Extracting More Information from the Experiment
The maximum information that can be conveyed in the CSDT
paradigm is 1 bit, and this is achieved only when performance is
perfect. The data shown in Figures 8 and 9 required many sessions of
experimentation to establish reliable estimates at several points in the
information space. It was inevitable that more efficient methods would
soon be devised. In recent reviews of SDT (Swets, 1986a, 1986b; Swets et al.,
2000), as well as in analyses of recognition memory, almost all of the ROCs
were derived by the confidence rating technique.
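The 1-bit ceiling, and how far below it typical data fall, can be verified directly from the joint probabilities of Table 1. The sketch below computes the transmitted information T = H(S) + H(R) – H(S, R) and recovers the value of about 0.12 bit labeled on the inner curve of Figure 2.

```python
from math import log2

# Joint probabilities p(S, R) from Table 1 (rows: Sa, Sb; columns: Ra, Rb).
p = [[0.375, 0.125],
     [0.175, 0.325]]

def H(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(q * log2(q) for q in probs if q > 0)

p_s = [sum(row) for row in p]            # marginal over stimuli
p_r = [sum(col) for col in zip(*p)]      # marginal over responses
# Transmitted information T = H(S) + H(R) - H(S, R)
T = H(p_s) + H(p_r) - H([q for row in p for q in row])
print(round(T, 2))   # -> 0.12
```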
The confidence rating ROC. If, instead of a binary response, subjects
are encouraged to qualify their decisions by rating how confident they
are in their judgment, the matrices shown in Tables 1-3 expand to a
version of Table 5. In that table, positive responses are taken to
designate Stimulus B, with confidence greatest for responses of n and
decreasing to 1. Negative responses are taken to designate Stimulus A,
with confidence in that designation increasing with the absolute value of
the response.
+++ Insert Table 5 around here +++
In constructing a scale from such data it is assumed that the
observer partitions the decision axis--the x-axis of Figure 7 (or some
monotone function of it). Assume that n = 5. Then a rating of -5 is close
to the origin: The magnitude of the event that was perceived was small--
or the depth of the quiet that was sensed was great. If a payoff schedule
convinced an observer to place the criterion between the -5th and -4th
categories, then all events of magnitude below that criterion would be
called “A,” and contribute to the calculation of correct rejections and
misses. Events of magnitude falling above that criterion would be called
“B,” and all of the columns from -4 through +5 would be pooled to
contribute to the calculation of hits and false alarms. Knowing whether
a stimulus was present or not, we could calculate the true positives and
false positives around that criterion. The procedure is continued by
successively aggregating on either side of the criteria separating each of
the columns of Table 5.
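The pooling procedure just described can be sketched in a few lines. The category counts below are hypothetical placeholders (Table 5 is an empty template, and no raw rating counts are reproduced here); columns run from the highest-confidence “A” rating to the highest-confidence “B” rating.

```python
# Sketch of the rating-ROC pooling procedure described above.
# Hypothetical counts of trials per rating category, ordered from
# highest-confidence "A" (-n) to highest-confidence "B" (+n).
counts_a = [30, 25, 20, 10, 8, 7]    # Stimulus A (noise) trials
counts_b = [5, 8, 12, 20, 25, 30]    # Stimulus B (signal) trials

def roc_points(noise, signal):
    """Slide the criterion across category boundaries, pooling all
    categories above it into the 'B' response."""
    n_noise, n_signal = sum(noise), sum(signal)
    points = []
    for c in range(1, len(noise)):
        p_f = sum(noise[c:]) / n_noise    # false alarms: A trials called "B"
        p_h = sum(signal[c:]) / n_signal  # hits: B trials called "B"
        points.append((p_f, p_h))
    return points

for p_f, p_h in roc_points(counts_a, counts_b):
    print(round(p_f, 3), round(p_h, 3))
```

Each criterion placement contributes one (p(F), p(H)) point, so an n-category rating scale yields an entire empirical ROC from a single condition.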
Representative early data were published by Emmerich (1968), who
had observers move a slider over a range of 13.5 inches. Locations to
the left of a central mark indicated degree of certainty that no signal
had been presented, and ones to the right indicated degree of certainty
that a signal had been presented. Emmerich aggregated the data into n
= 10 confidence ratings on either side of the center. One of the dozen
data sets he presented is displayed as Figure 12. There is a deviation
from these power curves evident as an over-prediction for the lower
parts of the curves. This deviation was also present in other figures, and
equally evident in fits of unequal-variance normal distributions. It may
be due to lack of symmetry in the response modality: A response 4
inches to the left of center (moderate confidence no) might not have
meant the same as a response 4 inches to the right of the center
(moderate confidence yes). There is, nonetheless, a fair amount of
order in the data. The inset shows that the logarithm of β is a linear
function of the logarithm of the signal-to-noise ratio, as was the case for
Norman’s pedestal experiment (Figure 11; the use of energy rather than
voltage/amplitude merely changes the scale on the x-axis).
+++ Insert Figure 12 around here +++
Signal detection as category scaling. The ROC is a graph of one
cumulative probability distribution as a function of another, as an
argument (here, the criterion) ranges from the highest limits to the
lowest. This display is also called a Q-Q plot (a quantile-quantile plot;
Chambers, Cleveland, Kleiner & Tukey, 1983). Statistics are available for
specific hypotheses about ROCs (e.g., Metz & Kronman, 1980); Smith,
Thompson, Engelgau, and Herman (1996) provide a generalized linear
model and references to other models.
Symmetric densities that differ only in their means generate
symmetric ROCs. The information displayed consists of probabilities,
not measures of physical or psychological spaces. Yet the derivation of
measures such as d’ from probabilities assumes an underlying interval
scale (the z-score map to probabilities) for the random variables.
Observers can transmit the maximum information if categories are used
accurately and with equal frequency; they attempt to achieve both these
goals, with performance usually a compromise between them (Parducci,
1983). On Weberian continua, equal use of categories requires that
category widths be an increasing function of the magnitude of the
stimuli. Plotted against the magnitude of the stimuli, optimal category
scales should be concave, as they are (Eisler & Montgomery, 1974;
Marks, 1974; Stevens & Galanter, 1957). If the categories are imposed
without respecting the nonlinearity of the category scale, for example
by dictating either assignments or anchors--such as centering the
confidence rating on the equiprobable point--then the category scale
will be warped and information transmission decreased (Killeen, 1998;
Stevens, 1955). Rating-scale ROCs should permit the observer’s use of an
unrestricted category scale, rather than one anchored at the midpoint.
Balakrishnan (1998) shows that the sum of the differences between
p(H) and p(F) for each of the categories provides a measure of
sensitivity that is superior to traditional measures.
Entropy of the Signal
The judgments of Emmerich’s (1968) observers contained more than 1
bit of entropy. Had the observers used each of the 20 categories in that
experiment with equal probability, the response distribution would
have had 4.32 bits of entropy, the theoretical maximum information that
they could transmit. The maximum achievable by such scaling is
probably less than this: Hake and Garner (1951) found that the channel
capacity for judgment of the position of a marker on a line was 3.3 bits.
Since this is the best in the simple case of localizing a point on a line
(under these circumstances), use of a point on a line to measure other
variables is unlikely to be more informative. In any case, Emmerich’s
observers demonstrably had access to more than the 1 bit of
information expended by the experimenter when the latter classified
the stimuli as either signal or noise. Where did this information come
from, and how is it lost from CSDT?
The entropy of a binary signal is a simple function of its probability
distribution (Equation 1). As the number of states of the signal increases
beyond 2, so also may the entropy of the signal. A continuous signal has
an infinity of possible states, and thus can potentially convey an infinite
amount of information. But the ability of an observer to process this
information is limited by a finite perceptual “grain”--the just-noticeable
difference, or jnd. Call the limiting grain size ∆x. Then the entropy of a
continuous distribution on the variable x is (Norwich, 1993):

U = –∫0^∞ f(x) log2[f(x)] dx + lim∆x→0 log2[1/∆x].        (7)

The integral is known as the differential entropy. It is a function of
characteristics of the signal. For an exponential distribution, f(x) =
(1/µ)exp(–x/µ). The differential entropy of the exponential distribution
with mean µS is log2(eµS). The rightmost “limit” in Equation 7 represents
the increasing information that becomes available as the grain size is
reduced. It grows without bound as ∆x goes to zero. Call this component
g(∆).
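The closed form log2(eµS) for the exponential’s differential entropy is easy to confirm numerically. The sketch below integrates –f log2 f by a simple midpoint rule, with an arbitrary illustrative mean.

```python
from math import e, exp, log2

# Check the differential entropy of the exponential density
# f(x) = (1/mu) exp(-x/mu): the closed form quoted above is log2(e * mu).
mu = 3.0                      # arbitrary illustrative mean
dx = 1e-3
total, x = 0.0, dx / 2        # midpoint Riemann sum
while x < 40 * mu:            # integrate far into the tail
    f = exp(-x / mu) / mu
    total += -f * log2(f) * dx
    x += dx
print(total, log2(e * mu))    # the two agree to several decimals
```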
On any trial the observer is presented with a stimulus from a
mixture of signal and noise distributions--an average of those shown in
Figure 7. On such a trial we may ask how much entropy is derived from
the signal, over and above that from the noise. The difference in
entropies of signal and noise is:

US – UN = (log2[eµS] + g(∆)) – (log2[eµN] + g(∆)) = log2[µS/µN].        (8)

Thus, as long as the signal and noise densities are parsed with the
same grain size, the component of information due to the grain
cancels out, leaving a measure of the maximum information available
from the signal in this mixture.
Digitization Loss
When an experimenter classifies an event into categories, he
performs the same task as the observer in his experiment: He is
mapping a complex stimulus onto a binary representation (e.g., the
nomination “S” vs. “N”). The only difference is that the experimenter
has an additional source of information (the position of a switch, or the
printout of a computer). But it is not the position of a switch that is
presented to the observer: It is the raw stimulus that the experimenter
categorized as S or N, and which now the observer must categorize. The
stimulus may be as simple as the presence of a tone in a background of
noise, or as complex as the presence of a tumor in an x-ray. Both the
experimenter and the observer are part of an information system
(Figure 13). There are three sources of entropy: the experimenter, the
stimulus, and the observer. The transmission between the stimulus and
a binary experimenter is at most 1 bit. The transmission between the
stimulus and the observer may be more than 1 bit, as seen in the rating
scale paradigm. By limiting his nomination of the stimulus to 2 states,
however, the experimenter limits information transmission between
himself and the observer to 1 bit.
The stimulus, whether a masked tone, a PET scan, or an
encyclopedia, may have arbitrarily large entropy. It is the three-way
interaction of the signaler, the signal, and the receiver that constitutes
the proper object of analysis. Attneave (1959) and McGill (1954) provide
the models. The three-way interaction may occur even where there
appear to be fewer entities, as in a conversation: A speaker’s face may
be an open book to his listener, even though closed to himself. To
“know oneself” an observer must map a most complex stimulus onto a
reduced vocabulary in a manner that conveys information to the self as
listener.
+++ Insert Figure 13 (icon) around here +++
The quantization loss incurred by using less than the maximum
capacity of a signal to transmit information was graphed by Harmon
(1963), and is reproduced as Figure 14. If all the experimenter knows is a
binary state, 1 bit is the best that she can do. Knowing the mean
stimulus value she might estimate the differential entropy of the
stimulus. But the experimenter may have limited control of the noise,
components of which may be internal to the observer, or limited control
of the sample drawn by the observer on any one trial from the stimulus
population. Thus it is often the experimenter, not the observer, that
forms the bottleneck in the communication channel.
+++ Insert Figure 14 (Harmon) around here +++
Theory of the Ideal Observer
The ideal observer utilizes all of the information available in a signal
to make a decision. If the r a n d om variable is voltage or amplitude of a
signal, then its variance, σ2, corresponds to the power of the signal. T h e
distribution with known average power and maximum randomness is a
Gaussian, with a differential entropy of log2[ 2 πeσ2]/2 (Shannon, 1949). If
more than one frequency is involved, the maxent signal has Gaussian
amplitude at each frequency, and is called white noise. In a channel
with noise added to the signal, the net entropy of the signal is log2[ (σS +
σN)/ σN] = log2[1 + S/ N]. If observers are sensing differences i n
amplitude, then this equation specifies the most entropy that they c a n
convert into information. This is the reason for the ubiquitous use o f
the logarithm of the signal-to-noise ratio as a common x-axis i n
psychoacoustics (the decibel), and of white noise as a masker. T h e
decibel is compared to other measures by Grantham (1982).
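Shannon’s expression for the Gaussian differential entropy can be verified the same way as the exponential’s, by numerical integration; σ here is an arbitrary illustrative value.

```python
from math import pi, e, exp, sqrt, log2

# Verify the Gaussian differential entropy log2(2*pi*e*sigma^2)/2
# quoted above; sigma is an arbitrary illustrative value.
sigma = 2.0
dx = 1e-3
total, x = 0.0, -12 * sigma   # integrate across +/- 12 standard deviations
while x < 12 * sigma:
    f = exp(-x * x / (2 * sigma * sigma)) / (sigma * sqrt(2 * pi))
    total += -f * log2(f) * dx
    x += dx
print(total, log2(2 * pi * e * sigma ** 2) / 2)
```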
What is the observer observing? Even in the case of simply detecting
a tone in a background of other tones, or of white noise, it is not easy to
say just what an observer is basing his or her decision upon--that is, the
nature of the decision axis corresponding to the x-axis in Figure 7. A
Gaussian distribution of signal voltages or sound pressures is often
assumed as the discriminal process (Figure 1) because it corresponds to
the signal that, given specification of average power (σ2), is most
random, and which therefore has the potential to convey the most
information. The addition of one Gaussian random variable to another--
their convolution--yields another Gaussian--the signal plus noise
distribution, resulting in the icon of CSDT shown4 in Figure 1.
But what if observers are detecting differences in power--the
amplitude squared? Then the resulting discriminal processes--the
convolution of the squares of Gaussian processes--are chi-square (χ2)
distributions. The degrees of freedom of the χ2 correspond to the
number of normal processes that are involved. If these componential
distributions have different means, then the resulting non-central chi-
square distribution is known as the Rayleigh-Rice distribution (Evans,
Hastings & Peacock, 1993). Laming (1986) has made a good case for the χ2 as the
fundamental discriminal process for energetic stimuli, as has Jeffress
(1964) for auditory stimuli. With a large number of degrees of freedom
(or a large non-centrality parameter), χ2 processes are essentially
normal, looking much like those pictured in Figure 1. In the case of a
small number of degrees of freedom, they are approximately normal
when plotted on a logarithmic axis.
Finally, consider the Weibull distribution, 1 – exp[–(x/µ)^γ]. For γ = 1
its density is the exponential process of Figure 7; for γ = 2 it is the
Rayleigh density; and for γ = 3 it is a symmetric distribution that
resembles the normal. It has been used as a model for the psychometric
function (Fay & Coombs, 1992; Nachmias, 1981; Relkin & Pelli, 1987). For all values of γ,
the ROC of the Weibull is given by Equation 5 with β raised to the power
γ. Thus the ROC remains a power function. It is clear that the shape of
the operating characteristic is only weakly constrained by assumptions
about the nature of the underlying discriminal processes.

4 Figure 1, although representative of this scenario, is euphemistic. For the distribution on the right to be
signal plus noise, its variance must be larger than the one on the left by the rms of their variances. Thus,
asymmetric ROCs are entailed by this standard assumption.
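The claim that the Weibull yields Equation 5 with β raised to the power γ follows from its survivor function, exp[–(c/µ)^γ]. The sketch below checks it numerically, with assumed illustrative parameters.

```python
from math import exp

# For Weibull(mu, gamma) evidence distributions sharing a shape gamma,
# p(F) = exp[-(c/mu_n)^gamma] and p(H) = exp[-(c/mu_s)^gamma], so
# p(H) = p(F)^(1/beta^gamma) with beta = mu_s/mu_n: still a power ROC.
mu_n, mu_s, gamma = 1.0, 2.0, 3.0      # assumed illustrative parameters
beta = mu_s / mu_n
for c in [0.5, 1.0, 1.5, 2.0, 3.0]:    # sweep the criterion
    p_f = exp(-((c / mu_n) ** gamma))
    p_h = exp(-((c / mu_s) ** gamma))
    assert abs(p_h - p_f ** (1 / beta ** gamma)) < 1e-12
print("power ROC with exponent 1/beta^gamma confirmed")
```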
Channel Capacity for Weberian Stimuli
The derivation of Equation 8 from Equation 7 treated noise as
perturbing the signal and thus undermining the information that can be
extracted from it. This is a pragmatic assumption because it cancels out
the infinite term in the integration, reducing the differential entropy to
an absolute one. The origin of the noise may reside not so much in the
signal, but in the ruler against which the signal is measured: Each new
signal may perturb psychological representations by amounts
proportional to the signal’s magnitude. With stimuli presented in
random order, the resulting perturbations will be approximately
Gaussian (Killeen, 1991). It is known from the empirical literature that the
standard deviations of the perturbations are approximately
proportional to the mean of the stimuli, σ = wµ, where w is the Weber
coefficient. Killeen and Taylor (2000) provide one mechanism for such
proportional error. If the signal is also assumed to have a Gaussian
distribution, then the maximum information available from the signal is
the difference in entropy of signal and noise:
US – UN = (log2[2πeµS²]/2 + g(∆)) – (log2[2πew²µS²]/2 + g(∆)) = –log2[w].        (9)

Weber coefficients usually assume values between 0.25 and 0.05 for
various sensory continua, for which the channel capacities are
predicted to be between 2 and 4.3 bits, consistent with data on absolute
judgments. The same result follows if both signal and noise are assumed
exponential. In the case that the signal is exponentially distributed and
the noise Gaussian, the channel capacity will be lower by log2[2π]/2,
approximately 1.3 bits.
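Equation 9’s predictions for the quoted range of Weber coefficients, and the Gaussian-noise penalty just mentioned, can be evaluated directly:

```python
from math import log2, pi

# Channel capacity -log2(w) from Equation 9, for the range of Weber
# coefficients quoted above.
for w in (0.25, 0.05):
    capacity = -log2(w)
    print(w, round(capacity, 2))   # 0.25 -> 2.0 bits; 0.05 -> 4.32 bits

# Penalty when the signal is exponential but the noise Gaussian.
penalty = log2(2 * pi) / 2
print(round(penalty, 2))           # -> 1.33, i.e., roughly 1.3 bits
```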
Summary and Conclusions
Observers in SDT tasks are conveying information concerning a
stimulus to an experimenter. For the binary detection task, the
information is limited by the binary categorization of the experimenter.
The entropy of the stimulus may be greater than 1 bit, and this may be
revealed by subjects who are given the opportunity to rate the
likelihood that a stimulus was present. If the underlying distributions of
perceptions are symmetric and congruent, the observer will be able to
maintain constant information transmission while shifting bias to favor
one or the other response category. The resulting iso-informative
operating characteristics are approximately arcs of a circle.
Often subjects are unable to maintain constancy of information
transmission while shifting bias, implying anisotropy in the underlying
continuum. The discriminal process that makes the least assumptions
beyond an estimate of its mean is the exponential. The operating
characteristics that result from this assumption are power functions, as
are those resulting from the generalization of the exponential known as
the Weibull distribution. Power-law operating characteristics are
characterized by the parameter β, the inferred signal-to-noise ratio,
which is closely related to d’.
Most contemporary ROCs are constructed using the confidence
rating technique. Inferences concerning discriminations are thus based
on category scales, often without recognition of this fact, or without
optimizing performance based on the scaling literature.
The entropy of a stimulus, and thus the maximum information
available to the observer, may vary as a function of its mean. The
measure of entropy depends on the number of states of the random
variable, and for continuous distributions can be infinite. This problem
of infinitesimal grain size is avoided if discussion is limited to net
entropy, with another source of entropy, such as noise, canceling out
the infinitesimal. For both exponential and normal discriminal
processes, the maximum information available from a signal is
proportional to the logarithm of the signal-to-noise ratio.
On the line, entropy is determined up to an arbitrary grain size; in
this wise, it reflects our finite ability to resolve differences, whether in a
physical measurement or a psychological one. If the grain size is
determined by the noise attendant on measurement, and this is
proportional to signal magnitude--as is the case for continua on which
Weber’s law holds--then the channel capacity for a continuum is a
function of its Weber fraction alone.
References
Attneave, F. (1959). Applications of information theory to psychology. New York: Holt, Rinehart, and Winston.
Blough, D. S. (1967). Stimulus detection as signal detection in pigeons. Science, 158, 940-941.
Brooks, R. (1991). Intelligence without representation. Artificial Intelligence, 47, 139-159.
Clark, S. E., & Gronlund, S. D. (1996). Global matching models of recognition memory: How the models match the data. Psychonomic Bulletin & Review, 3, 37-60.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Eisler, H., & Montgomery, H. (1974). On theoretical and realizable ideal conditions in psychophysics: Magnitude and category scales and their relation. Perception & Psychophysics, 15, 157-168.
Emmerich, D. S. (1968). Receiver-operating characteristics determined under several interaural conditions of listening. Journal of the Acoustical Society of America, 43, 298-307.
Evans, M., Hastings, N., & Peacock, B. (1993). Statistical distributions (2nd ed.). New York: Wiley.
Fay, R. R., & Coombs, S. L. (1992). Psychometric functions for level discrimination and the effects of signal duration in the goldfish (Carassius auratus): Psychophysics and neurophysiology. Journal of the Acoustical Society of America, 92, 189-201.
Glanzer, M., Kim, K., Hilford, A., & Adams, J. K. (1999). Slope of the receiver-operating characteristic in recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 500-513.
Grantham, D. W., & Yost, W. A. (1982). Measures of intensity discrimination. Journal of the Acoustical Society of America, 72, 406-410.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Grier, J. B. (1971). Nonparametric indices for sensitivity and bias: Computing formulas. Psychological Bulletin, 75, 424-429.
Hake, H. W., & Garner, W. R. (1951). The effect of presenting various numbers of discrete steps on scale reading accuracy. Journal of Experimental Psychology, 42, 358-366.
Harmon, W. W. (1963). Principles of the statistical theory of communication. New York: McGraw-Hill.
Jaynes, E. T. (1979). Where do we stand on maximum entropy? In R. D. Levine & M. Tribus (Eds.), The maximum entropy formalism (pp. 15-118). Cambridge, MA: Massachusetts Institute of Technology.
Jaynes, E. T. (1986). Bayesian methods: An introductory tutorial. In J. H. Justice (Ed.), Maximum entropy and Bayesian methods in applied statistics. Cambridge: Cambridge University Press.
Jeffress, L. A. (1964). Stimulus-oriented approach to detection. Journal of the Acoustical Society of America, 36, 766-774.
Jeffress, L. A. (1970). Masking. In J. V. Tobias (Ed.), Foundations of modern auditory theory (pp. 87-114). New York: Academic Press.
Juslin, P., & Olsson, H. (1997). Thurstonian and Brunswikian origins of uncertainty in judgment: A sampling model of confidence in sensory discrimination. Psychological Review, 104, 344-366.
Kapur, J. N. (1989). Maximum entropy models in science and engineering. New York: Wiley.
Killeen, P. R. (1991). Paying attention limits channel capacity. In G. R. Lockhead (Ed.), Fechner Day 91: Proceedings from the Seventh Annual Meeting of the International Society for Psychophysics (pp. 27-32). Durham, NC, USA.
Killeen, P. R. (1998). Fechner's magic. In S. Grondin & Y. Lacouture (Eds.), Fourteenth Annual Meeting of the International Society for Psychophysics (pp. 1-9). Quebec, Canada: Université Laval.
Killeen, P. R. (2001). Writing and overwriting short-term memory. Psychonomic Bulletin & Review, in press.
Killeen, P. R., & Taylor, T. J. (2000). How the propagation of error through stochastic counters affects time discrimination and other psychophysical judgments. Psychological Review, 107, 430-459.
Kornbrot, D. E., Donnelly, M., & Galanter, E. (1981). Estimates of utility function parameters from signal detection experiments. Journal of Experimental Psychology: Human Perception and Performance, 7, 441-458.
Laming, D. (1986). Sensory analysis. New York: Academic Press.
Lee, W. (1969). Relationship between Thurstone category scaling and signal detection theory. Psychological Bulletin, 71, 101-107.
Lockhart, R. S., & Murdock, B. B. (1970). Memory and the theory of signal detection. Psychological Bulletin, 74, 100-109.
Luce, R. D. (1959). Individual choice behavior. New York: Wiley.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Luce, R. D. (1977). Thurstone's discriminal processes fifty years later. Psychometrika, 42, 461-489.
Luce, R. D. (1994). Thurstone and sensory scaling: Then and now. Psychological Review, 101, 271-277.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge, England: Cambridge University Press.
Macmillan, N. A., & Creelman, C. D. (1996). Triangles in ROC space: History and theory of "nonparametric" measures of sensitivity and bias. Psychonomic Bulletin & Review, 3, 164-170.
Markowitz, J., & Swets, J. A. (1967). Factors affecting the slope of empirical ROC curves: Comparison of binary and rating responses. Perception & Psychophysics, 2, 91-100.
Marks, L. E. (1974). On scales of sensation: Prolegomena to any future psychophysics that will be able to come forth as science. Perception & Psychophysics, 16, 358-376.
McGill, W. J. (1954). Multivariate information transmission. Psychometrika, 19, 97-116.
Metz, C. E., & Kronman, H. B. (1980). Statistical significance tests for binormal ROC curves. Journal of Mathematical Psychology, 22, 218-243.
Nachmias, J. (1981). On the psychometric function for contrast detection. Vision Research, 21, 215-223.
Norman, D. A. (1963). Sensory thresholds and response bias. Journal of the Acoustical Society of America, 35, 1432-1441.
Norwich, K. W. (1993). Information, sensation, and perception. New York: Academic Press.
Parducci, A. (1983). Category ratings and the relational character of judgment. In H. G. Geissler (Ed.), Modern issues in perception (Vol. 11, pp. 262-282). Amsterdam: Elsevier.
Pollack, I., & Norman, D. A. (1964). A nonparametric analysis of recognition experiments. Psychonomic Science, 1, 125-126.
Ratcliff, R., McKoon, G., & Tindall, M. (1994). Empirical generality of data from recognition memory receiver-operating characteristic functions and implications for the global memory models. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 763-785.
Relkin, E. M., & Pelli, D. G. (1987). Probe tone thresholds in the auditory nerve measured by two-interval forced-choice. Journal of the Acoustical Society of America, 82, 1679-1691.
Shannon, C. E. (1949). A mathematical theory of communication. In C. E. Shannon & W. Weaver (Eds.), The mathematical theory of communication. Urbana, IL: University of Illinois Press.
Shiffrin, R. M., & Steyvers, M. (1997). Model for recognition memory: REM--retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.
Skilling, J. (1989). Maximum entropy and Bayesian methods. Dordrecht, The Netherlands: Kluwer Academic Publishers.
Smith, P. J., Thompson, T. J., Engelgau, M. M., & Herman, W. H. (1996). A generalized linear model for analysing receiver operating characteristic curves. Statistics in Medicine, 15, 323-333.
Stevens, S. S. (1955). On the averaging of data. Science, 121, 113-116.
Stevens, S. S., & Galanter, E. (1957). Ratio scales and category scales for a dozen perceptual continua. Journal of Experimental Psychology, 54, 377-411.
Swets, J. A. (1986a). Form of empirical ROCs in discrimination and diagnostic tasks: Implications for theory and measurement of performance. Psychological Bulletin, 99, 181-198.
Swets, J. A. (1986b). Indices of discrimination or diagnostic accuracy: Their ROCs and implied models. Psychological Bulletin, 99, 100-117.
Swets, J. A., Dawes, R. M., & Monahan, J. (2000). Psychological science can improve diagnostic decisions. Psychological Science in the Public Interest, 1, 1-26.
Swets, J. A., Tanner, W. P., & Birdsall, T. G. (1964). Decision processes in perception. In J. A. Swets (Ed.), Signal detection and recognition by human observers (pp. 3-57). New York: Wiley.
Tribus, M. (1969). Rational descriptions, decisions and designs. New York: Pergamon.
Tribus, M. (1979). Thirty years of information theory. In R. D. Levine & M. Tribus (Eds.), The maximum entropy formalism (pp. 1-14). Cambridge, MA: Massachusetts Institute of Technology.
Table 1. The joint probabilities from a discrimination experiment.

                        Response
Stimulus       Ra             Rb
Sa            .375           .125           p(Sa) = .50
Sb            .175           .325           p(Sb) = .50
           p(Ra) = .55    p(Rb) = .45           1.0

Table 2. Stimulus-conditional probabilities, p(Ri|Sj), derived from Table 1.

                        Response
Stimulus       Ra             Rb
Sa         p(CR) = .750    p(F) = .250         1.0
Sb         p(M)  = .350    p(H) = .650         1.0

Table 3. Response-conditional probabilities, p(Si|Rj), derived from Table 1.

                        Response
Stimulus       Ra             Rb
Sa         p(CR') = .682   p(F') = .278
Sb         p(M')  = .318   p(H') = .722
               1.0            1.0

Table 4. The payoff matrix.

                        Response
Stimulus       Ra             Rb
Sa            v(CR)          v(F)
Sb            v(M)           v(H)

Table 5. The rating scale ROC data matrix.

                        Response
Stimulus   -n  ...  -3  -2  -1  +1  +2  +3  ...  +n
Sa
Sb
Figure 1. The machinery of CSDT. The discriminal processes are
Gaussian densities representing the probability that a stimulus will give
rise to a perceptual event of a particular magnitude. The observer says
“B” whenever a percept exceeds a criterion C, represented by the vertical
line. Normally an Sb stimulus gives rise to a percept that falls above the
criterion, and the affirmative response is called a hit (H). Sometimes an Sa
stimulus gives rise to a percept that exceeds the criterion, and the
affirmative response is then called a false alarm (F). The discriminability
of two stimuli is given by the difference of their z-scores, as inferred from
the accuracy of their performance. In particular, d’ = -[z(H) - z(F)].
[Figure 1 image: Gaussian densities Pa and Pb for stimuli Sa and Sb, with criterion C marked; abscissa: Stimulus Value (250-1250) / Percept Value; ordinate: Probability Density.]
Figure 2. The data from Table 1 plotted as a filled circle in the unit
square. The open circle is derived by assuming symmetry of signals. The
curves are drawn through points that conserve the information
transmitted by the observer.
[Figure 2 here: isoinformation ROCs for T = 0.50 and T = 0.12 in the p(F)-p(H) unit square.]
Figure 3. The relation between transinformation and detectability measured as d'.
[Figure 3 here: transinformation (bits) against d' on log-log coordinates, with T ≈ 0.1d'².]
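The approximation T ≈ 0.1d'² can be checked for an unbiased observer with equal signal priors (a sketch; `transmitted_info` is an illustrative helper, not a function from the text, and the symmetric-Gaussian hit and false-alarm rates are an assumption):

```python
from math import log2
from statistics import NormalDist

def transmitted_info(h, f, p_signal=0.5):
    """Mutual information T (bits) between a binary stimulus and a
    binary response, given hit rate h and false-alarm rate f."""
    p_yes = p_signal * h + (1 - p_signal) * f
    T = 0.0
    for p_s, p_yes_given_s in ((p_signal, h), (1 - p_signal, f)):
        for p_r_given_s, p_r in ((p_yes_given_s, p_yes),
                                 (1 - p_yes_given_s, 1 - p_yes)):
            if p_r_given_s > 0:
                T += p_s * p_r_given_s * log2(p_r_given_s / p_r)
    return T

# Unbiased symmetric Gaussian observer: h = Phi(d'/2), f = Phi(-d'/2)
Phi = NormalDist().cdf
for dp in (0.5, 1.0, 2.0):
    T = transmitted_info(Phi(dp / 2), Phi(-dp / 2))
    print(dp, round(T, 3), round(0.1 * dp ** 2, 3))  # T tracks 0.1 d'^2
```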
Figure 4. Conservation of information accomplished by a shift between the expected equivocation (information loss) from a yes response, p[y]U[S|y], and that from a no response, p[n]U[S|n]. The small filled circles show the data from Figure 2; the unfilled circles show the data from Figure 5 in the condition where the signal probability was held constant.
[Figure 4 here: average entropy given a yes response against average entropy given a no response ("Complementarity of Entropy Expected from Responses").]
Figure 5. Data reported by Green and Swets for an observer biased by varying the signal presentation probability (squares; their Figure 4-1) and by varying the payoffs (circles; their Figure 4-2). The curve is the isoinformation ROC, which is not visually discriminable from that for constant A' (also drawn through the points).
[Figure 5 here: Green & Swets (1966) data with the isoinformation ROC, T = 0.05, in the p(F)-p(H) unit square.]
Figure 6. Data reported by Green and Swets (their Figure 4-5) for another observer biased by varying the signal presentation probability (the same condition as shown by the squares in Figure 5). The continuous curve is the isoinformation ROC. The dashed curve is given by Equation 5.
[Figure 6 here: Green & Swets (1966) data in the p(F)-p(H) unit square.]
Figure 7. Maxent exponential distributions of two random variables on the positive line in the continuous case. The variable with the smaller mean is called Noise, and the other Signal.
[Figure 7 here: exponential Noise and Signal probability densities plotted against stimulus value.]
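For exponential evidence densities the criterion sweep gives the power-law ROC in closed form: since P(X > c) = e^(-c/µ), the hit and false-alarm rates satisfy H = F^(µN/µS) at every criterion. A quick check (µN and µS are arbitrary illustrative values, not fits from the text):

```python
from math import exp

mu_N, mu_S = 1.0, 3.0              # assumed means of the Noise and Signal exponentials
beta = mu_S / mu_N                 # the ratio of exponential means

for c in (0.5, 1.0, 2.0, 4.0):     # sweep the criterion along the evidence axis
    F = exp(-c / mu_N)             # false-alarm rate: P(noise percept > c)
    H = exp(-c / mu_S)             # hit rate: P(signal percept > c)
    assert abs(H - F ** (1 / beta)) < 1e-12   # power-law ROC: H = F**(1/beta)
    print(round(F, 3), round(H, 3))
```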
Figure 8. Data from 4 observers detecting brief flashes of light, from Swets, Tanner, and Birdsall (1964). Performance was manipulated by varying the payoff matrices.
[Figure 8 here: four p(F)-p(H) unit-square panels, one per observer (1-4); Swets, Tanner & Birdsall (1964).]
Figure 9. Data from one observer detecting brief increments in the intensity of tones (Norman, 1963). Each panel shows the data for a different signal-to-noise ratio. Bias was varied with differential payoffs.
[Figure 9 here: six p(F)-p(H) unit-square panels, A-F, with ∆v/v = 0.017, 0.019, 0.022, 0.023, 0.029, and 0.033; Norman (1963).]
Figure 10. The equivalence of the indices of merit for logistic SDT and power ROCs. The x-axis is the parameter β.
[Figure 10 here: 2 ln[H/(1-H)] ≈ d'² plotted against log2(S/N); "Two indices of merit compared."]
Figure 11. Beta, the ratio of the exponential means inferred from power functions fit to the data of 2 observers, versus the relative increment in signal voltage associated with them. Data from Norman (1963).
[Figure 11 here: µS/µN plotted against (∆v+v)/v; "Parameter of IOCs as a Function of Relative Signal Amplitude."]
Figure 12. Rating scale operating characteristics. The inset shows the parameter of the power function against the signal-to-noise ratio in dB. Each datum along a curve is obtained by re-aggregating the data around successive ratings, as though different ratings corresponded to different criteria.
[Figure 12 here: p(r|S) against p(r|N) for the Emmerich (1968) data; inset, µS/µN against 10 log(E/N0).]
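The re-aggregation described in the caption is just a cumulative sum over rating categories, treating each rating boundary as a criterion. A minimal sketch with hypothetical counts (the counts below are invented for illustration, not Emmerich's data):

```python
import numpy as np

# Hypothetical rating counts, most-confident-"signal" category first
noise_counts  = np.array([10, 20, 40, 80, 100])   # responses to noise trials
signal_counts = np.array([100, 80, 40, 20, 10])   # responses to signal trials

# Each criterion pools all ratings at or above it, yielding one ROC point
p_F = np.cumsum(noise_counts) / noise_counts.sum()
p_H = np.cumsum(signal_counts) / signal_counts.sum()
print(np.round(p_F, 3))   # [0.04 0.12 0.28 0.6  1.  ]
print(np.round(p_H, 3))   # [0.4  0.72 0.88 0.96 1.  ]
```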
Figure 13. The entropy of stimuli such as an encyclopedia or a medical measurement may be indefinitely large. Information transmission is limited by the variable with the smallest entropy. This may be the stimulus, the experimenter, or the observer. If the experimenter imposes a binary classification, the maximum information that may be transmitted between observer and experimenter is 1 bit, even though the observer may be able to make finer discriminations.
Figure 14. Digitization loss increases with the relative entropy of the signal. C is the channel capacity for a continuous Gaussian signal, and when encoding is restricted to binary signals; W is the bandwidth of the signal, and S/N is the signal-to-noise ratio. Reprinted from Harmon (1963), with permission.