Uses of Information Theory in Medical Imaging
Norbert Schuff, Ph.D., Center for Imaging of Neurodegenerative Diseases
[email protected]


Page 1: Uses of Information Theory in Medical Imaging

Norbert Schuff, Ph.D.
Center for Imaging of Neurodegenerative Diseases
[email protected]

Page 2: Measurement & Information

Objective

Learn how to quantify information.

Page 3: Topics

I. Measurement of information
II. Image registration using information theory
III. Imaging feature selection using information theory
IV. Image classification based on information theoretic measures

Page 4: Information and Uncertainty

A random generator emits characters: which character comes next? Observing an outcome decreases the uncertainty.

$\log(1) = 0$

$\log(1/3) = -\log(3)$

$\log(1/2) = -\log(2)$

Both combined: $\log(1/3) + \log(1/2) = -\log(3) - \log(2) = -\log(6)$

Note: we assume each symbol is equally likely to appear.

Page 5: More On Uncertainty

A random generator of M symbols: which character comes next? Some symbols may appear more frequently than others.

Shannon entropy:

$H_{Shannon} = -\sum_{i=1}^{M} p_i \log_2(p_i), \qquad \sum_{i=1}^{M} p_i = 1$

Page 6: Example: Entropy Of mRNA

…ACGTAACACCGCACCTG…

Assume "A" occurs in 50% of all instances, "C" in 25%, and "G" and "T" in 12.5% each:

$p_A = \tfrac{1}{2}; \quad p_C = \tfrac{1}{4}; \quad p_G = \tfrac{1}{8}; \quad p_T = \tfrac{1}{8}$

Log-odds $-\log_2(p_i)$: $p_A \rightarrow 1; \quad p_C \rightarrow 2; \quad p_G \rightarrow 3; \quad p_T \rightarrow 3$

$H = \tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 2 + \tfrac{1}{8} \cdot 3 + \tfrac{1}{8} \cdot 3 = 1.75 \text{ bits}$
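
The Shannon entropy formula and the mRNA example above are easy to check numerically. A minimal NumPy sketch, where the helper name is mine and only the probabilities come from the example:

```python
import numpy as np

def shannon_entropy(p, base=2):
    """Shannon entropy H = -sum(p_i * log(p_i)) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # by convention 0*log(0) = 0
    return -np.sum(p * np.log(p)) / np.log(base)

# the mRNA example: p_A = 1/2, p_C = 1/4, p_G = p_T = 1/8
print(shannon_entropy([0.5, 0.25, 0.125, 0.125]))   # -> 1.75 bits
```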

Page 7: Concept Of Entropy

Shannon entropy was formulated by Claude Shannon (1916-2001):
• American mathematician
• Founded information theory with one landmark paper in 1948
• Worked with Alan Turing on cryptography during WWII

History of entropy:
• 1854: Rudolf Clausius, German physicist, developed the theory of heat
• 1876: Willard Gibbs, American physicist, used it for the theory of energy
• 1877: Ludwig Boltzmann, Austrian physicist, formulated the theory of thermodynamics
• 1879: Gibbs re-formulated entropy in terms of statistical mechanics
• 1948: Shannon

Page 8: Three Interpretations of Entropy

• The uncertainty in the outcome of an event
  - Systems with frequent events have less entropy than systems with infrequent events.

• The amount of surprise (information)
  - An infrequently occurring event is a greater surprise, i.e. it carries more information and thus has higher entropy than a frequently occurring event.

• The dispersion in the probability distribution
  - A uniform image has a less dispersed histogram and thus lower entropy than a heterogeneous image.

Page 9: Generalized Entropy

• The following generating function can be used as an abstract definition of entropy:

$H(P) = h\!\left(\frac{\sum_{i=1}^{M} v_i \, \varphi_1(p_i)}{\sum_{i=1}^{M} v_i \, \varphi_2(p_i)}\right)$

• Various definitions of these parameters provide different definitions of entropy.
  - Over 20 definitions of entropy have been found.
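
Rényi entropy, one of the named formulations listed on the next two pages, is a familiar member of such a generalized family. A minimal NumPy sketch, assuming base-2 logarithms; the function name and example probabilities are illustrative:

```python
import numpy as np

def renyi_entropy(p, alpha, base=2):
    """Renyi entropy H_alpha = log(sum p_i^alpha) / (1 - alpha); alpha -> 1 gives Shannon."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):                          # Shannon limit
        return -np.sum(p * np.log(p)) / np.log(base)
    return np.log(np.sum(p ** alpha)) / ((1.0 - alpha) * np.log(base))

probs = [0.5, 0.25, 0.125, 0.125]
print(renyi_entropy(probs, alpha=1.0))   # 1.75 bits (Shannon)
print(renyi_entropy(probs, alpha=2.0))   # collision entropy, smaller than Shannon
```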

Page 10: Various Formulations Of Entropy

Names: Shannon, Renyi, Aczel, Varma, Kapur, Hadra, Arimoto, Sharma

Page 11: Various Formulations Of Entropy II

Names: Taneja, Sharma, Ferreri, Sant'Anna, Picard

Page 12: Entropy In Medical Imaging

Figure: entropy maps and histograms under averaging. A single acquisition (N=1) gives H = 6.0 bits; an average of N=40 acquisitions gives H = 4.0 bits (H = 3.9 bits for the brain-only histogram); the corresponding co-variance map gives H = 10.1 bits.

Page 13: Entropy Of Noisy Images

Figure: entropy maps at decreasing signal-to-noise levels (SNR = 20, 2, and 0.2).

Page 14: Image Registration Using Information Theory

Objective

Learn how to use information theory for image registration and similar procedures.

Page 15: Image Registration

• Define a transform T that maps one image onto another image such that some measure of overlap is maximized.
  - We discuss information theory as a means for generating measures to be maximized over sets of transforms.

Figure: MRI and CT images.

Page 16: Entropy In Image Registration

• Define an estimate of the joint probability distribution of images A and B:
  - a 2-D histogram where each axis designates the number of possible intensity values in the corresponding image
  - each histogram cell is incremented each time a pair (I_1(x,y), I_2(x,y)) occurs in the pair of images ("co-occurrence")
  - if the images are perfectly aligned, the joint 2-D histogram is highly focused; as the images become mis-aligned, the dispersion grows
  - recall that one interpretation of entropy is a measure of histogram dispersion

Page 17: Joint Entropy

• Joint entropy (entropy of the 2-D histogram):

$H(X,Y) = -\sum_{x_i \in X, \, y_i \in Y} p(x_i, y_i) \cdot \log_2\!\left[ p(x_i, y_i) \right]$

• H(X,Y) = H(X) + H(Y) only if X and Y are completely independent.

• Image registration can be guided by minimizing the joint entropy H(X,Y), i.e. the dispersion in the joint histogram of the images is minimized.
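
A minimal NumPy sketch of the joint entropy of two images, estimated from their 2-D intensity histogram as described above; the function name, bin count, and toy images are my own choices:

```python
import numpy as np

def joint_entropy(img1, img2, bins=64):
    """Joint entropy (bits) of the 2-D intensity histogram of two images."""
    hist2d, _, _ = np.histogram2d(img1.ravel(), img2.ravel(), bins=bins)
    p = hist2d / hist2d.sum()          # joint probability estimate
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# toy check: an image against itself vs. against a noisy copy (hypothetical data)
rng = np.random.default_rng(0)
img = rng.normal(size=(128, 128))
print(joint_entropy(img, img))                               # focused histogram, lower H
print(joint_entropy(img, img + rng.normal(size=img.shape)))  # more dispersion, higher H
```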

Page 18: Example

Figure: joint entropy of the 2-D histogram for rotations of an image with respect to itself by 0, 2, 5, and 10 degrees.

Page 19: (figure)
Page 20: (figure)
Page 21: Entire Rotation Series

Figure: joint 2-D histograms over the entire rotation series (0, 2, 5, and 10 degrees).

Page 22: Mutual Information

$MI(X,Y) = \sum_{x_i \in X} \sum_{y_i \in Y} p(x_i, y_i) \cdot \log\!\left( \frac{p(x_i, y_i)}{p(x_i)\, p(y_i)} \right)$

• Measures the dependence of the two distributions.
• MI(X,Y) is maximized when the images are co-registered.
• In feature selection, choose the features that minimize MI(X,Y) to ensure they are not related.
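
A minimal NumPy sketch of a histogram-based estimate of the formula above; the function name and bin count are illustrative, and the same joint-histogram construction as on page 17 is assumed:

```python
import numpy as np

def mutual_information(img1, img2, bins=64):
    """MI(X,Y) = sum p(x,y) * log2( p(x,y) / (p(x) p(y)) ), from a joint intensity histogram."""
    hist2d, _, _ = np.histogram2d(img1.ravel(), img2.ravel(), bins=bins)
    pxy = hist2d / hist2d.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    m = pxy > 0
    return np.sum(pxy[m] * np.log2(pxy[m] / (px @ py)[m]))

# note: this equals H(X) + H(Y) - H(X,Y), the alternative form used on the next page
```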

Page 23: Use Of Mutual Information for Image Registration

• Alternative definition(s) of mutual information MI:
  - MI(X,Y) = H(Y) - H(Y|X) = H(X) - H(X|Y): the amount by which the uncertainty of X is reduced if Y is known, or vice versa.
  - MI(X,Y) = H(X) + H(Y) - H(X,Y): maximizing MI(X,Y) is equivalent to minimizing the joint entropy H(X,Y).

• The advantage of using mutual information over joint entropy is that MI includes the entropy of each distribution separately.

• Mutual information works better than joint entropy alone in regions with low contrast: such regions have high joint entropy, but this is offset by high individual entropies as well, so the overall mutual information will be low.

• Mutual information is maximized for registered images.

Page 24: Properties of Mutual Information

• MI is symmetric: MI(X,Y) = MI(Y,X)
• MI(X,X) = H(X)
• MI(X,Y) <= H(X)
  - The information each image contains about the other image cannot be greater than the total information in each image.
• MI(X,Y) >= 0
  - Cannot increase the uncertainty in X by knowing Y.
• MI(X,Y) = 0 only if X and Y are independent.

Page 25: Processing Flow for Image Registration Using Mutual Information

Input images -> pre-processing -> probability density estimation -> mutual information -> optimization scheme -> image transformation -> output image (the density estimation, mutual information, and transformation steps iterate under the optimization scheme).
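
A schematic sketch of such a registration loop, assuming a translation-only transform and an exhaustive search in place of a real optimization scheme; function names, the bin count, and the search range are illustrative, not from the slides. Real pipelines use richer transforms (rigid, affine, deformable) and gradient-based or Powell-type optimizers.

```python
import numpy as np

def mi(a, b, bins=64):
    """Mutual information (bits) of two images from their joint intensity histogram."""
    h, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = h / h.sum()
    px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
    m = pxy > 0
    return np.sum(pxy[m] * np.log2(pxy[m] / (px @ py)[m]))

def register_translation(fixed, moving, max_shift=5):
    """Brute-force search over integer (dy, dx) shifts; returns the shift maximizing MI."""
    best_score, best_shift = -np.inf, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(moving, dy, axis=0), dx, axis=1)
            score = mi(fixed, shifted)
            if score > best_score:
                best_score, best_shift = score, (dy, dx)
    return best_shift, best_score

# usage: shift, score = register_translation(fixed_img, moving_img)
```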

Page 26: Relative Entropy

• Assume we know X; we then seek the remaining entropy of Y:

$H(Y|X) = -\sum_{x_i \in X} p(x_i) \sum_{y_i \in Y} p(y_i|x_i) \log\!\left[ p(y_i|x_i) \right]$

$\qquad\quad = -\sum_{x_i \in X, \, y_i \in Y} p(x_i, y_i) \log\!\left[ \frac{p(y_i, x_i)}{p(x_i)} \right]$

Also known as Kullback-Leibler divergence; an index of information gain.

Page 27: Measurement Of Similarity Between Distributions

• Question: How close (in bits) is a distribution X to a model distribution Ω?

Relative entropy (Kullback-Leibler divergence):

$D_{KL}(X \,\|\, \Omega) = \sum_{x_i \in \Omega} p(x_i) \cdot \log\!\left( \frac{p(x_i)}{q(x_i)} \right)$

• D_KL = 0 only if X and Ω are identical; otherwise D_KL > 0.
• D_KL(X||Ω) is NOT equal to D_KL(Ω||X).
• Symmetrical form: $D_{sKL}(X \,\|\, \Omega) = \frac{1}{2}\left( D_{KL}(X \,\|\, \Omega) + D_{KL}(\Omega \,\|\, X) \right)$
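
A minimal NumPy sketch of the KL divergence and its symmetrized form; the function names and example distributions are illustrative, and the code assumes q_i > 0 wherever p_i > 0 (otherwise D_KL is infinite):

```python
import numpy as np

def kl_divergence(p, q, base=2):
    """D_KL(p || q) = sum p_i * log(p_i / q_i), assuming q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m])) / np.log(base)

def symmetric_kl(p, q, base=2):
    """Symmetrized form: 0.5 * (D_KL(p||q) + D_KL(q||p))."""
    return 0.5 * (kl_divergence(p, q, base) + kl_divergence(q, p, base))

p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]
print(kl_divergence(p, q), symmetric_kl(p, q))   # both in bits; zero only if p == q
```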

Page 28: Uses Of The Symmetrical KL

How much noisier is image A compared to image B?

Figure: maps and histograms of images A and B at increasing noise levels, and the sKL distance of B from A.

Page 29: Uses Of The Symmetrical KL

Detect an outlier in an image series.

Figure: images #1, #2, ..., #5, #6, ..., #13 of the series and the sKL value plotted against the series number.

Page 30: Image Feature Selection Using Information Theory

Objective

Learn how to use information theory for determining imaging features.

Page 31: Mutual Information based Feature Selection Method

• We test the ability to separate two classes (features).
• Let X be the feature vector, e.g. co-occurrences of intensities.
• Y is the classification, e.g. particular intensities.
• How do we maximize the separability of the classification?

Page 32: Mutual Information based Feature Selection Method

• MI tests a feature's ability to separate two classes.
  - Based on definition 3) of mutual information:

$MI(X,Y) = \sum_{x_i \in X} \sum_{y_i \in Y} p(x_i, y_i) \cdot \log\!\left( \frac{p(x_i, y_i)}{p(x_i)\, p(y_i)} \right)$

  - Here X is the feature vector and Y is the classification.
  - Note that X is continuous while Y is discrete.
  - By maximizing MI(X,Y), we maximize the separability of the feature.
• Note that this method only tests each feature individually (see the sketch below).
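
A minimal sketch of ranking features by their individual MI with the class labels, discretizing each continuous feature into bins as a simple density estimate; the function names, bin count, and toy data are illustrative:

```python
import numpy as np

def mi_feature_score(feature, labels, bins=16):
    """MI between one continuous feature (discretized into bins) and discrete class labels."""
    edges = np.histogram_bin_edges(feature, bins=bins)[1:-1]        # interior bin edges
    binned = np.digitize(feature, edges)                            # values 0 .. bins-1
    classes = np.unique(labels)
    joint = np.zeros((bins, classes.size))
    np.add.at(joint, (binned, np.searchsorted(classes, labels)), 1)
    pxy = joint / joint.sum()
    px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
    m = pxy > 0
    return np.sum(pxy[m] * np.log2(pxy[m] / (px @ py)[m]))

# rank hypothetical features by their individual MI with the class labels
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)                        # two classes
X = np.column_stack([y + 0.3 * rng.normal(size=200),    # feature related to the class
                     rng.normal(size=200)])             # unrelated feature
scores = [mi_feature_score(X[:, j], y) for j in range(X.shape[1])]
print(np.argsort(scores)[::-1])   # feature indices, most separating first
```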

Page 33: Joint Mutual Information based Feature Selection Method

• Joint MI tests a feature's independence from all other features.
• Two implementations:
  1) Compute all individual MIs and sort them from high to low.
  2) Test the joint MI of the current feature while keeping all the others.
• Keep the features with the lowest joint MI (implies independence).
• Implement by selecting features that maximize this criterion.

Page 34: Information Of Time Series

Objective

Learn how to use information theory for quantifying functional brain signals.

Page 35: How Much Information Do fMRI Fluctuations Carry?

Figure: fMRI time series a(t), b(t), and c(t) from three regions A, B, and C.

Conventional analysis uses correlations, e.g.:

$corr(A,B) = \frac{1}{M-1} \sum_{i=1}^{M} \frac{(a_i - \bar{a}) \cdot (b_i - \bar{b})}{std(a) \cdot std(b)}$

However, the correlation does not measure information, i.e. uncertainty, in bits.

Page 36: Quantification of Uncertainty in fMRI using Entropy

Figure: a sliding block of length L moves over the time series of A, B, and C; the probability distribution Pr(s_L) over blocks of length L is estimated.

Block entropy H(L): the uncertainty (in bits) in identifying fluctuation patterns, plotted as a function of L.
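
A minimal sketch of a block-entropy estimate for a single time series, assuming the continuous signal is first quantized into a few symbols; the function name, the number of quantization levels, and the toy signal are illustrative:

```python
import numpy as np

def block_entropy(series, L, n_levels=4):
    """Block entropy H(L): Shannon entropy (bits) of the distribution of length-L symbol blocks."""
    edges = np.quantile(series, np.linspace(0, 1, n_levels + 1)[1:-1])      # quantile bin edges
    symbols = np.digitize(series, edges)                                    # discretized signal
    blocks = [tuple(symbols[i:i + L]) for i in range(len(symbols) - L + 1)] # sliding blocks
    _, counts = np.unique(blocks, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(2)
signal = np.sin(np.linspace(0, 20 * np.pi, 500)) + 0.2 * rng.normal(size=500)
print([round(block_entropy(signal, L), 2) for L in range(1, 6)])   # H(L) grows with L
```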

Page 37: Measuring Difficulty in Overcoming Uncertainty

Figure: as before, a sliding block of length L over the time series of A, B, and C yields the pattern distribution Pr(s_L) and the block-entropy curve H(L), with EE and TI marked.

• Block entropy H(L): uncertainty in identifying fluctuation patterns (e.g. patterns of block size L).
• Excess entropy EE: intrinsic pattern memory.
• Transient information TI: $TI = \sum_{L} \left[ EE - H(L) \right]$ (in bits).
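
One common way to estimate the entropy rate h and the excess entropy EE from the block-entropy curve is to fit its near-linear tail, H(L) ≈ EE + h·L; a sketch under that assumption (the block-entropy helper from the previous page is restated so the snippet is self-contained, and all names are illustrative):

```python
import numpy as np

def block_entropy(series, L, n_levels=4):
    """Block entropy H(L) in bits of the length-L symbol blocks of a discretized signal."""
    edges = np.quantile(series, np.linspace(0, 1, n_levels + 1)[1:-1])
    symbols = np.digitize(series, edges)
    blocks = [tuple(symbols[i:i + L]) for i in range(len(symbols) - L + 1)]
    _, counts = np.unique(blocks, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def entropy_rate_and_excess_entropy(series, max_L=8, n_levels=4):
    """Estimate the entropy rate h (slope) and excess entropy EE (intercept) from the
    near-linear tail of the block-entropy curve, H(L) ~ EE + h*L for large L."""
    Ls = np.arange(1, max_L + 1)
    H = np.array([block_entropy(series, L, n_levels) for L in Ls])
    h, EE = np.polyfit(Ls[-4:], H[-4:], 1)   # slope and intercept of the tail
    return h, EE

rng = np.random.default_rng(2)
signal = np.sin(np.linspace(0, 20 * np.pi, 500)) + 0.2 * rng.normal(size=500)
print(entropy_rate_and_excess_entropy(signal))
```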

Page 38: Transient Information in fMRI (figure)

Page 39: Image Classification Based On Information Theoretic Measures

Objective

Learn how to use information theory for classification of complex image patterns.

Page 40: Complexity In Medical Imaging

• In imaging, local correlations may introduce uncertainty in predicting intensity patterns.
• Complexity describes the inherent difficulty in reducing uncertainty in predicting patterns.
• We need a metric to quantify complexity, i.e. to reduce uncertainty in the distribution of observed patterns.

Page 41: Example: Patterns Of Cortical Thinning

Figure: brain MRI and a histological cut showing the cortical ribbon.

Page 42: Patterns Of Cortical Thinning In Dementia

Figure: cortical thinning maps in Alzheimer's disease and frontotemporal dementia; warmer colors represent more thinning.

Page 43: Information Theoretic Measures of Image Complexity

A window of size L is parsed over an image segment, each location taking one of three states Z_i, and the resulting pattern statistics are quantified information-theoretically:

Entropy: $H = -\sum_{Z, Z'} P(Z, Z') \cdot \log P(Z, Z')$

Statistical complexity: $SC = -\sum_{Z} \sum_{Z'} P(Z'|Z) \cdot \log P(Z'|Z)$

Excess entropy: $EE = \sum_{L} \left[ H(L) - H(L-1) - h \right]$, where h is the entropy rate of the patterns.
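
A minimal sketch of how H and SC might be computed from the co-occurrence of neighboring states, following the formulas above; the slide's window-parsing details are not fully recoverable here, so horizontally adjacent pixels stand in for the linked states (Z, Z'), and the function name and three-state quantization are illustrative:

```python
import numpy as np

def complexity_measures(states):
    """H and SC from the co-occurrence of neighboring states Z, Z' in a 2-D label image."""
    z, z_right = states[:, :-1].ravel(), states[:, 1:].ravel()   # horizontal neighbor pairs
    n = int(states.max()) + 1
    joint = np.zeros((n, n))
    np.add.at(joint, (z, z_right), 1)
    P = joint / joint.sum()                                      # P(Z, Z')
    H = -np.sum(P[P > 0] * np.log2(P[P > 0]))                    # joint entropy of patterns
    Pz = P.sum(axis=1, keepdims=True)
    cond = np.divide(P, Pz, out=np.zeros_like(P), where=Pz > 0)  # P(Z' | Z)
    SC = -np.sum(cond[cond > 0] * np.log2(cond[cond > 0]))       # statistical complexity (per the formula above)
    return H, SC

# toy example: an image quantized into three states (0, 1, 2)
rng = np.random.default_rng(3)
img = rng.normal(size=(64, 64))
states = np.digitize(img, np.quantile(img, [1/3, 2/3]))
print(complexity_measures(states))
```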

Page 44: Summary Complexity Measures

• Entropy (H): measures the number and uniformity of distributions over observed patterns (joint entropy). For example, a higher H represents increasing spatial correlations in image regions.

• Statistical Complexity (SC): quantifies the information, i.e. uncertainty, contained in the distribution of observed patterns. For example, a higher SC represents an increase in locally correlated patterns.

• Excess Entropy (EE): measures the convergence rate of the entropy. A higher EE represents an increase in long-range correlations across regions.

Page 45: Complexity Based Detection Of Cortical Thinning (Simulations)

H, SC, and EE automatically identify cortical thinning and do a good job at separating the groups for which the cortex was thinned from the group for which there was no thinning.

Page 46: Detection Of Cortical Thinning Pattern In Dementias Using Complexity Measures

Figure: maps for control, AD, and FTD groups in an RGB representation of H, SC, and EE:
• More saturated red means more spatial correlations.
• More saturated blue means more locally correlated patterns.
• More saturated green means more long-range spatial correlations.
• Simultaneous increases/decreases of H, SC, and EE result in brighter/darker levels of gray.

Page 47: Summary

• Entropy: amount of uncertainty in observations
• Conditional entropy: uncertainty in observations given the distribution of other observations
• Joint entropy: uncertainty in observations across multiple distributions
• Relative entropy (aka Kullback-Leibler divergence): similarity between distributions
• Mutual information: the amount of overlap between distributions
• Statistical complexity: uncertainty in linked conditional distributions
• Excess entropy: convergence of entropy
• Transient information: difficulty in reducing uncertainty

Page 48: Literature

Author: James P. Sethna
Publisher: Oxford: Oxford University Press, 2006.

Author: H. Haken
Publisher: Berlin; New York: Springer, 2006.