Lec14: Evaluation Framework for Medical Image Segmentation

MEDICAL IMAGE COMPUTING (CAP 5937)

LECTURE 14: Evaluation Framework for Medical Image Segmentation

Dr. Ulas Bagci
HEC 221, Center for Research in Computer Vision (CRCV), University of Central Florida (UCF), Orlando, FL [email protected] or [email protected]

SPRING 2017

Outline
• How to evaluate accuracy of image segmentation?
  – Gold standard ~ surrogate of truth
  – Qualitative
    • Visual
    • Inter- and intra-observer agreement rates
  – Quantitative
    • Volumetric measurements (regression)
    • Region overlaps
    • Shape based measurements
    • Theoretical comparisons
    • STAPLE, uncertainty guidance, and evaluation w/o truths

Visual Assessment

Manual image segmentation from the full spectrum of IDEAL MRI data to delineate red: SAT, green: VAT, blue: liver, yellow: pancreas, purple: kidneys. Left to right: water-only, fat-only, in-phase, out-of-phase, fat fraction, and segmented labels from SliceOmatic.

Reference: Assessment of Abdominal Adiposity and Organ Fat with Magnetic Resonance Imaging (chp11).

Inherent Uncertainty


Comparison of glioblastoma multiforme (GBM) segmentation results on an axial slice: semi-automatic segmentation under Slicer (green, left image) and pure manual segmentation (blue, middle image). Egger et al., Nat Sci Rep., 2012.

Inherent Uncertainty

red: endocardium; green: epicardium; yellow: ground truth.
Queiros et al., European Heart Journal, 2016.

Segmentation Evaluation

Can be considered to consist of two components:

(1) Theoretical

Study mathematical equivalence among algorithms.

(2) Empirical

Study practical performance of algorithms in specific application domains.

Segmentation Evaluation: Theoretical

Fundamental challenges in segmentation evaluation:

(Ch1) Are major pI (purely image based) frameworks such as active contours, level sets, graph cuts, fuzzy connectedness, and watersheds truly distinct, or does some level of equivalence exist among them?

(Ch2) How to develop truly distinct methods constituting real advance?

(Ch3) How to choose a method for a given application domain?

(Ch4) How to set an algorithm optimally for an application domain?

Currently any method A can be shown empirically to be better than any method B, even when they are equivalent.

Segmentation Evaluation: Theoretical

Attributes commonly used by segmentation methods:

(1) Connectedness
(2) Texture
(3) Smoothness of boundary
(4) Gradient / homogeneity
(5) Shape information about object
(6) Noise handling
(7) Optimization employed
(8) Orientedness of boundary

Attributes utilized by well-known delineation models

Method        | Connected         | Gradient          | Texture           | Smooth | Shape | Noise    | Optimize
Fuzzy con     | Yes               | Gr = hom affinity | Obj feat affinity | No     | No    | Scale FC | In RFC
Chan-Vese     | No                | No                | Yes               | Yes    | No    | No       | Yes
Mum-Shah      | No                | No                | Yes               | Yes    | No    | Yes      | Yes
KWT snake     | Boundary          | Yes               | No                | Yes    | No    | No       | Yes
MSV LS        | Fg when expanding | Yes               | No                | No     | No    | No       | No
Live wire     | Boundary          | Yes               | Yes               | Yes    | User  | No       | Yes
Act. shape    | Yes               | No                | No                | No     | Yes   | No       | Yes
Act. app      | Yes               | No                | Yes               | No     | Yes   | No       | Yes
Graph cut     | Usually not       | Yes               | Possible          | No     | No    | No       | Yes
Clustering    | No                | No                | Yes               | No     | No    | No       | Yes
Deep Learning | Yes               | Yes               | Yes               | Yes    | Yes   | Yes      | Yes

Segmentation Evaluation: Empirical

T: A task. Example: Estimating the volume of brain.

B: A body region. Example: Head.

P: Imaging protocol. Example: T2 weighted MR imaging with a particular set of parameters.

Application domain: A particular triple $\langle T, B, P \rangle$.

Q: A set of scenes acquired for a particular application domain $\langle T, B, P \rangle$.

The segmentation efficacy of a method M in an application domain $\langle T, B, P \rangle$ may be characterized by three groups of factors:

Precision (Reliability): Repeatability taking into account all subjective actions influencing the result.

Accuracy (Validity): Degree to which the result agrees with truth.

Efficiency (Viability): Practical viability of the method.

Validation of Image Segmentation
• Spectrum of accuracy versus realism in reference standard.
• Digital phantoms.
  – Ground truth known accurately.
  – Not so realistic.
• Acquisitions and careful segmentation.
  – Some uncertainty in ground truth.
  – More realistic.
• Autopsy/histopathology.
  – Addresses pathology directly; resolution.
• Clinical data?
  – Hard to know ground truth.
  – Most realistic model.

Slide Credit: N. Archip

Comparison To Higher Resolution

[Figure panels: MRI, Photograph, MRI.]

Provided by Peter Ratiu and Florin Talos. Credit: N. Archip

Precision

Repeatability taking into account all subjective actions that influence the segmentation result.

• Intra operator variations
• Inter operator variations
• Intra scanner variations
• Inter scanner variations

Inter scanner variations include variations due to the same brand and different brands.

Precision

A measure of precision $PR_M^{T_i}$ for method M in a trial that produces $C_M^{O_1}$ and $C_M^{O_2}$ for situation $T_i$ is defined on this slide.

[Equation for $PR_M^{T_i}$ in terms of $C_M^{O_1}$ and $C_M^{O_2}$; the surviving fragments include "1 −" and "i = 3, 4".]

Situations $T_i$: intra/inter operator, intra/inter scanner.

$C_M^{O_1}$, $C_M^{O_2}$ may be binary or fuzzy segmentations.
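A minimal sketch of one way to score repeatability between two trials, assuming an overlap ratio (intersection over union) as the agreement measure. This choice is an assumption of the sketch and is not necessarily the exact form of the precision measure on the slide. The function name `precision_overlap` and the toy arrays are hypothetical. For fuzzy segmentations, min and max stand in for intersection and union.

```python
import numpy as np

def precision_overlap(seg1, seg2):
    """Overlap ratio between two repeated segmentations of the same object.

    Binary inputs give intersection over union. For fuzzy memberships in
    [0, 1], voxelwise min and max play the roles of intersection and union
    (an assumption of this sketch).
    """
    a = np.asarray(seg1, dtype=float)
    b = np.asarray(seg2, dtype=float)
    union = np.maximum(a, b).sum()
    if union == 0:
        return 1.0  # both segmentations empty: treat as perfect agreement
    return float(np.minimum(a, b).sum() / union)

# Two trials by the same operator (toy data, not from the lecture).
trial_1 = np.array([[0, 1, 1], [0, 1, 1], [0, 0, 0]])
trial_2 = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]])
print(precision_overlap(trial_1, trial_2))  # 3 shared voxels out of 4 in the union
```

The same function can be applied to repeated scans (scanner variations) once the two segmentations are registered to a common grid.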

Accuracy

The degree to which segmentations agree with true segmentation.

Surrogates of truth are needed.

For any image C acquired for application domain $\langle T, B, P \rangle$:

$C_M^O$ - segmentation of O in C by method M,

$C_{td}$ - surrogate of true delineation of O in C.

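[Figure: Venn diagram inside the reference set $U_d$. The true segmentation $C_{td}$ and the segmentation by algorithm M, $C_M^O$, overlap in the TP region; FN lies in $C_{td}$ only, FP lies in $C_M^O$ only, and TN lies outside both.]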

$$FNVF_M^d = \frac{|C_{td} - C_M^O|}{|C_{td}|}, \qquad TPVF_M^d = \frac{|C_{td} \cap C_M^O|}{|C_{td}|},$$

$$FPVF_M^d = \frac{|C_M^O - C_{td}|}{|U_d - C_{td}|}, \qquad TNVF_M^d = \frac{|U_d - C_M^O - C_{td}|}{|U_d - C_{td}|},$$

$U_d$: A binary scene representing a reference super set (for example, this may be the body region that is imaged).

$FNVF_M^d$: Amount of tissue truly in O that is missed by M.

$FPVF_M^d$: Amount of tissue falsely delineated by M.
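The four volume fractions can be computed directly from binary masks. The sketch below assumes NumPy arrays for the segmentation, the surrogate truth, and the reference super set; the function name `delineation_fractions` and the toy masks are hypothetical. It also assumes the truth and the background are both non-empty.

```python
import numpy as np

def delineation_fractions(seg, truth, universe=None):
    """TPVF, FNVF, FPVF, TNVF for binary masks inside a reference set U_d.

    Assumes the truth mask and the background (U_d minus truth) are non-empty.
    """
    seg = np.asarray(seg, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    if universe is None:
        universe = np.ones(truth.shape, dtype=bool)  # whole scene as U_d
    universe = np.asarray(universe, dtype=bool)
    seg, truth = seg & universe, truth & universe

    n_truth = truth.sum()
    n_background = (universe & ~truth).sum()
    return {
        "TPVF": (seg & truth).sum() / n_truth,
        "FNVF": (truth & ~seg).sum() / n_truth,
        "FPVF": (seg & ~truth).sum() / n_background,
        "TNVF": (universe & ~seg & ~truth).sum() / n_background,
    }

# Toy 4x4 scene: truth is a 2x2 block, the segmentation is shifted by one column.
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True
seg = np.zeros((4, 4), dtype=bool)
seg[1:3, 2:4] = True
fractions = delineation_fractions(seg, truth)
print(fractions)
```

By construction, TPVF + FNVF = 1 and FPVF + TNVF = 1, so reporting one fraction from each pair is enough.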

Requirements for accuracy metrics:

(1) Capture M’s behavior of trade-off between FP and FN.
(2) Satisfy laws of tissue conservation:

$$FNVF_M^d = 1 - TPVF_M^d, \qquad FPVF_M^d = 1 - TNVF_M^d$$

(3) Capable of characterizing the range of behavior of M.
(4) Any monotonic function g(FNVF, FPVF) is fine as a metric.
(5) Appropriate for $\langle T, B, P \rangle$.



Delineation Operating Characteristic

[Figure: DOC curve, 1-FNVF (vertical axis) against FPVF (horizontal axis). Brain WM segmentation in PD MR images.]

Each value of parameter vector p of M gives a point on the DOC curve. The DOC curve characterizes the behavior of M over a range of parametric values of M.

$A_M$: Area under the DOC curve
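A minimal sketch of estimating the area under the DOC curve from a handful of operating points, assuming the trapezoidal rule over points sorted by FPVF. The function name `doc_area` and the sample points are hypothetical, not measurements from the lecture.

```python
import numpy as np

def doc_area(fpvf, tpvf):
    """Trapezoidal area under a DOC curve given (FPVF, 1 - FNVF) points."""
    fpvf = np.asarray(fpvf, dtype=float)
    tpvf = np.asarray(tpvf, dtype=float)
    order = np.argsort(fpvf, kind="stable")  # integrate left to right
    x, y = fpvf[order], tpvf[order]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

# One point per parameter setting p of method M (illustrative values).
fpvf_points = [0.0, 0.1, 0.5, 1.0]
tpvf_points = [0.0, 0.6, 0.9, 1.0]   # tpvf = 1 - FNVF
area = doc_area(fpvf_points, tpvf_points)
print(area)
```

The estimate only covers the FPVF range actually sampled, so the parameter sweep should span the operating range of interest.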

Optimally setting an algorithm for $\langle T, B, P \rangle$

[Figure: DOC curve, 1-FNVF against FPVF, both axes from 0 to 1.]

p - parameter vector for method M

$g_p(FPVF, FNVF)$ - monotonic function

$p^* = \arg\min_p [g_p(FPVF, FNVF)]$

Set M to operate at $p^*$.
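Choosing the operating point can be sketched as a search over evaluated parameter settings. The example assumes g is the Euclidean distance of (FPVF, FNVF) from the ideal point (0, 0), which is only one possible monotonic function; the function name `best_operating_point` and the sample values are hypothetical.

```python
import math

def best_operating_point(results, g=math.hypot):
    """Return the parameter p whose (FPVF, FNVF) pair minimizes g.

    results: iterable of (p, fpvf, fnvf) tuples, one per parameter setting.
    g: monotonic cost in both arguments; defaults to distance from (0, 0).
    """
    best = min(results, key=lambda r: g(r[1], r[2]))
    return best[0]

# (p, FPVF, FNVF) for three settings of method M (illustrative values).
evaluated = [(0.2, 0.30, 0.05), (0.4, 0.10, 0.10), (0.6, 0.02, 0.40)]
p_star = best_operating_point(evaluated)
print(p_star)
```

A different g, for example one that penalizes FN more heavily than FP, can be passed in when the application domain calls for it.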

Existent Segmentation Data

[Figure: Original image and segmentations by Expert 1, Expert 2, Expert 3, Expert 4.]

• Manual segmentation performed by 4 independent experts
• Low-grade glioma

Expert and Student Segmentations

[Figure: Test image alongside four unlabeled segmentations, then revealed as Expert consensus, Student 1, Student 2, Student 3.]

Efficiency

Describes practical viability of a method.

Four factors should be considered:

(1) Computational time for one-time training of M: $t_M^{c_1}$

(2) Computational time for segmenting each scene: $t_M^{c_2}$

(3) Human time for one-time training of M: $t_M^{h_1}$

(4) Human time for segmenting each scene: $t_M^{h_2}$

(2) and (4) are crucial. (4) determines the degree of automation of M.

Precision:

$PR_M^{T_1}$: intra operator
$PR_M^{T_2}$: inter operator
$PR_M^{T_3}$: intra scanner
$PR_M^{T_4}$: inter scanner

Accuracy:

$FNVF_M^d$: FN fraction for delineation
$FPVF_M^d$: FP fraction for delineation
$A_M$: Area under the DOC curve

Efficiency:

$t_M^{c_1}$: computational time for algorithm training
$t_M^{c_2}$: computational time for scene segmentation
$t_M^{h_1}$: operator time for algorithm training
$t_M^{h_2}$: operator time for scene segmentation

Remarks

(1) Precision, accuracy, efficiency are interdependent.

accuracy → efficiency.
precision and accuracy → difficult.

(2) “Automatic segmentation method” has no meaning unless the results are proven on a large number of data sets with acceptable precision, accuracy, efficiency, and with $t_M^{h_2} = 0$.

(3) A descriptive answer to “is method M1 better than M2 under $\langle T, B, P \rangle$?” in terms of the 11 parameters is more meaningful than a “yes” or “no” answer.

(4) DOC is essential to describe the range of behavior of M.

Velazquez et al., Scientific Reports, 2013.

Shape Based Metrics for Segmentation Evaluation


[Figure: two segmentation examples. First: Sensitivity = 94.69%, Specificity = 94.19%. Second: Sensitivity = 72.99%, Specificity = 78.16%.]

If you use only DSC (Dice similarity coefficient, an overlap measure), the DSC values are similar in both examples, but the sensitivity and specificity values are not.

Is DSC alone sufficient?
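To compare the three numbers on the same pair of masks, a minimal sketch using the standard confusion-count definitions of DSC, sensitivity, and specificity. The function name `overlap_scores` and the toy masks are hypothetical; the sketch assumes both foreground and background are present in the truth.

```python
import numpy as np

def overlap_scores(seg, truth):
    """DSC, sensitivity, and specificity from two binary masks."""
    seg = np.asarray(seg, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(seg & truth)
    fp = np.sum(seg & ~truth)
    fn = np.sum(~seg & truth)
    tn = np.sum(~seg & ~truth)
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy 4x4 scene: truth is a 2x2 block, the segmentation is shifted by one column.
truth_mask = np.zeros((4, 4), dtype=bool)
truth_mask[1:3, 1:3] = True
seg_mask = np.zeros((4, 4), dtype=bool)
seg_mask[1:3, 2:4] = True
scores = overlap_scores(seg_mask, truth_mask)
print(scores)
```

Note that specificity depends on how much background is included in the scene, which is one reason overlap scores alone can mislead.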

Hausdorff Distance
• Can be used as a complementary evaluation metric to the overlap measure for measuring boundary mismatches!
• Lower Hausdorff Distance (HD), better segmentation accuracy!

$$HD(A, B) = \max\left( \max_{a \in A} d_B(a), \; \max_{b \in B} d_A(b) \right)$$

where $d_B(a) = \min_{b \in B} d(a, b)$ is the distance of one point a on A from B.
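The Hausdorff distance can be sketched for two small boundary point sets, assuming Euclidean distance for d(a, b) and a brute-force pairwise computation that is only practical for modest numbers of points. The function names and sample points are hypothetical.

```python
import numpy as np

def directed_distances(points_a, points_b):
    """d_B(a) for every a in A: distance from a to its nearest point in B."""
    a = np.asarray(points_a, dtype=float)[:, None, :]
    b = np.asarray(points_b, dtype=float)[None, :, :]
    pairwise = np.sqrt(((a - b) ** 2).sum(axis=2))
    return pairwise.min(axis=1)

def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance: the larger of the two directed maxima."""
    return float(max(directed_distances(points_a, points_b).max(),
                     directed_distances(points_b, points_a).max()))

# Boundary points of two contours (toy data).
contour_a = [(0, 0), (1, 0)]
contour_b = [(0, 0), (4, 0)]
hd = hausdorff_distance(contour_a, contour_b)
print(hd)
```

Because HD is driven by the single worst mismatch, it is sensitive to outlier boundary points, which is what makes it complementary to overlap measures.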

Segmentation Evaluation: STAPLE

• STAPLE (Simultaneous Truth and Performance Level Estimation):
  – An algorithm for estimating performance and ground truth from a collection of independent segmentations.
  – Warfield, Zou, Wells, MICCAI 2002.
  – Warfield, Zou, Wells, IEEE TMI 2004.
  – Publicly available.
  – The STAPLE algorithm (Warfield et al., 2004) is a region formulation for producing consensus segmentations.
  – When foreground is small → weight w is small.
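A simplified sketch of the idea behind binary STAPLE, assuming an expectation-maximization loop with one sensitivity p and one specificity q per rater and a single global foreground prior. It is meant as an illustration under those assumptions, not as the published or publicly available implementation; the function name `staple_binary`, the initial values, and the toy decisions are hypothetical.

```python
import numpy as np

def staple_binary(decisions, prior=None, n_iter=50, eps=1e-6):
    """Estimate a consensus probability per voxel and rater performance.

    decisions: array of shape (n_voxels, n_raters) with 0/1 labels.
    Returns (w, p, q): foreground probability per voxel, and per-rater
    sensitivity and specificity.
    """
    d = np.asarray(decisions, dtype=float)
    if prior is None:
        prior = d.mean()                  # global foreground prior (assumption)
    p = np.full(d.shape[1], 0.9)          # initial sensitivities (assumption)
    q = np.full(d.shape[1], 0.9)          # initial specificities (assumption)
    for _ in range(n_iter):
        # E-step: probability that each voxel is truly foreground.
        a = prior * np.prod(np.where(d == 1, p, 1 - p), axis=1)
        b = (1 - prior) * np.prod(np.where(d == 1, 1 - q, q), axis=1)
        w = a / (a + b)
        # M-step: re-estimate each rater's performance against the consensus.
        p = (w[:, None] * d).sum(axis=0) / w.sum()
        q = ((1 - w)[:, None] * (1 - d)).sum(axis=0) / (1 - w).sum()
        p = np.clip(p, eps, 1 - eps)      # keep probabilities away from 0 and 1
        q = np.clip(q, eps, 1 - eps)
    return w, p, q

# Six voxels, three raters; the third rater disagrees on two voxels (toy data).
votes = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 0],
                  [0, 0, 0], [0, 0, 0], [0, 0, 1]])
w, p, q = staple_binary(votes)
print(np.round(w, 3), np.round(p, 3), np.round(q, 3))
```

Thresholding w at 0.5 gives a consensus mask, and p and q rank the raters; the rater who disagrees with the others ends up with lower estimated performance.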

Segmentation Evaluation: STAPLE
• Segmentations are generated by sampling independently at each voxel.
• However, the produced segmentations may not be realistic for two reasons.
  – First, the variability of the segmentation does not account for the intensity in the image, such that borders with strong gradients are equally variable as borders with weak gradients. This is counterintuitive, as the basic hypothesis of image segmentation is that changes of intensity are correlated with changes of labels.
  – Second, borders of the segmented structures are unrealistic, mainly due to their lack of geometric regularity.

Regression Analysis in Clinical Problems

• Linear regression between volume(s)
  – Automated segmentation’s volume vs. manual segmentation’s volume
  – Bland-Altman plot
• Linear regression between visual inspection (raters)
  – Kappa statistics
  – t-test / p-value
• Significantly different volumes? Score?
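The volume comparison can be sketched as an ordinary least-squares line fit plus a correlation coefficient. The function name `volume_regression` and the volumes below are made-up illustrative values, not data from any study cited here.

```python
import numpy as np

def volume_regression(manual, automated):
    """Fit automated = slope * manual + intercept and report Pearson r."""
    manual = np.asarray(manual, dtype=float)
    automated = np.asarray(automated, dtype=float)
    slope, intercept = np.polyfit(manual, automated, deg=1)
    r = np.corrcoef(manual, automated)[0, 1]
    return float(slope), float(intercept), float(r)

# Volumes in mL for four hypothetical subjects.
manual_ml = [10.0, 20.0, 30.0, 40.0]
auto_ml = [11.0, 20.0, 29.0, 38.0]
slope, intercept, r = volume_regression(manual_ml, auto_ml)
print(slope, intercept, r)
```

A slope near 1 and an intercept near 0 indicate volumetric agreement; a high r alone does not, since a consistently biased method can still correlate perfectly.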

Regression Analysis in Clinical Problems

[Figure; axis label: Manual segmentation.]
Vedentham et al., JCIS, 2014.

What is Bland-Altman plot?
• A method of data plotting used in analyzing the agreement between two different assays.
• Claim: any two methods that are designed to measure the same parameter should have good correlation.
  – X-axis: mean of the two measurements
  – Y-axis: difference between the two values
• Good first step in analyzing the data!
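The quantities plotted in a Bland-Altman analysis can be sketched as follows. The bias line and the 1.96 standard deviation limits of agreement are the conventional additions to the plot and are an assumption here, since the slide only names the two axes. The function name `bland_altman` and the measurements are hypothetical.

```python
import numpy as np

def bland_altman(method_1, method_2):
    """Return plot coordinates, the mean difference, and limits of agreement."""
    m1 = np.asarray(method_1, dtype=float)
    m2 = np.asarray(method_2, dtype=float)
    means = (m1 + m2) / 2.0          # x-axis: mean of the two measurements
    diffs = m1 - m2                  # y-axis: difference between the two values
    bias = diffs.mean()
    spread = 1.96 * diffs.std(ddof=1)
    return means, diffs, float(bias), (float(bias - spread), float(bias + spread))

# Paired measurements from two methods (toy data).
automated = [10.0, 20.0, 30.0, 40.0]
manual = [11.0, 19.0, 31.0, 39.0]
means, diffs, bias, limits = bland_altman(automated, manual)
print(bias, limits)
```

Plotting `diffs` against `means` with horizontal lines at the bias and the two limits gives the usual figure; points outside the limits, or a trend in the differences, flag disagreement that a correlation coefficient would hide.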

Bland-Altman Plots (e.g., airway segmentation evaluation)


Xu, Bagci, et al. MedIA, 2015.

New Directions: Sampling Image Segmentations (Le et al, MedIA, 2016)

• Automatically produce plausible image segmentation samples from a single expert segmentation!

• A probability distribution of image segmentation boundaries is defined as a Gaussian Process, which leads to segmentations that are spatially coherent and consistent with the presence of salient borders in the image.

The Gaussian Density

Remark: Gaussian Process (GP)?

[Figures: Gaussian density and Gaussian process illustrations. Credit: Ghahramani]




Sample segmentation contours according to mean inter-sample dice coefficient!

(Top Left) Mean of the GP µ; (Top Middle) Sample of the level set function φ(a) drawn from 𝒢𝒫(µ, Σ); (Others) GPSSI samples. The ground truth is outlined in red, the GPSSI samples are outlined in orange.
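A toy sketch of drawing level set functions from a Gaussian process and reading off segmentation samples. It assumes a plain squared-exponential covariance on pixel coordinates and a Euclidean signed distance as the mean (negative inside the object), so it ignores image intensities and is not the GPSSI covariance. The function name `sample_level_sets`, the grid size, and the kernel parameters are hypothetical.

```python
import numpy as np

def sample_level_sets(mu, coords, length_scale=3.0, sigma=1.0, n_samples=3, seed=0):
    """Draw samples phi ~ GP(mu, Sigma) at the given pixel coordinates."""
    rng = np.random.default_rng(seed)
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    cov = sigma ** 2 * np.exp(-sq_dist / (2.0 * length_scale ** 2))
    # Small diagonal jitter keeps the Cholesky factorization numerically stable.
    chol = np.linalg.cholesky(cov + 1e-6 * np.eye(len(coords)))
    noise = rng.standard_normal((len(coords), n_samples))
    return mu[:, None] + chol @ noise   # one sampled level set per column

# 12x12 grid; the mean is the signed distance to a circle of radius 4.
size = 12
ys, xs = np.mgrid[0:size, 0:size]
coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
centre = (size - 1) / 2.0
mu = np.hypot(coords[:, 0] - centre, coords[:, 1] - centre) - 4.0

phi = sample_level_sets(mu, coords)
masks = (phi < 0).T.reshape(-1, size, size)   # zero level set gives each sample
print(masks.shape, masks.sum(axis=(1, 2)))
```

Because neighbouring pixels are correlated through the covariance, each sampled contour stays spatially coherent instead of flipping voxel by voxel.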

(Left) Signed geodesic distance µ(a) of the ROI with isocontours –45, 0, 45, 100, 200. (Right) One can check that the samples most probably lie in the region delineated by the isocontours µ(a)=±45. The sampled contours are in orange.

Provocative Question?
• Can we evaluate segmentation error without the ground truth?
  – With machine learning support, can we design a classifier which LEARNS segmentation error and adapts itself for better delineation?

Summary
• Segmentation Evaluation
  – Theoretical vs. Empirical
  – Visual Assessment
  – Volumetric Agreement
  – Efficacy (efficiency, accuracy, …)
  – STAPLE
  – New Trends!
  – Segmentation Challenges (choose your project!)

Slide Credits and References
• Credits to: Jayaram K. Udupa of Univ. of Penn., MIPG
• Bagci’s CV Course 2015 Fall.
• K.D. Toennies, Guide to Medical Image Analysis.
• Handbook of Medical Imaging, Vol. 2. SPIE Press.
• Handbook of Biomedical Imaging, Paragios, Duncan, Ayache.
• Suetens, P., Medical Imaging, Cambridge Press.
• Neculai Archip, Ph.D.
• Simon K. Warfield, Ph.D. (See STAPLE Algorithm)
