MEDICAL IMAGE COMPUTING (CAP 5937)
LECTURE 14: Evaluation Framework for Medical Image Segmentation
Dr. Ulas Bagci
HEC 221, Center for Research in Computer Vision (CRCV), University of Central Florida (UCF), Orlando, FL
[email protected] or [email protected]
Spring 2017
Outline
• How to evaluate accuracy of image segmentation?
  – Gold standard ~ surrogate of truth
  – Qualitative
    • Visual
    • Inter- and intra-observer agreement rates
  – Quantitative
    • Volumetric measurements (regression)
    • Region overlaps
    • Shape-based measurements
    • Theoretical comparisons
    • STAPLE, uncertainty guidance, and evaluation without truth
Visual Assessment
Manual image segmentation from the full spectrum of IDEAL MRI data to delineate red: SAT, green: VAT, blue: liver, yellow: pancreas, purple: kidneys. Left to right: water-only, fat-only, in-phase, out-of-phase, fat fraction, and segmented labels from SliceOmatic.
Reference: Assessment of Abdominal Adiposity and Organ Fat with Magnetic Resonance Imaging (chp11).
Inherent Uncertainty
Comparison of glioblastoma multiforme (GBM) segmentation results on an axial slice: semi-automatic segmentation under Slicer (green, left image) and pure manual segmentation (blue, middle image). Egger et al., Nat Sci Rep., 2012.
Inherent Uncertainty
red: endocardium; green: epicardium; yellow: ground truth
Queiros et al., European Heart Journal, 2016.
Segmentation Evaluation
Can be considered to consist of two components:
(1) Theoretical
Study mathematical equivalence among algorithms.
(2) Empirical
Study practical performance of algorithms in specific application domains.
Segmentation Evaluation: Theoretical
Fundamental challenges in segmentation evaluation:
(Ch1) Are major pI (purely image-based) frameworks such as active contours, level sets, graph cuts, fuzzy connectedness, and watersheds truly distinct, or does some level of equivalence exist among them?
(Ch2) How to develop truly distinct methods constituting real advance?
(Ch3) How to choose a method for a given application domain?
(Ch4) How to set an algorithm optimally for an application domain?
Currently, any method A can be shown empirically to be better than any method B, even when they are equivalent.
Segmentation Evaluation: Theoretical
Attributes commonly used by segmentation methods:
(1) Connectedness
(2) Texture
(3) Smoothness of boundary
(4) Gradient / homogeneity
(5) Shape information about object
(6) Noise handling
(7) Optimization employed
(8) Orientedness of boundary
Attributes utilized by well-known delineation models

Model          Connected          Gradient           Texture            Smooth  Shape  Noise     Optimize
Fuzzy con      Yes                Gr = hom affinity  Obj feat affinity  No      No     Scale FC  In RFC
Chan-Vese      No                 No                 Yes                Yes     No     No        Yes
Mum-Shah       No                 No                 Yes                Yes     No     Yes       Yes
KWT snake      Boundary           Yes                No                 Yes     No     No        Yes
MSV LS         Fg when expanding  Yes                No                 No      No     No        No
Live wire      Boundary           Yes                Yes                Yes     User   No        Yes
Act. shape     Yes                No                 No                 No      Yes    No        Yes
Act. app       Yes                No                 Yes                No      Yes    No        Yes
Graph cut      Usually not        Yes                Possible           No      No     No        Yes
Clustering     No                 No                 Yes                No      No     No        Yes
Deep Learning  Yes                Yes                Yes                Yes     Yes    Yes       Yes
Segmentation Evaluation: Empirical
T: A task. Example: Estimating the volume of brain.
B: A body region. Example: Head.
P: Imaging protocol. Example: T2-weighted MR imaging with a particular set of parameters.
Application domain: A particular triple $\langle T, B, P \rangle$.
Q: A set of scenes acquired for a particular application domain $\langle T, B, P \rangle$.
Segmentation Evaluation: Empirical
The segmentation efficacy of a method M in an application domain $\langle T, B, P \rangle$ may be characterized by three groups of factors:
Precision (Reliability): Repeatability taking into account all subjective actions influencing the result.
Accuracy (Validity): Degree to which the result agrees with truth.
Efficiency (Viability): Practical viability of the method.
Validation of Image Segmentation
• Spectrum of accuracy versus realism in reference standard.
• Digital phantoms.
  – Ground truth known accurately.
  – Not so realistic.
• Acquisitions and careful segmentation.
  – Some uncertainty in ground truth.
  – More realistic.
• Autopsy/histopathology.
  – Addresses pathology directly; resolution.
• Clinical data?
  – Hard to know ground truth.
  – Most realistic model.
Slide Credit: N. Archip
Comparison To Higher Resolution
[Figure panels: MRI, Photograph, MRI]
Provided by Peter Ratiu and Florin Talos. Credit: N. Archip
Segmentation Evaluation: Empirical
Precision
Repeatability taking into account all subjective actions that influence the segmentation result.
• Intra-operator variations
• Inter-operator variations
• Intra-scanner variations
• Inter-scanner variations
Inter-scanner variations include variations among scanners of the same brand and of different brands.
Segmentation Evaluation: Empirical
Precision
A measure of precision for method M in a trial that produces $C_M^{O_1}$ and $C_M^{O_2}$ for situation $T_i$ is given by $PR_M^{T_i}$.
[Equation for $PR_M^{T_i}$: garbled in extraction, not reconstructed.]
$T_1$, $T_2$: intra/inter operator
$T_3$, $T_4$: intra/inter scanner
$C_M^{O_1}$, $C_M^{O_2}$ may be binary or fuzzy segmentations.
Segmentation Evaluation: Empirical
Accuracy
The degree to which segmentations agree with true segmentation.
Surrogates of truth are needed.
For any image C acquired for application domain $\langle T, B, P \rangle$:
$C_M^O$ - segmentation of O in C by method M,
$C_{td}$ - surrogate of true delineation of O in C.
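[Figure: diagram within the reference superset $U_d$ showing the true segmentation $C_{td}$ and the segmentation $C_M^O$ by algorithm M, with the regions TP, FP, FN, and TN marked.]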
Segmentation Evaluation: Empirical
$FNVF_M^d = \frac{|C_{td} - C_M^O|}{|C_{td}|}, \quad TPVF_M^d = \frac{|C_{td} \cap C_M^O|}{|C_{td}|}$

$FPVF_M^d = \frac{|C_M^O - C_{td}|}{|U_d - C_{td}|}, \quad TNVF_M^d = \frac{|U_d - C_M^O - C_{td}|}{|U_d - C_{td}|}$

$U_d$: A binary scene representing a reference superset (for example, this may be the body region that is imaged).
$FNVF_M^d$: Amount of tissue truly in O that is missed by M.
$FPVF_M^d$: Amount of tissue falsely delineated by M.
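The four fractions can be computed directly from set operations. The sketch below is a minimal illustration that assumes binary segmentations stored as sets of voxel coordinates, with the result lying inside $U_d$; the function name `delineation_fractions` and the toy 10x10 scene are hypothetical and not part of the lecture.

```python
from typing import Dict, Set, Tuple

Voxel = Tuple[int, int]


def delineation_fractions(
    c_td: Set[Voxel], c_m: Set[Voxel], u_d: Set[Voxel]
) -> Dict[str, float]:
    """Return FNVF, TPVF, FPVF, TNVF for binary segmentations given as voxel sets.

    c_td: surrogate of the true delineation, c_m: result of method M,
    u_d: reference superset (assumed to contain both c_td and c_m).
    """
    background = u_d - c_td
    if not c_td or not background:
        raise ValueError("C_td and U_d - C_td must both be non-empty")
    return {
        "FNVF": len(c_td - c_m) / len(c_td),
        "TPVF": len(c_td & c_m) / len(c_td),
        "FPVF": len(c_m - c_td) / len(background),
        "TNVF": len(u_d - c_m - c_td) / len(background),
    }


# Hypothetical 10x10 scene: the truth is a 4x4 block and the result of M
# is the same block shifted by one column to the right.
u_d = {(r, c) for r in range(10) for c in range(10)}
c_td = {(r, c) for r in range(3, 7) for c in range(3, 7)}
c_m = {(r, c) for r in range(3, 7) for c in range(4, 8)}

fractions = delineation_fractions(c_td, c_m, u_d)
for name, value in fractions.items():
    print(f"{name} = {value:.4f}")
```

In this toy case one column of the object is missed and one column of background is falsely included, so FNVF and TPVF sum to 1, as do FPVF and TNVF.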
Segmentation Evaluation: Empirical
Requirements for accuracy metrics:
(1) Capture M's behavior of trade-off between FP and FN.
(2) Satisfy laws of tissue conservation:
    $FNVF_M^d = 1 - TPVF_M^d$
    $FPVF_M^d = 1 - TNVF_M^d$
(3) Capable of characterizing the range of behavior of M.
(4) Any monotonic function g(FNVF, FPVF) is fine as a metric.
(5) Appropriate for $\langle T, B, P \rangle$.
Segmentation Evaluation: Empirical
Delineation Operating Characteristic (DOC)
[Figure: DOC curve plotting 1 - FNVF against FPVF for brain WM segmentation in PD MR images.]
Each value of parameter vector p of M gives a point on the DOC curve. The DOC curve characterizes the behavior of M over a range of parametric values of M.
$A_M$: Area under the DOC curve.
Segmentation Evaluation: Empirical
Optimally setting an algorithm for $\langle T, B, P \rangle$
[Figure: DOC curve, 1 - FNVF versus FPVF, both axes running from 0 to 1.]
p - parameter vector for method M
$g_p(FPVF, FNVF)$ - monotonic function
$p^* = \arg\min_p \, [g_p(FPVF, FNVF)]$
Set M to operate at $p^*$.
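A DOC curve and the operating point $p^*$ can be illustrated with a toy method. The sketch below assumes that M is simple intensity thresholding with a single parameter (the threshold), and that g is the sum FPVF + FNVF; both choices, the data, and all names are hypothetical and stand in for whatever method and monotonic function an application would use.

```python
from typing import Dict, List, Tuple

# Hypothetical 1D scene: one intensity and one surrogate-truth label per voxel.
intensity: List[int] = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
truth: List[int] = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]


def fractions_at(threshold: float) -> Tuple[float, float]:
    """Segment by intensity >= threshold and return (FPVF, FNVF)."""
    seg = [1 if v >= threshold else 0 for v in intensity]
    fn = sum(1 for s, t in zip(seg, truth) if t == 1 and s == 0)
    fp = sum(1 for s, t in zip(seg, truth) if t == 0 and s == 1)
    n_obj = sum(truth)
    n_bg = len(truth) - n_obj
    return fp / n_bg, fn / n_obj


def g(fpvf: float, fnvf: float) -> float:
    """Assumed monotonic cost: equal weight on both error fractions."""
    return fpvf + fnvf


# Each parameter value gives one point (FPVF, 1 - FNVF) on the DOC curve.
thresholds = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 105]
points: Dict[int, Tuple[float, float]] = {p: fractions_at(p) for p in thresholds}
doc = sorted({(fpvf, 1.0 - fnvf) for fpvf, fnvf in points.values()})

# Area under the DOC curve by the trapezoidal rule.
area = sum((x1 - x0) * (y0 + y1) / 2.0 for (x0, y0), (x1, y1) in zip(doc, doc[1:]))

# Operating point: the parameter value that minimizes g.
p_star = min(thresholds, key=lambda p: g(*points[p]))
print(f"A_M = {area:.4f}, p* = {p_star}")
```

A different g, for example one that penalizes false negatives more heavily, would in general select a different operating point on the same curve.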
Existing Segmentation Data
• Manual segmentation performed by 4 independent experts
• Low-grade glioma
[Figure panels: Original Image, Expert 1, Expert 2, Expert 3, Expert 4]
Expert and Student Segmentations
[Figure panels: Test image, Expert consensus, Student 1, Student 2, Student 3]
Segmentation Evaluation: Empirical
Efficiency
Describes practical viability of a method.
Four factors should be considered:
(1) Computational time for one-time training of M: $t_M^{c_1}$
(2) Computational time for segmenting each scene: $t_M^{c_2}$
(3) Human time for one-time training of M: $t_M^{h_1}$
(4) Human time for segmenting each scene: $t_M^{h_2}$
(2) and (4) are crucial. (4) determines the degree of automation of M.
Segmentation Evaluation: Empirical
Precision:
$PR_M^{T_1}$: intra operator
$PR_M^{T_2}$: inter operator
$PR_M^{T_3}$: intra scanner
$PR_M^{T_4}$: inter scanner
Accuracy:
$FNVF_M^d$: FN fraction for delineation
$FPVF_M^d$: FP fraction for delineation
$A_M$: Area under the DOC curve
Efficiency:
$t_M^{c_1}$: computational time for algorithm training.
$t_M^{c_2}$: computational time for scene segmentation.
$t_M^{h_1}$: operator time for algorithm training.
$t_M^{h_2}$: operator time for scene segmentation.
Remarks
(1) Precision, accuracy, efficiency are interdependent.
    accuracy → efficiency.
    precision and accuracy → difficult.
(2) "Automatic segmentation method" has no meaning unless the results are proven on a large number of data sets with acceptable precision, accuracy, efficiency, and with $t_M^{h_2} = 0$.
(3) A descriptive answer to "is method M1 better than M2 under $\langle T, B, P \rangle$?" in terms of the 11 parameters is more meaningful than a "yes" or "no" answer.
(4) DOC is essential to describe the range of behavior of M.
Velazquez et al., Scientific Reports, 2013.
Shape Based Metrics for Segmentation Evaluation
[Figure: two example segmentations. Example 1: Sensitivity = 94.69%, Specificity = 94.19%. Example 2: Sensitivity = 72.99%, Specificity = 78.16%.]
If you use only DSC (Dice similarity coefficient, an overlap measure), the DSC values are similar in both examples, but the sensitivity and specificity values are not.
Is DSC alone sufficient?
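The point that DSC alone can hide differences is easy to reproduce on synthetic data. The sketch below is an illustration with made-up voxel sets (not the examples shown on the slide): an under-segmentation and an over-segmentation of the same object are constructed so that both have the same DSC, while their sensitivity and specificity differ. All names are hypothetical.

```python
from typing import Dict, Set


def overlap_metrics(truth: Set[int], seg: Set[int], universe: Set[int]) -> Dict[str, float]:
    """Dice, sensitivity, and specificity for binary segmentations given as voxel-index sets."""
    tp = len(truth & seg)
    fp = len(seg - truth)
    fn = len(truth - seg)
    tn = len(universe - truth - seg)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical 40-voxel scene with a 12-voxel object.
universe = set(range(40))
truth = set(range(12))
under_seg = set(range(8))   # misses 4 object voxels, no false positives
over_seg = set(range(18))   # covers the whole object plus 6 background voxels

under = overlap_metrics(truth, under_seg, universe)
over = overlap_metrics(truth, over_seg, universe)
for label, m in (("under", under), ("over", over)):
    print(label, {k: round(v, 4) for k, v in m.items()})
```

Both cases give a DSC of 0.8, yet one misses a third of the object and the other misses nothing, which is why the slide suggests reporting complementary measures.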
Hausdorff Distance
• Can be used as a complementary evaluation metric to the overlap measure, for measuring boundary mismatches.
• The lower the Hausdorff distance (HD), the better the segmentation accuracy.

$HD(A, B) = \max\left( \max_{a \in A} d_B(a), \; \max_{b \in B} d_A(b) \right)$

where $d_B(a) = \min_{b \in B} d(a, b)$ is the distance of a point a on A from B.
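The definition translates almost line by line into code. The sketch below is a brute-force version that assumes boundaries given as lists of 2D points and Euclidean distance for d(a, b); the function names and the example points are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def point_to_set(a: Point, points: Sequence[Point]) -> float:
    """d_B(a): distance from point a to the nearest point of the other boundary."""
    return min(math.hypot(a[0] - b[0], a[1] - b[1]) for b in points)


def hausdorff(boundary_a: Sequence[Point], boundary_b: Sequence[Point]) -> float:
    """HD(A, B): the larger of the two directed worst-case distances."""
    a_to_b = max(point_to_set(a, boundary_b) for a in boundary_a)
    b_to_a = max(point_to_set(b, boundary_a) for b in boundary_b)
    return max(a_to_b, b_to_a)


# Hypothetical boundaries: B equals A except for one outlying point.
A = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
B = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 3.0)]

hd = hausdorff(A, B)
print(f"HD(A, B) = {hd}")
```

A single outlying boundary point determines the result here, which shows why HD is sensitive to local boundary mismatches that an overlap measure would barely register.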
Segmentation Evaluation: STAPLE
• STAPLE (Simultaneous Truth and Performance Level Estimation):
  – An algorithm for estimating performance and ground truth from a collection of independent segmentations.
  – Warfield, Zou, Wells, MICCAI 2002.
  – Warfield, Zou, Wells, IEEE TMI 2004.
  – Publicly available.
  – The STAPLE algorithm (Warfield et al., 2004) is a region formulation for producing consensus segmentations.
  – When foreground is small → weight w is small.
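The idea of estimating truth and rater performance together can be sketched with a small expectation-maximization loop. The code below is a simplified binary illustration in the spirit of STAPLE, not the publicly available implementation: it assumes a single global foreground prior taken from the raters' mean foreground fraction, a fixed number of iterations, and identical initial sensitivity and specificity for every rater. The function name and the toy data are hypothetical.

```python
from typing import List, Tuple


def staple_binary(
    decisions: List[List[int]], iterations: int = 20, init: float = 0.9
) -> Tuple[List[float], List[float], List[float]]:
    """Return (weights, sensitivities, specificities).

    decisions[j][i] is rater j's binary label for voxel i.
    weights[i] is the estimated probability that voxel i is foreground.
    """
    n_raters, n_voxels = len(decisions), len(decisions[0])
    prior = sum(sum(row) for row in decisions) / (n_raters * n_voxels)
    if not 0.0 < prior < 1.0:
        raise ValueError("raters must label both foreground and background")
    eps = 1e-6  # keeps probabilities away from exactly 0 or 1
    p = [init] * n_raters  # sensitivity per rater
    q = [init] * n_raters  # specificity per rater
    weights = [prior] * n_voxels
    for _ in range(iterations):
        # E-step: posterior foreground probability per voxel.
        for i in range(n_voxels):
            a, b = prior, 1.0 - prior
            for j in range(n_raters):
                if decisions[j][i] == 1:
                    a *= p[j]
                    b *= 1.0 - q[j]
                else:
                    a *= 1.0 - p[j]
                    b *= q[j]
            weights[i] = a / (a + b)
        # M-step: re-estimate each rater's sensitivity and specificity.
        w_sum = sum(weights)
        for j in range(n_raters):
            tp = sum(w for w, d in zip(weights, decisions[j]) if d == 1)
            tn = sum(1.0 - w for w, d in zip(weights, decisions[j]) if d == 0)
            p[j] = min(max(tp / w_sum, eps), 1.0 - eps)
            q[j] = min(max(tn / (n_voxels - w_sum), eps), 1.0 - eps)
    return weights, p, q


# Hypothetical data: raters 1 and 2 agree; rater 3 disagrees on voxels 3 and 7.
raters = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 1],
]
weights, sens, spec = staple_binary(raters)
consensus = [1 if w >= 0.5 else 0 for w in weights]
print(consensus, [round(s, 3) for s in sens], [round(s, 3) for s in spec])
```

The prior enters every voxel's weight as a multiplicative factor, which is one way to read the slide's remark that a small foreground leads to a small weight w.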
Segmentation Evaluation: STAPLE
• Segmentations are generated by sampling independently at each voxel.
• However, the produced segmentations may not be realistic, for two reasons.
  – First, the variability of the segmentation does not account for the intensity in the image, so borders with strong gradients are as variable as borders with weak gradients. This is counterintuitive, as the basic hypothesis of image segmentation is that changes of intensity are correlated with changes of labels.
  – Second, borders of the segmented structures are unrealistic, mainly due to their lack of geometric regularity.
Regression Analysis in Clinical Problems
• Linear regression between volume(s)
  – Automated segmentation's volume vs. manual segmentation's volume
  – Bland-Altman plot
• Linear regression between visual inspection (raters)
  – Kappa statistics
  – t-test / p-value
• Significantly different volumes? Score?
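Two of the listed tools are short enough to write out. The sketch below assumes ordinary least squares for the volume regression and Cohen's kappa for two raters as the kappa statistic (the slide does not say which variant is meant); the volumes, ratings, and function names are made up for illustration.

```python
from typing import List, Sequence, Tuple


def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (list(rater_a).count(c) / n) * (list(rater_b).count(c) / n) for c in categories
    )
    return (observed - expected) / (1.0 - expected)


# Hypothetical volumes (arbitrary units): manual vs. automated segmentation.
manual_volume: List[float] = [10.0, 20.0, 30.0, 40.0]
auto_volume: List[float] = [12.0, 21.0, 33.0, 41.0]
slope, intercept = linear_fit(manual_volume, auto_volume)

# Hypothetical visual-inspection scores (1 = acceptable, 0 = not) from two raters.
scores_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
scores_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(scores_a, scores_b)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, kappa = {kappa:.3f}")
```

A slope near 1 with a non-zero intercept, as in this toy data, would suggest a roughly constant offset between the two volume measurements rather than a scaling error.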
Regression Analysis in Clinical Problems
[Figure: regression plot; axis label: Manual segmentation]
Vedentham et al., JCIS, 2014.
What is a Bland-Altman plot?
• A method of data plotting used in analyzing the agreement between two different assays.
• Claim: any two methods that are designed to measure the same parameter should have good correlation.
  – X-axis: mean of the two measurements
  – Y-axis: difference between the two values
• Good first step in analyzing the data!
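The coordinates of a Bland-Altman plot are simple to compute. The sketch below uses made-up volume pairs and hypothetical names; the bias line and the 1.96-standard-deviation limits of agreement are the conventional additions to such a plot and are not stated on the slide.

```python
import statistics
from typing import List, Tuple


def bland_altman(
    method_a: List[float], method_b: List[float]
) -> Tuple[List[float], List[float], float, float, float]:
    """Return (means, differences, bias, lower limit, upper limit)."""
    means = [(a + b) / 2.0 for a, b in zip(method_a, method_b)]   # x-axis values
    diffs = [b - a for a, b in zip(method_a, method_b)]           # y-axis values
    bias = statistics.mean(diffs)
    spread = 1.96 * statistics.stdev(diffs)  # sample standard deviation
    return means, diffs, bias, bias - spread, bias + spread


# Hypothetical volumes (arbitrary units): manual vs. automated segmentation.
manual_volume = [10.0, 20.0, 30.0, 40.0]
auto_volume = [12.0, 21.0, 33.0, 41.0]

means, diffs, bias, lower, upper = bland_altman(manual_volume, auto_volume)
print("x (means):", means)
print("y (differences):", diffs)
print(f"bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")
```

Plotting `diffs` against `means` with horizontal lines at `bias`, `lower`, and `upper` gives the usual figure; a bias far from zero indicates a systematic offset between the two methods.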
Bland-Altman Plots (e.g., airway segmentation evaluation)
Xu, Bagci, et al. MedIA, 2015.
New Directions: Sampling Image Segmentations (Le et al., MedIA, 2016)
• Automatically produce plausible image segmentation samples from a single expert segmentation!
• A probability distribution of image segmentation boundaries is defined as a Gaussian process, which leads to segmentations that are spatially coherent and consistent with the presence of salient borders in the image.
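Drawing one sample from a Gaussian process can be sketched in a few lines. The code below is a generic 1D illustration, not the sampling method of Le et al.: it assumes a squared-exponential covariance, a mean equal to a hypothetical signed distance to a boundary, and the convention that non-positive values of the sampled function are labeled as object. All names and parameter values are hypothetical.

```python
import math
import random
from typing import List


def covariance(xs: List[float], variance: float = 1.0, length: float = 1.0,
               jitter: float = 1e-8) -> List[List[float]]:
    """Squared-exponential covariance matrix with a small diagonal jitter."""
    return [
        [variance * math.exp(-((a - b) ** 2) / (2.0 * length ** 2))
         + (jitter if i == j else 0.0)
         for j, b in enumerate(xs)]
        for i, a in enumerate(xs)
    ]


def cholesky(k: List[List[float]]) -> List[List[float]]:
    """Lower-triangular factor L with L L^T = K (K must be positive definite)."""
    n = len(k)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][m] * low[j][m] for m in range(j))
            if i == j:
                low[i][j] = math.sqrt(k[i][i] - s)
            else:
                low[i][j] = (k[i][j] - s) / low[j][j]
    return low


# Hypothetical 1D grid; mean is the signed distance to a boundary at x = 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
mu = [x - 2.0 for x in xs]

K = covariance(xs)
L = cholesky(K)
rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in xs]

# One sample of the level set function: phi = mu + L z.
phi = [m + sum(L[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mu)]
# Assumed convention: phi <= 0 means the grid point belongs to the object.
segmentation = [1 if v <= 0.0 else 0 for v in phi]
print([round(v, 3) for v in phi], segmentation)
```

Because neighboring grid points are correlated through the covariance, the sampled values vary smoothly, which is the property that makes sampled boundaries spatially coherent rather than independent per voxel.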
The Gaussian Density
Remark: Gaussian Process (GP)?
Credit: Ghahramani
Sample segmentation contours according to mean inter-sample Dice coefficient!
(Top left) Mean of the GP µ; (top middle) sample of the level set function φ(a) drawn from 𝒢𝒫(µ, Σ); (others) GPSSI samples. The ground truth is outlined in red; the GPSSI samples are outlined in orange.
New Directions: Sampling Image Segmentations (Le et al, MedIA, 2016)
(Left) Signed geodesic distance µ(a) of the ROI with isocontours –45, 0, 45, 100, 200. (Right) One can check that the samples most probably lie in the region delineated by the isocontours µ(a)=±45. The sampled contours are in orange.
Provocative Question?
• Can we evaluate segmentation error without the ground truth?
  – With machine learning support, can we design a classifier that LEARNS segmentation error and adapts itself for better delineation?
Summary
• Segmentation Evaluation
  – Theoretical vs. Empirical
  – Visual Assessment
  – Volumetric Agreement
  – Efficacy (efficiency, accuracy, …)
  – STAPLE
  – New Trends!
  – Segmentation Challenges (choose your project!)
Slide Credits and References
• Credits to: Jayaram K. Udupa of Univ. of Penn., MIPG
• Bagci's CV Course 2015 Fall.
• K.D. Toennies, Guide to Medical Image Analysis.
• Handbook of Medical Imaging, Vol. 2. SPIE Press.
• Handbook of Biomedical Imaging, Paragios, Duncan, Ayache.
• Suetens, P., Medical Imaging, Cambridge Press.
• Neculai Archip, Ph.D.
• Simon K. Warfield, Ph.D. (See STAPLE Algorithm)