Validation of automatic bone age determination in children with congenital adrenal hyperplasia
David D. Martin & Katharina Heil & Conrad Heckmann & Angelika Zierl & Jürgen Schaefer & Michael B. Ranke & Gerhard Binder
Received: 7 September 2011 / Revised: 6 April 2013 / Accepted: 22 April 2013 / Published online: 5 October 2013
© Springer-Verlag Berlin Heidelberg 2013
Abstract
Background Determination of bone age is routinely used for following up substitution therapy in congenital adrenal hyperplasia (CAH) but today is a procedure with significant subjectivity.
Objective The aim was to test the performance of automatic bone age rating by the BoneXpert software package in all radiographs of children with CAH seen at our clinic from 1975 to 2006.
Materials and methods Eight hundred and ninety-two left-hand radiographs from 100 children aged 0 to 17 years were presented to a human rater and BoneXpert for bone age rating. Images where ratings differed by more than 1.5 years were each rerated by four human raters.
Results Rerating was necessary in 20 images and the rerating result was closer to the BoneXpert result than to the original manual rating in 18/20 (90 %). Bone age rating precision based on the smoothness of longitudinal curves comprising a total of 327 data triplets spanning less than 1.7 years showed BoneXpert to be more precise (P
Treatment of children with CAH involves regular measurement of 17-hydroxyprogesterone as well as bone age and height determination. Some authors consider bone age to be the most useful follow-up parameter after body height, partly because 17-hydroxyprogesterone responds quickly to medication and thus gives no indication of a patient's long-term compliance.
Rating bone age in CAH can require greater than average expertise, especially when bone age is extremely advanced. In times of demands for higher productivity in radiology, radiologic expertise is becoming an increasingly precious resource, and the advance of computerized bone age rating methods is therefore welcomed by many as a potential means of relieving radiologists from analyzing large numbers of unremarkable radiographs. The CE (European Conformity)-marked medical device used for this purpose is BoneXpert (Visiana, Holte, Denmark), a software package that can be used on one image at a time in daily routine or, alternatively, to analyze large quantities of hand radiographs in an unattended batch job. Its rating reliability in healthy populations has been demonstrated. Furthermore, the automated method has been successfully used for children with short stature of various diagnoses or with central precocious puberty. However, to our knowledge it has not been validated in children exposed to androgens since early gestation, as is the case in children with CAH. Further, our cohort of children with CAH includes some children with extremely advanced bone age because they had emigrated from countries where screening for CAH was not institutionalized at the time of their birth. These children are particularly challenging in terms of management, starting with their bone age reading. We were interested to test whether the automated method could cope with radiographs from such children.
The purpose of this study was to determine how automated bone age rating performs on radiographs of children with CAH compared with human bone age raters.
Materials and methods
Eight hundred and ninety-two left-hand radiographs from 100 children and adolescents aged 0 to 17 years with a diagnosis of CAH who had been treated at our clinic during the period from January 1, 1975, to December 31, 2006, were included in the study. If the films were not already available in digital form as DICOM images, they were scanned with a Vidar Diagnostic Pro Advantage scanner (Vidar, Herndon, VA). The study was approved by the Tübingen University Hospital Ethics Committee.
Automatic bone age rating was performed using BoneXpert version 1.0 (Visiana, Holte, Denmark, www.BoneXpert.com). BoneXpert calculates bone age based on the shape and appearance of 13 bones of the hand: the phalanges and metacarpals of the first, third and fifth ray and the radius and ulna. The program rejects a bone if its shape is abnormal or if its bone age deviates by more than 2.4 years from the mean of all bones. Furthermore, if fewer than 8 bones are accepted, the entire radiograph is rejected for bone age analysis. The intended Greulich-Pyle bone age range for the automatic rating in its current version is 2 to 15 years for girls, and 2.5 to 17 years for boys. A more detailed description of the calculation methods employed by BoneXpert has been published elsewhere [17, 19].
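The acceptance rule described above can be illustrated with a minimal Python sketch. It is not BoneXpert's implementation: the function name `accept_bones` and the single-pass comparison against the mean of all bones are assumptions, and the shape-abnormality check mentioned in the text is not modelled. Only the two thresholds (2.4 years, 8 bones) are taken from the description.

```python
from statistics import mean

MAX_DEVIATION_Y = 2.4    # per-bone tolerance stated in the text (years)
MIN_ACCEPTED_BONES = 8   # minimum number of accepted bones stated in the text


def accept_bones(bone_ages):
    """Apply the deviation rule to per-bone bone ages.

    bone_ages: dict mapping a bone label to its bone age in years.
    Returns the dict of accepted bones, or None if the whole radiograph
    would be rejected. Assumption: one pass against the mean of all bones;
    the shape check is omitted.
    """
    overall = mean(list(bone_ages.values()))
    accepted = {
        bone: age
        for bone, age in bone_ages.items()
        if abs(age - overall) <= MAX_DEVIATION_Y
    }
    if len(accepted) < MIN_ACCEPTED_BONES:
        return None
    return accepted
```

With 13 bones of which one is a clear outlier, the sketch drops that bone and keeps the radiograph; if the bones split into two widely separated groups, too few remain and the radiograph is rejected.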
In 35 examinations, the image was available both as a DICOM file and as a printout on film of the DICOM image. A comparison of the automated bone age ratings of the DICOM files and of the scanned film printouts yielded differences no greater than 0.5 years, of which 18 were greater than 0.2 years. The mean difference was 0.0 years, i.e. there was no trend for the scanned films to yield older or younger bone age values than the DICOM images. The DICOM file was used whenever one was available.
Since an objective measure of bone age does not exist, we must resort to various indirect ways of assessing the validity (accuracy) of a new rating method. Thodberg et al. circumvented the problem of the lack of an objective bone age by judging the accuracy of bone age rating on the basis of its ability to predict an objective parameter to which it is related, namely final height. However, this approach is not advisable in children whose final height may be affected by subsequent treatment. In this paper, we look for deviations from the results obtained with the manual method in terms of bias and standard deviation. This is an example of analysis of agreement between two measurements, so it was natural to use Bland-Altman plots and, in accordance with this concept, to use the standard deviation rather than the correlation. We tested whether the bias was significantly different from zero (a qualitative result) and reported the standard deviation as a quantitative endpoint.
Note that the standard deviation of the differences between the two methods is usually larger than the standard deviation about the line of fit in a Bland-Altman plot, because it ignores the bone age-related slope between the two methods.
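The agreement statistics described above can be sketched in a few lines of Python. This is an illustration only, not the JMP or R code used in the study; the function name `bland_altman` is hypothetical. It computes the bias (mean signed difference), the sample standard deviation of the differences, and the least-squares slope of the difference against the mean of the two methods.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Agreement statistics for paired ratings a and b (same units).

    Returns the bias (mean of a - b), the sample SD of the signed
    differences, and the slope of the difference regressed on the
    mean of the two methods.
    """
    diffs = [x - y for x, y in zip(a, b)]
    means = [(x + y) / 2 for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    m_bar = mean(means)
    sxx = sum((m - m_bar) ** 2 for m in means)
    if sxx == 0:
        raise ValueError("all pair means are identical; slope is undefined")
    sxy = sum((m - m_bar) * (d - bias) for m, d in zip(means, diffs))
    return {"bias": bias, "sd": sd, "slope": sxy / sxx}
```

Called as `bland_altman(bxba, manba)` with two equally long lists of ratings in years, the result corresponds to the mean difference BXBA − ManBA, its SD, and the slope of the line of fit.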
An additional, quantitative test was included as follows: Images for which BoneXpert bone age (BXBA) and manual bone age (ManBA) differed by more than 1.5 years were rerated by four experienced raters. Rerating was done without knowledge of ManBA, BXBA and chronological age. The mean of the four bone age values, referred to as the ReferenceBA, was compared anew with BXBA and ManBA.
1616 Pediatr Radiol (2013) 43:1615–1621
The same approach has been used previously [17, 18]. ReferenceBA acts as a secondary and very reliable outcome. The performance of BXBA and ManBA in terms of their deviation from ReferenceBA was calculated using the 1-sample proportions test with continuity correction, taking into account that only observations from different subjects were statistically independent.
By contrast, there is no need for an objective measure of bone age to determine the precision of a new bone age rating method, i.e. its ability to generate reproducible results. One way to determine this is to study the smoothness of the longitudinal curves obtained. In the present study, the precision of automatic and manual rating was assessed in terms of the smoothness of longitudinal curves using the triplet method. This involves breaking down individual longitudinal bone age series with n bone age measurements into n − 2 triplets of consecutive bone age measurements and considering, for each triplet, the residual between the middle bone age and the linear interpolation between the two measurements on either side. The triplet method assumes that the three bone age measurements lie on a straight line as a function of age if there is no precision error. Hence, any deviation from the line is interpreted as due to precision error of the bone age method. Since in reality many more factors lead to a deviation from a straight line, the triplet method yields an upper limit of the true precision error. To improve the estimate, we only considered triplets spanning less than 1.7 years. This left us with 327 triplets out of the original 604 (54 %). The result of the precision analysis is both quantitative and qualitative. The quantitative result was the estimated value of the precision of the automated and manual methods given with confidence intervals. In addition, we compared the precision of manual and automated methods and tested whether they were significantly different, i.e. a qualitative result.¹
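The triplet construction described above can be sketched as follows. This is a minimal Python illustration, not the authors' code: the function names are hypothetical, and the step that converts residuals into a precision estimate is an assumption that is not spelled out in the text. It assumes independent rating errors of equal variance σ², in which case a residual with interpolation weights w0 and w2 has variance σ²(1 + w0² + w2²).

```python
from math import sqrt


def triplet_residuals(ages, bone_ages, max_span=1.7):
    """Residuals of the middle rating from the line through its neighbours.

    ages and bone_ages are one child's chronologically ordered series.
    A series of n measurements yields n - 2 candidate triplets; only
    those spanning less than max_span years are kept.
    Returns a list of (residual, variance_factor) pairs.
    """
    out = []
    for i in range(len(ages) - 2):
        t0, t1, t2 = ages[i:i + 3]
        y0, y1, y2 = bone_ages[i:i + 3]
        span = t2 - t0
        if span <= 0 or span >= max_span:
            continue
        w2 = (t1 - t0) / span   # weight of the later neighbour
        w0 = 1.0 - w2           # weight of the earlier neighbour
        interpolated = w0 * y0 + w2 * y2
        # Assumption: independent, equal-variance errors on all three ratings
        out.append((y1 - interpolated, 1.0 + w0 ** 2 + w2 ** 2))
    return out


def precision_sd(triplets):
    """Pooled estimate of the single-rating SD from (residual, factor) pairs."""
    if not triplets:
        raise ValueError("no triplets within the allowed span")
    return sqrt(sum(r * r / f for r, f in triplets) / len(triplets))
```

For three equally spaced visits the variance factor is 1.5, so a residual of 0.3 years corresponds to a single-rating SD of about 0.24 years under these assumptions.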
To further illustrate the behaviour and precision of automated bone age compared with manual bone age rating in children with extremely deviant bone age, we plotted the longitudinal course of three children whose skeletal maturity was particularly advanced. For this we selected, from the subset of children of whom at least 12 images were available, the three with the most advanced bone age (mean of ManBA and BXBA minus CA) at their first visit.
Statistical calculations were performed using the JMP 9 software package (SAS Institute Inc., Cary, NC) and version 2.12.1 of the R statistics software (www.r-project.org).
Analysis of rejected images
One hundred sixteen of the 892 images were rejected by BoneXpert. In 111 of these, the rejection was due to bone age being below BoneXpert's specified rating range, i.e. ManBA below 2.0 years in girls or below 2.5 years in boys. Of the remaining five images, three were rejected due to poor image quality and one due to improper scanning; one rejection remains unexplained and thus represents the inefficiency of the automated method.
Comparison between ManBA and BXBA
For the 776 images (480 from girls, 296 from boys; Table 1) analyzed by automated bone age, the mean difference BXBA − ManBA was 0.02 years (N.S.), the slope of the line of fit in a Bland-Altman plot was negligible at 0.02 years/year and the standard deviation (SD) of the signed differences was 0.72 years. The mean of the absolute (unsigned) differences was 0.54 years (SD 0.40 years). In 20 images, the absolute difference between BXBA and ManBA was greater than 1.5 years (Fig. 1).
These images were submitted for blind rerating by four raters (two radiologists and two pediatric endocrinologists, all having between 10 and 30 years of practice in bone age rating). None of the rerated images showed an absolute difference between ReferenceBA and BXBA greater than 1.5 years. ReferenceBA was closer to ManBA in 2 images and closer to BXBA in 18 (Table 2). To estimate the statistical significance of this observed advantage of BXBA, we note that the 20 rerated images are from ten children, so we have ten rather than 20 independent observations, and the two cases where ManBA is better than BXBA occur in children with three visits, so for these two children, ManBA was better than BXBA in one-third of the visits. Thus, we have observed BXBA to be better than ManBA in 9.33/10 independent cases
Table 1 Characteristics at the time of the first radiograph (children whose radiographs were accepted by BoneXpert): mean chronological age and manual bone age (ManBA) by gender

Gender   Number of children   Mean age ± SD (y)   Mean ManBA ± SD (y)
Boys     41                   5.0 ± 3.4           7.2 ± 5.3
Girls    52                   4.3 ± 3.2           5.0 ± 3.4
¹ The triplets formed from a child are not statistically independent because each visit can participate in up to three triplets. The confidence interval (CI) of the estimated precision was therefore estimated by a Monte Carlo technique, which showed that the CIs are a factor of 1.3 larger than one would derive by assuming that the triplets were independent.
(93 %). We now take as null hypothesis that ManBA and BXBA are equally close to the ReferenceBA, and a proportion test then shows that BXBA is closer to ReferenceBA than is ManBA with P=0.02. (The test for the proportion 1 in 10 gives P=0.027 and for 0 in 10 gives P=0.004, and the quoted p-value is the interpolation to the proportion 0.67 in 10.) Notice also that in this computation of statistical significance it is irrelevant how many images and subjects there were in the total study, prior to selecting the disputed cases.
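The arithmetic behind the quoted p-value can be retraced with a short Python sketch. It is an illustration under stated assumptions rather than the authors' R session: the function names are hypothetical, the test is written in its usual chi-square form with a continuity correction of 0.5, and the interpolation between the two integer counts is assumed to be linear, which the text does not specify.

```python
from math import erfc, sqrt


def prop_test_p(successes, n, p0=0.5):
    """Two-sided one-sample proportion test with continuity correction."""
    expected = n * p0
    # Continuity correction: shrink |observed - expected| by 0.5, not below 0
    deviation = max(abs(successes - expected) - 0.5, 0.0)
    chi2 = deviation ** 2 / (n * p0 * (1.0 - p0))
    # Upper tail probability of a chi-square variable with 1 degree of freedom
    return erfc(sqrt(chi2 / 2.0))


def interpolated_p(count, n, p0=0.5):
    """P-value for a non-integer count (assumption: linear interpolation)."""
    lower = int(count)          # floor, valid for non-negative counts
    fraction = count - lower
    p_lower = prop_test_p(lower, n, p0)
    if fraction == 0:
        return p_lower
    p_upper = prop_test_p(lower + 1, n, p0)
    return p_lower + fraction * (p_upper - p_lower)
```

Here the count is the number of independent cases in which ManBA was closer, 10 − 9.33 ≈ 0.67 (one-third of a case from each of the two children with three visits). `prop_test_p(1, 10)` and `prop_test_p(0, 10)` give approximately 0.027 and 0.004, and `interpolated_p(2 / 3, 10)` rounds to 0.02, in line with the values quoted above.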
Our analysis of bone age rating precision based on the smoothness of longitudinal curves comprised a total of 327 data triplets spanning less than 1.7 years for both ManBA and
Table 2 Bone age rating results for the 20 images with a difference between the original manual rating (ManBA) and automatic rating (BXBA) greater than 1.5 years. These were subsequently rerated (ReferenceBA)

ID    Image no.  Gender  Age (y)  ReferenceBA (y)  ManBA (y)  BXBA (y)  ManBA−ReferenceBA (y) #  BXBA−ReferenceBA (y) #  BXBA worse
3187  7          boy     5.7      10.1             11.5       9.4       1.4                      −0.7
3187  9          boy     7.0      11.2             12.5       10.8      1.3                      −0.4
3242  7          boy     6.3      10.0             11.5       9.6       1.5                      −0.4
3242  9          boy     7.2      9.9              12.8       9.6       2.9                      −0.3
3752  6          boy     9.9      14.7             17.0       14.3      2.3                      −0.4
4326  16         girl    14.4     13.5             15.0       13.3      1.5                      −0.2
5131  11         girl    6.7      6.8              5.0        6.6       −1.8                     −0.2
5131  14         girl    8.9      10.0             8.8        10.5      −1.2                     0.5
5818  9          girl    5.9      5.4              4.5        6.4       −0.9                     1.0                     X
5818  10         girl    6.5      6.7              5.0        6.8       −1.7                     0.1
5818  20         girl    11.6     10.9             9.8        11.5      −1.2                     0.6
5989  7          girl    14.3     16.8             18.0       15.9      1.2                      −0.9
7653  1          boy     2.4      5.3              2.8        5.8       −2.6                     0.5
7653  3          boy     7.8      6.3              4.5        6.9       −1.8                     0.6
7653  4          boy     9.1      7.0              5.8        7.4       −1.3                     0.4
7653  5          boy     10.3     7.3              6.3        7.9       −1.1                     0.6
8253  3          boy     5.6      12.9             14.0       12.4      1.1                      −0.5
9516  2          girl    2.6      5.4              6.0        4.4       0.6                      −1.0                    X
9516  3          girl    4.1      5.8              7.0        4.7       1.2                      −1.1
9516  13         girl    8.0      7.9              9.0        7.3       1.1                      −0.6

Mean of signed differences (SD)                                         0.1 (1.6)                0.0 (0.6)
Mean of absolute differences (SD)                                       1.5 (0.6)                0.5 (0.3)
p-value for BXBA being more accurate than ManBA, relative to ReferenceBA: P=0.02

# Differences to ReferenceBA larger than 1.5 are highlighted in bold
Images where ReferenceBA was closer to ManBA than to BXBA are marked with X
BXBA was significantly closer to ReferenceBA than was ManBA
Fig. 1 Bland-Altman plot of the relation between automatic (BXBA) and the original manual rating (ManBA) in boys (circles) and girls (dots). The difference between the two methods is shown against the mean of the two methods. The dotted lines indicate the range of discrepancy of ±1.5 years; images outside this range are analyzed in Table 2
BXBA. The following precision results were obtained: ManBA: 0.32 years (95 % CI: 0.29–0.35); BXBA: 0.21 years (95 % CI: 0.19–0.23). This indicates a significant difference in precision (P
Rerating results: implications for automated and manual rating
In 20 images, the discrepancy between manual and automatic rating was greater than 1.5 years. These were each blindly rerated by four independent raters. The mean of these four ratings (ReferenceBA) deviated by less than 1.5 years from BXBA for all images.
We can conclude from our results that the original discrepancies in the 20 rerated images were due more to manual errors than to errors in automa...