
Computer Methods and Programs in Biomedicine 58 (1999) 35–50

CORA—A knowledge-based system for the analysis of case-control studies

Katharina Morik a, Iris Pigeot b,*, Ursula Robers c

a Department of Computer Science VIII, University of Dortmund, D-44221 Dortmund, Germany
b Institute of Statistics, University of Munich, Ludwigstrasse 33, D-80539 München, Germany

c University of Dortmund, Joseph-von-Fraunhofer-Strasse 20, D-44227 Dortmund, Germany

Received 17 May 1996; accepted 20 April 1998

Abstract

Carrying out a statistical analysis, the researcher is concerned with the problem of choosing an appropriate statistical technique from a large number of competing methods. Most common statistical software offers different methods for analysing the data without giving any support regarding the adequacy of a method for a particular data set. This paper outlines the main features of the computer system CORA, which provides a statistical analysis of stratified contingency tables and additionally supports the researcher at the different steps of this analysis. Here, the support given by the system consists of two different aspects. On the one hand, the help system of CORA contains general information on the implemented statistical methods, which can be obtained on request. On the other hand, an advice tool recommends an adequate statistical method depending on the actual empirical case-control data to be analysed. To build up the advice tool, a set of rules discovered by machine learning from simulation studies is integrated into the system CORA. © 1999 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Case-control studies; Knowledge-based system; Machine learning; Odds ratio; Simulation studies; Statistical methods

1. Introduction

The aim of epidemiological case-control studies is to investigate possible associations between a potential risk factor and a certain disease. Carrying out such a study, the numbers of people having the disease or not and being exposed or not are recorded. For further statistical analyses, these four absolute frequencies are usually ordered in a 2×2 table. A quantifying statistical measure for the investigated association is the so-called odds ratio. It can be interpreted as the factor by which the risk of disease increases if a

* Corresponding author. Tel.: +49 89 21803522; e-mail: [email protected]

0169-2607/99/$ - see front matter © 1999 Elsevier Science Ireland Ltd. All rights reserved.

PII S0169-2607(98)00065-0


person is exposed to the risk factor of interest. Let us denote the probability of a case being exposed by p1 and of a control by p0. Then, the odds ratio is defined as p1(1−p0)/[p0(1−p1)]. Typically, there are additional confounding variables that are associated with the risk factor and also have an influence on the disease of interest. These confounders have to be controlled to ensure that the odds ratio reflects only the influence of the risk factor. One possibility for controlling such confounders consists in a stratification of the data according to the categories of the confounder, where a 2×2 table is constructed for each category. If the confounder is controlled, i.e. all individual odds ratios are equal (homogeneity), a so-called common odds ratio ψ is to be estimated from the data. A great variety of estimators of the common odds ratio with different statistical properties is available, see [1–5]. The behaviour of the estimators heavily depends on the characteristics of the case-control data to be analysed.
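To make the definition concrete, the odds ratio of a single 2×2 table can be computed from the four cell counts. The following sketch (our illustration, not part of CORA) estimates p1 and p0 by the observed proportions, so the result reduces to the familiar cross-product ratio ad/(bc):

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table with cells
              exposed  unexposed
    cases        a         b
    controls     c         d
    Estimates p1 = a/(a+b) and p0 = c/(c+d) and returns
    p1*(1-p0) / (p0*(1-p1)), which simplifies to (a*d)/(b*c)."""
    p1 = a / (a + b)   # estimated probability that a case is exposed
    p0 = c / (c + d)   # estimated probability that a control is exposed
    return (p1 * (1 - p0)) / (p0 * (1 - p1))

print(odds_ratio(20, 80, 10, 90))  # (20*90)/(80*10) = 2.25
```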

The researcher is thus confronted with the problem of choosing an adequate estimator out of the large pool of competing techniques. To cope with this problem, the computer system CORA (combined odds ratio analysis) was developed, which assists the researcher in analysing the data. This support is achieved by two system components: an enlarged hypertext help system and a knowledge-based advice tool. The help system, which offers information about CORA and how to use it, is extended by some general aspects concerning a stratified contingency table analysis as well as the statistical properties of the implemented estimators and some other statistical methods used in the analysis. The advice tool suggests an appropriate estimator for the common odds ratio. This recommendation depends on the characteristics of the underlying case-control data entered into the system. In contrast, the information given by the help system, although it is context-sensitive, does not consider the actual characteristics of the data to be analysed.

In order to build up the advice tool we have applied machine learning. The finite-sample behaviour of statistical methods can be determined by simulation studies, which are valuable if an analytical investigation would be too complicated or even impossible. From the results of these studies, represented in a knowledge base, a characterization of estimators can be learned. The resulting set of rules then underlies the advice tool.

The presented way of constructing such systems can be seen as a general approach which is applicable in many fields of research: by means of machine learning one can discover knowledge in simulation studies and then integrate this knowledge into a system which supports the user with recommendations (guidelines) on how to proceed.

Knowledge-based techniques have been widely used when developing intelligent statistical software. But most of the statistical expert systems that have been worked out since the mid-1980s fell back on a transfer view of knowledge acquisition. This transfer view is based on the assumption that, with the help of tools, namely expert system shells, knowledge can easily be carried over from the expert to the system. Almost all of these systems failed in practice. This false assessment of the knowledge acquisition process may be one reason for their failure.

In [6], Morik proposed focusing on modelling the knowledge and created a different view of knowledge acquisition. With the help of tools assisting this modelling process, knowledge can be discovered and revised. Thus, it can be made available for knowledge-based systems (KBS) in a more appropriate way. Hence, KBS can now be developed that differ essentially from the first generation of expert systems regarding the quality of the implemented knowledge.

The outline of this paper is as follows. First we describe some basic aspects of the system CORA: the domain of the system, the expertise required of the user, as well as the overall system architecture of CORA. In the third section we take a closer look at the advice tool. We present the particular steps of the above-mentioned modelling process needed to build up this tool. Section 4 illustrates the implementation of CORA, that is, the design of the advice tool, which uses the results of the modelling process, the design of the analysis tool, which comprises all statistical procedures to handle the data, and the help system. Some screendumps are presented to show the


layout of the user interface. In Section 5 we discuss our approach and the lessons learned while developing CORA. In addition, we suggest some main directions for future research to improve and to enlarge our approach.

2. Some basic features of CORA

In this section we describe the domain of the system CORA. For developing an appropriate system it is also necessary to take a look at the potential users and their statistical knowledge. Finally, we present the overall architecture of the system.

CORA is restricted to a rather small domain, i.e. to the analysis of stratified 2×2 contingency tables for the purpose of evaluating case-control studies as outlined in the introduction. However, the domain is hard to handle for a researcher, because different kinds of expertise are required. He/she needs the medical and epidemiological background to design the study and to collect the data, but additionally statistical knowledge is required, especially when analysing the data.

In a contingency table analysis there are various decisions to be made by the researcher. Here, the choice of an estimator for the common odds ratio is of particular importance. For this, the researcher should not only know the different types of estimators, but also their asymptotic and finite-sample properties. Especially the latter strongly depend on the characteristics of the data he/she wants to analyse. If the researcher does not know the relationships between the characteristics of the data and the properties of the estimators, unfavourable selections may be the consequence. Here, CORA should assist this selection process by providing the user with this kind of expertise. We integrated this expertise into the help system and especially into the advice tool of CORA, such that the system can give a recommendation regarding the choice of an estimator.

As mentioned before, the advice tool contains a set of learned rules. The system examines the characteristics of the present data and, if possible, fires a suitable rule that proposes an estimator. That means there is no complex inference process, but a single-stage decision process: only one rule is applied. We should emphasize that this tool has only an advisory capacity, i.e. the user of the system is not obliged to follow the recommendation of the tool; the final selection is the responsibility of the user.

We have developed a uniform, user-friendly graphical interface for all components of CORA. The statistical procedures are also part of the system: they form the analysis tool. Thus, CORA is not an intelligent interface for an existing statistical software package (a so-called front-end) but a stand-alone system. The lower part of Fig. 1 illustrates the overall system architecture of CORA.

Fig. 1. Integration of the modelling process into the system architecture of CORA.


3. Knowledge acquisition: a modelling process

The approach presented here emphasizes the process of modelling the expertise. The simulation studies as well as the experts themselves serve as knowledge sources for this process. The acquisition of the expertise is supported by MOBAL [7]. This system combines knowledge acquisition and machine learning, hence additional knowledge can be derived. First, we have to build up a model, the knowledge base, which represents the results of the simulation studies and thus makes them available for MOBAL. From this model represented in the knowledge base, a characterization of the examined point estimators can be learned. This characterization, consisting of a set of rules, relates the characteristics of the data to the behaviour of the estimators. The modelling process and the integration of its results into the system architecture of CORA are depicted in Fig. 1. Let us stress that the main goal of using machine learning is not to objectify the knowledge acquisition process; rather, it is seen as a chance to discover new knowledge and thus to improve the knowledge base.

A closer look at the cyclical modelling process reveals different steps. It starts with outlining a framework for the model. In this framework, the relevant characteristics of the data, i.e. the parameter constellations, and the criteria for the assessment of the estimators have to be determined and classified. Such a classification is necessary since it would not be useful to look for rules which are only applicable in very special situations. After evaluating the model built up so far, possible revisions can be made.

From the classified assessment criteria, the suitability of an estimator in a certain parameter constellation can be derived. In addition to this suitability, we ascertain a ranking for each estimator with respect to the assessment criteria. Based on this ranking, it is possible to state which estimator is best in a given constellation. Thus it can later be recommended for a data situation with similar characteristics. According to the suitability of the estimators, the recommendations can be classified. Such a classification is necessary since there may be, for instance, data constellations in which even the best estimator behaves poorly.

The next step of the modelling process concerns the representation of the model. Here, MOBAL provides a first-order logic, strictly speaking a function-free Horn clause logic. Thus, the items in the knowledge base are mainly facts and rules, apart from some other representational structures like rule models (Section 3.3) as well as sorts and a topology of predicates [7]. Because the latter two are not relevant for the representation of our model, they are not discussed further.

The steps of the modelling process are described below in more detail.

3.1. Modelling the domain knowledge

Monte-Carlo studies are carried out to investigate and compare the performance of some estimators. In the simulation study to be evaluated [8], the following six estimators of the common odds ratio have been examined: the Mantel–Haenszel estimator ψ̂_MH, the Woolf estimator ψ̂_W, and corresponding jackknifed versions, that is, the jackknifed (type I) logarithm of the Mantel–Haenszel estimator J_I^ln(MH), the jackknifed (type I) Woolf estimator J_I^W, the jackknifed (type II) Mantel–Haenszel estimator J_II^MH, and the jackknifed (type II) logarithm of the Mantel–Haenszel estimator J_II^ln(MH). For a description of the different jackknife approaches see for instance [9].

For the design of the simulation study, values for the involved parameters have to be fixed, such as the number of tables, the numbers of cases and controls, and the probability of a control being exposed. Since we assumed homogeneity, we only have to fix a value for the common odds ratio. This is of special importance because the difference between this value and the estimate obtained from the simulated data can be used to rank the estimators.
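The generation of simulated data under this design can be sketched as follows (our own code, not taken from [8]; under homogeneity, the case exposure probability p1 in each stratum is implied by psi and p0 via p1 = psi·odds0/(1 + psi·odds0) with odds0 = p0/(1 − p0)):

```python
import random

def simulate_tables(n_tables, n_cases, n_controls, p0_list, psi, rng):
    """Simulate one stratified data set under homogeneity: every
    stratum shares the common odds ratio psi, while the control
    exposure probability p0 may vary across strata.  Returns a list
    of tables (a, b, c, d) = (exposed cases, unexposed cases,
    exposed controls, unexposed controls)."""
    tables = []
    for k in range(n_tables):
        p0 = p0_list[k]
        odds0 = p0 / (1 - p0)
        # case exposure probability implied by psi and p0
        p1 = psi * odds0 / (1 + psi * odds0)
        a = sum(rng.random() < p1 for _ in range(n_cases))
        c = sum(rng.random() < p0 for _ in range(n_controls))
        tables.append((a, n_cases - a, c, n_controls - c))
    return tables

rng = random.Random(0)
print(simulate_tables(2, 20, 60, [0.2, 0.3], 1.7, rng))
```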

Each choice of values for the parameters characterizes a certain constellation. From these characteristics, others can easily be derived, as for instance the ratio of cases and controls or the differences between the extreme values of the exposure probabilities in a certain parameter constellation. Note that most of the mentioned characteristics refer to the single strata. That means, for instance, there are ten different numbers


Table 1
Categories for the data characteristics and the assessment criteria

Number of tables S                 Odds ratio ψ
Small       S < 5                  Exactly 1        ψ = 1
Moderate    5 ≤ S < 10             Small            1 < ψ ≤ 2
Large       10 ≤ S < 50            Moderate         2 < ψ ≤ 7
Very large  50 ≤ S                 Large            7 < ψ

Number of cases CA                 Number of controls CO
Small       CA ≤ 5                 Small            CO ≤ 5
Moderate    5 < CA ≤ 20            Moderate         5 < CO ≤ 20
Large       20 < CA ≤ 100          Large            20 < CO ≤ 100
Very large  100 < CA               Very large       100 < CO

Ratio CO/CA R                      Gini ratio GR
Balanced         R ≤ 1.25          Balanced         GR = 0
Medium balanced  1.25 < R ≤ 3      Medium balanced  0 < GR ≤ 0.5
Unbalanced       3 < R             Unbalanced       0.5 < GR ≤ 1

Probability of a control to be     Differences of the probabilities of a
exposed P                          control to be exposed D
Low         P ≤ 0.3                Small            D ≤ 0.2
Centered    0.3 < P ≤ 0.7          Large            0.2 < D
High        0.7 < P

Bias B                             M.S.E. M
Very small  B < 0.005              Very small       M < 0.01
Small       0.005 ≤ B < 0.05       Small            0.01 ≤ M < 0.1
Moderate    0.05 ≤ B < 0.5         Moderate         0.1 ≤ M < 1
Large       0.5 ≤ B                Large            1 ≤ M

of cases in a parameter constellation with ten contingency tables. Thus, the description of the data characteristics including all tables is very complex. To reduce this complexity, we calculated averages of the values of the single strata. As an additional characteristic, we calculated the well-known Gini ratio. This measures how balanced the characteristics are across the strata.

As essential criteria for the assessment of the estimators, we calculated the means of the estimated biases and mean squared errors (M.S.E.) from 1000 simulation runs for each parameter constellation. The bias measures the deviation of the estimate from the true parameter value, whereas the M.S.E. can be regarded as a measure of the variability. Based on the bias and the M.S.E., a ranking of the estimators can be given, where these two criteria have to be combined appropriately. This aspect is addressed in the following section.
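As an illustration of this assessment step, the following sketch (our own code, function names are ours) computes the Mantel–Haenszel estimate ψ̂_MH = (Σ_k a_k d_k / N_k) / (Σ_k b_k c_k / N_k) for a list of strata and then the empirical bias and M.S.E. over repeated estimates:

```python
def mantel_haenszel(tables):
    """Mantel-Haenszel estimator of the common odds ratio for a
    list of 2x2 tables (a, b, c, d) = (exposed cases, unexposed
    cases, exposed controls, unexposed controls)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

def bias_and_mse(estimates, psi):
    """Mean bias and mean squared error of simulated estimates
    relative to the true common odds ratio psi."""
    n = len(estimates)
    bias = sum(e - psi for e in estimates) / n
    mse = sum((e - psi) ** 2 for e in estimates) / n
    return bias, mse

# For a single stratum the estimator reduces to the simple odds ratio:
print(mantel_haenszel([(20, 80, 10, 90)]))  # (20*90/200)/(80*10/200) = 2.25
```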

3.2. Classification of the data properties and the assessment criteria

The rules to be learned for characterizing the estimators should not be too detailed, as for instance:

IF the true, but unknown, common odds ratio is 3.5 and there are 30 cases, THEN estimator X has a bias of 0.034.

This means we should use qualitative criteria instead of quantitative measures. This can be realized by dividing the initial criteria into appropriate categories. Fixing the bounds of the categories was the most difficult, but also one of the most important, tasks in modelling this domain. The definition of the categories is arbitrary, but of course the chosen categories strongly influence the rules which are discovered, as well as the complexity of the knowledge base.
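Such a categorization is a plain threshold mapping. A sketch for two of the characteristics, with the bounds as we read them from Table 1 (the code and function names are ours, and ψ ≥ 1 is assumed):

```python
def categorize_odds_ratio(psi):
    """Qualitative category for the common odds ratio
    (bounds from Table 1; assumes psi >= 1)."""
    if psi == 1:
        return "exactly 1"
    if psi <= 2:
        return "small"
    if psi <= 7:
        return "moderate"
    return "large"

def categorize_number_of_tables(s):
    """Qualitative category for the number of tables S
    (bounds from Table 1)."""
    if s < 5:
        return "small"
    if s < 10:
        return "moderate"
    if s < 50:
        return "large"
    return "very large"

print(categorize_odds_ratio(3.5), categorize_number_of_tables(7))
```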


With the categories presented in Table 1, the above-mentioned rule is subsumed by the following one:

IF the common odds ratio is moderate and there are many cases, THEN estimator X has small bias.

The conclusion contained in this rule, however, is not a recommendation. For obtaining the recommendations, we have to come back to the ranking and to the suitability of each estimator. The ranking is based on an appropriate combination of the ranks of an estimator regarding bias and M.S.E. The suitability is derived by combining the two categorized criteria bias and M.S.E. (For clarity, we omit the details of these approaches for combining bias and M.S.E.) Then, we rate the ranking with respect to the suitability. Hence, if estimator Y is placed first according to its ranking and if, in addition, its suitability is very good (i.e. very small values for bias and M.S.E.), the above rule could be formulated as follows:

IF the common odds ratio is moderate and there are many cases, THEN we recommend estimator Y, which is supposed to perform very well for data with the above characteristics.

3.3. Representation of the model

3.3.1. Knowledge representation in MOBAL
The system MOBAL [7] is an environment for building up, inspecting and changing a knowledge base. The items in the knowledge base are represented within a restricted higher-order predicate logic. The domain knowledge consists mainly of:

(i) facts, expressing relations, properties and concept membership of objects. They are represented as function-free literals without variables:

pred(Term1, Term2, ..., Termn)

(ii) rules, expressing inferential relations between predicates, necessary and sufficient conditions of concepts, and hierarchies of properties. They are represented as Horn clauses in which the premises and the conclusion literal may be negated:

pred1(Term1(1), ..., Termn1(1)) & ... & predm(Term1(m), ..., Termnm(m)) → predconcl(Term1(concl), ..., Termq(concl))

By forward chaining, new facts can be inferred from rules.

(iii) rule models (metapredicates), expressing the structure of the rules to be learned. A rule model is a rule in which predicate variables are used instead of actual predicates of an application domain. A predicate variable can be instantiated by a predicate symbol of the same arity. A fully instantiated rule model, where all predicate variables have been replaced by actual predicates, is a rule.

Further, the domain knowledge can include some other items, like a topology of predicates and sorts, which are not considered here in detail.

3.3.2. Domain model
The characteristics of the data as well as the assessment of the estimators are represented as facts. The following example describes the properties of the first parameter constellation. It contains 18 facts with two or three arguments. The third place is necessary for those data characteristics varying across the strata.

oddsratio(sit_1, 1.7).
number_of_strata(sit_1, 2).
diff_prob(sit_1, 0.1).
gini_ratio_cases(sit_1, 1).
gini_ratio_ratio(sit_1, 0).
gini_ratio_prob(sit_1, 1).
number_of_cases(sit_1, 1, 20).
number_of_cases(sit_1, 2, 30).
number_of_controls(sit_1, 1, 60).
number_of_controls(sit_1, 2, 90).
prob(sit_1, 1, 0.2).
prob(sit_1, 2, 0.3).
ratio_cc(sit_1, 1, 3).
ratio_cc(sit_1, 2, 3).
mean_cases(sit_1, 25).
mean_controls(sit_1, 75).
mean_prob(sit_1, 0.25).
mean_ratio(sit_1, 3).

oddsratio, number_of_strata, diff_prob, ... are predicates concerning the data characteristics. The arguments sit_1, sit_2, ... label the parameter constellations of the simulation study. They are the first arguments in all predicates. The values 1.7, 1.0, 2, ... are the arguments in the last place. They denote the values for the


corresponding data characteristics, e.g. the common odds ratio or the number of strata. The arguments 1, 2, 3, ... in the second place of the three-place predicates specify to which stratum the corresponding values for the data characteristics belong.

The following four facts characterize the assessment of the estimators. As above, the arguments in the first place label the parameter constellation. The second place marks the names of the estimators, e.g. the abbreviation mh for the Mantel–Haenszel and bl for the Breslow–Liang estimator. The last place marks the values for the mean squared error (mse) and the bias.

mse(sit_1, mh, 1.12436).
mse(sit_1, bl, 1.34329).
bias(sit_1, mh, 0.54718).
bias(sit_1, bl, 0.23197).

For all data characteristics and both assessment

criteria, the classification into categories is then achieved by rules like the following one:

number_of_strata(S,NS) & gt(NS,4) & le(NS,10) → moderate_number_of_strata(S).

The above-mentioned ranking and the assessed

suitabilities are derived and represented by similar rules. For further information concerning these rules see [10].

If we know the recommendations and the categorized suitabilities, then we can assess the recommendations according to the suitabilities with rules like the following ones:

recommendation(S,E) & good_suitability(S,E) → good_recommendation(S,E).
recommendation(S,E) & bad_suitability(S,E) → bad_recommendation(S,E).
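Read operationally, each of these rules joins two fact sets on the pair (situation, estimator). A minimal set-based reading in Python, with hypothetical facts:

```python
# Facts as sets of (situation, estimator) pairs (values are hypothetical).
recommendation = {("sit_1", "jmh_ii"), ("sit_2", "mh")}
good_suitability = {("sit_1", "jmh_ii")}
bad_suitability = {("sit_2", "mh")}

# recommendation(S,E) & good_suitability(S,E) -> good_recommendation(S,E)
good_recommendation = recommendation & good_suitability
# recommendation(S,E) & bad_suitability(S,E) -> bad_recommendation(S,E)
bad_recommendation = recommendation & bad_suitability
```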

3.4. Learning the recommendation rules

3.4.1. RDT
The rule discovery tool RDT [11], which is included in MOBAL, helps the user to find regularities in facts. It is a model-based learning algorithm that induces rules from facts. New facts can be derived using the learned rules. The necessary input to RDT consists of facts and rule models (metapredicates). RDT defines a hypothesis subspace that is actually searched via a set of explicitly spelled out hypothesis templates, the rule models. Thus, the hypothesis space consists of the set of all possible instantiations of rule models with domain predicates. For efficiently searching this hypothesis space, a generalization relation on the set of rule models is defined by suitably extending θ-subsumption for clauses [12] or the generalized θ-subsumption [13], respectively. According to Buntine, a clause C θ-subsumes a clause C′ (C ≥θ C′) if the more general clause C can be converted to the clause C′ by repeatedly turning variables into constants or other terms, adding atoms to the body, or partially evaluating the body of C by resolving with some clauses in the background knowledge.

This leads to the following definition of the generality relationship ≥RS between rule models:

A rule model R is more general than R′ (R ≥RS R′) if there exists a substitution σ applied to term variables and a substitution Σ applied to predicate variables that does not unify different predicate variables, such that RσΣ ⊆ R′. The substitution σ turns term variables into constants or other terms, and the substitution Σ renames or instantiates the predicate variables.

RDT searches this partial ordering top-down from the most general to the more specific hypotheses. Hypotheses are computed by instantiating the predicate variables of the rule models with predicate symbols. Then, these hypotheses are tested. There are three possible results for this test:
1. the hypothesis is too general, i.e. it covers too many negative or unknown instances,
2. the hypothesis is accepted, or
3. the hypothesis is too special, i.e. it covers too few positive instances.
A breadth-first search strategy is used, and those hypotheses that have already been accepted or pruned as too special are remembered to avoid exploring their specializations. The specializations of the former, i.e. the accepted hypotheses, are disregarded because they are redundant; the specializations of the latter are pruned because they can never be confirmed. Only in the case of a hypothesis being too general is the search continued.
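The search regime just described can be mimicked in a strongly simplified form (a toy reconstruction of the idea, not RDT itself; all data and names below are hypothetical): a hypothesis is a set of premises, specialization adds one premise, and accepted or too-special hypotheses are not specialized further:

```python
def rdt_like_search(properties, pos_count, neg_count, min_pos):
    """Toy general-to-specific breadth-first search in the spirit
    of RDT: a hypothesis is a set of property names; accepted and
    too-special hypotheses are not specialized further."""
    accepted = []
    frontier = [frozenset()]
    seen = {frozenset()}
    while frontier:
        next_frontier = []
        for hyp in frontier:
            if pos_count(hyp) < min_pos:
                continue                      # too special: prune
            if neg_count(hyp) == 0:
                accepted.append(hyp)          # accepted: stop here
                continue
            for prop in properties:           # too general: specialize
                spec = hyp | {prop}
                if spec not in seen:
                    seen.add(spec)
                    next_frontier.append(spec)
        frontier = next_frontier
    return accepted

# Hypothetical training data: properties of situations, and the
# situations in which some estimator was actually recommended.
situations = {
    "sit_1": {"moderate_odds_ratio", "many_cases"},
    "sit_2": {"moderate_odds_ratio", "few_cases"},
    "sit_3": {"large_odds_ratio", "many_cases"},
}
positive = {"sit_1", "sit_3"}

def covers(hyp):
    return {s for s, props in situations.items() if hyp <= props}

rules = rdt_like_search(
    {"moderate_odds_ratio", "many_cases", "few_cases", "large_odds_ratio"},
    lambda h: len(covers(h) & positive),
    lambda h: len(covers(h) - positive),
    min_pos=1,
)
```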


The premises of a rule model are incrementally instantiated. An ordering of the premises allows further pruning of the search for hypotheses, also within a single rule model. This ordering must take into account the bindings of the variables of the rule model. Based on the connection of a variable to the conclusion, a measure for the distance of this variable from the conclusion is defined. This measure is then used to determine the premise order. The connection of a variable X to the conclusion via the relation chain rc(X) is defined as follows:

A variable X occurring in the conclusion of a rule model is connected via the empty relation chain (rc(X) = ∅).

A variable Xi (1 ≤ i ≤ n) occurring in a premise R(X1, X2, ..., Xn) is connected via the relation chain rc(Xi) = R ∘ rc(Xj) if a variable Xj (1 ≤ j ≤ n, i ≠ j) of R(X1, X2, ..., Xn) is connected via the relation chain rc(Xj).

A variable can have more than one relation chain, but a rule model which contains an unconnected variable is not allowed. The distance of a variable X, denoted by d(X), is then defined as the length of the minimal relation chain connecting it to the conclusion. Thus, the order of premises can be defined as follows:

P ≤P P′, if min({d(X) | X occurring in P}) ≤ min({d(X) | X occurring in P′}).

While instantiating the premises of the rule scheme with respect to this order, we instantiate P before P′ if P ≤P P′. Then we can test all partially instantiated hypotheses in the same way as we test the fully instantiated rule model, if we drop the uninstantiated premises.
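One possible reading of these definitions in code (our sketch; a premise is a pair of predicate name and variable tuple) computes the distances by fixpoint iteration and sorts the premises accordingly:

```python
def variable_distances(conclusion_vars, premises):
    """Distance d(X): length of a shortest relation chain
    connecting variable X to the conclusion.  premises is a list
    of (predicate_name, variable_tuple) pairs."""
    dist = {v: 0 for v in conclusion_vars}   # rc(X) = empty chain
    changed = True
    while changed:
        changed = False
        for _, vars_ in premises:
            known = [dist[v] for v in vars_ if v in dist]
            if not known:
                continue
            d = min(known) + 1               # one relation longer
            for v in vars_:
                if dist.get(v, d + 1) > d:
                    dist[v] = d
                    changed = True
    return dist

def premise_order(conclusion_vars, premises):
    """Sort premises by the minimal distance of their variables."""
    dist = variable_distances(conclusion_vars, premises)
    return sorted(premises, key=lambda p: min(dist[v] for v in p[1]))
```

For a conclusion R(Sit, Est), a premise mentioning only Sit comes before one that reaches the conclusion only through an intermediate variable.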

The threshold for too few instances, and thus for pruning the search, is computed from a user-specified acceptance criterion. Several primitives for this criterion are defined as the cardinalities of: pos(H), the positive instances of a hypothesis H; neg(H), the negative instances of H; pred(H), the unknown, i.e. neither provably true nor provably false, instances of the conclusion which will be predicted by H; total(H), the total instances of H; unc(H), the instances of the conclusion which are uncovered by H; and concl(H), all instances of the conclusion. The acceptance criterion is a logical expression of conjunctions and disjunctions of arithmetical comparisons (i.e. =, <, ≤, ≥, >) involving arithmetical expressions (i.e. +, −, *, /) built from numbers and the above primitives, for example:

pos(H) > 4 & neg(H) < 1 & unc(H) < 0.9*total(H).

For all specializations of a hypothesis H, the numbers of positive, negative, and predicted instances are smaller than those numbers for the hypothesis H itself, whereas the number of uncovered instances grows. The number of instances of the conclusion does not change while specializing H. Using these relations among the primitives, a pruning criterion is derived from the acceptance criterion. It only prunes hypotheses which cannot be accepted.

Further effectiveness of the algorithm RDT comes from the use of a many-sorted logic and a topology of predicates [7].

3.4.2. The learning task and its results
Here, the goal of learning is to gain a characterization of the estimators concerning the categorized data characteristics. From now on, we only consider those data characteristics that do not vary across the different strata. The remaining nine data characteristics are classified into a total number of 22 categories. The goal predicates to learn about are the four classified (assessed) recommendations and additionally the unassessed recommendation. We carried out further learning steps with some different predicates, e.g. those for the suitabilities [10]. To define the hypothesis space for learning, RDT needs suitable metapredicates. We used the following ones:

MP1(S,P1,R): S(Est) & P1(Sit) → R(Sit,Est).
MP2(S,P1,P2,R): S(Est) & P1(Sit) & P2(Sit) → R(Sit,Est).
MP3(S,P1,P2,P3,R): S(Est) & P1(Sit) & P2(Sit) & P3(Sit) → R(Sit,Est).
MP4(S,P1,P2,P3,P4,R): S(Est) & P1(Sit) & P2(Sit) & P3(Sit) & P4(Sit) → R(Sit,Est).

While learning, S(Est) is instantiated with predicates that determine the variable Est in the conclusion, e.g. mantel_haenszel(Est). Using the first metapredicate we search for data characteristics where a single property is sufficient to recommend an estimator. In the next steps we search for combinations of two, three or four properties.

Page 9: CORA—A knowledge-based system for the analysis of case-control studies

K. Morik et al. / Computer Methods and Programs in Biomedicine 58 (1999) 35–50 43

Table 2
Results obtained from the process of learning

Goal predicates | MP | AC | Results/remarks
All | MP1 | All | No learned rules
very_good_recommendation | MP2 | pos > 0.9*total (pos = total) | No learned rules
good_recommendation | MP2 | pos = total | One learned rule for JMH II; time for learning: 3343 s
medium_recommendation | MP2 | pos > 0.9*total (pos = total) | No learned rule
bad_recommendation | MP2 | pos > 0.9*total (pos = total) | No learned rule
recommendation | MP2 | pos = total | Eight learned rules for JMH II; time for learning: 7137 s
recommendation | MP2 | pos > 0.9*total | Ten rules for JMH II; time for learning: 8452 s
very_good_recommendation | MP3 | pos > 0.9*total (pos = total) | No learned rules
good_recommendation | MP3 | pos = total | Nine learned rules for JMH II; time for learning: 30044 s
good_recommendation | MP3 | pos > 0.9*total | 18 learned rules for JMH II; time for learning: 31223 s
medium_recommendation | MP3 | pos = total | 24 learned rules for JMH II; time for learning: 32678 s
bad_recommendation | MP3 | pos = total | One learned rule for c. MH; time for learning: 29452 s
very_good_recommendation | MP4 | pos = total | No learned rules
good_recommendation | MP4 | pos = total | ≈80 rules for JMH II and JWI
medium_recommendation | MP4 | pos = total | ≈100 rules; learning process failed (a)
bad_recommendation | MP4 | pos = total | 12 rules for c. MH

(a) The learning process was stopped after 1 week.

In the represented knowledge base, we do not want to infer new facts from the learned rules. Hence, we consider two acceptance criteria: the first one does not allow predicted facts at all, while the second one allows a maximum of 0.1*total predicted facts. The main results of the learning step are summarized in Table 2.

3.4.3. Selecting a rule set

The goal now is to select a rule set from the learned rules, which is then integrated into the KBS. As criterion for this selection, we mainly consider the redundancy in the rule set. Frequently, different rules cover the same data constellations. The following example illustrates this:

We consider two rules:

(1) mantel_haenszel(mh) & oddsratio=1(S) & large_cases(S) & large_strata(S) → medium_recommendation(S,mh).

(2) mantel_haenszel(mh) & oddsratio=1(S) & large_cases(S) & balanced_gini_ratio_cases(S) → medium_recommendation(S,mh).

These two rules only differ from each other in the last premise. Both rules cover the first six parameter constellations of the simulation study. The reason for this is that there is a relation between the data characteristics small_strata and balanced_gini_ratio_cases in the simulation study at issue. This relation is fixed in the simulation design: in situations with few strata, the number of cases is always uniformly distributed over the strata.

Because of relations like this, redundant rules have been learned. Without going into detail, we would like to mention that our choice of only one rule out of a set of redundant rules is based on certain data characteristics [10].
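The notion of redundancy used here can be sketched as follows: two rules are redundant with respect to the simulation study if their premises cover exactly the same parameter constellations. The situations dictionary below is hypothetical toy data, not the actual simulation design.

```python
# each simulated situation is described by the set of categorized
# data characteristics that hold for it (hypothetical example data)
SITUATIONS = {
    "s1": {"oddsratio_1", "large_cases", "large_strata", "balanced_gini_ratio_cases"},
    "s2": {"oddsratio_1", "large_cases", "large_strata", "balanced_gini_ratio_cases"},
    "s3": {"small_oddsratio", "large_cases", "small_strata"},
}

def coverage(premises, situations=SITUATIONS):
    # the set of situations whose properties satisfy all premises of the rule
    return {name for name, props in situations.items() if premises <= props}

rule1 = {"oddsratio_1", "large_cases", "large_strata"}
rule2 = {"oddsratio_1", "large_cases", "balanced_gini_ratio_cases"}
redundant = coverage(rule1) == coverage(rule2)  # True for this toy design
```

Because large_strata and balanced_gini_ratio_cases co-occur in every situation of this toy design, the two rules have identical coverage, mirroring the dependency fixed in the simulation design described above.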

For the second metapredicate the following five rules are selected:

unbalanced_ratio_CoCa(S) & large_strata(S) → good_recommendation(S,jk)


small_oddsratio(S) & unbalanced_ratio_CoCa(S) → recommendation(S,jk)

small_prob(S) & small_strata(S) → recommendation(S,jk)

gini_ratio_CoCa_balanced(S) & oddsratio_exactly_one(S) → recommendation(S,jk)

gini_prob_unbalanced(S) & small_oddsratio(S) → recommendation(S,jk)

We selected six rules for the third metapredicate:

oddsratio_exactly_one(S) & large_cases(S) & small_strata(S) → moderate_recommendation(S,jk)

small_oddsratio(S) & large_cases(S) & small_strata(S) → moderate_recommendation(S,jk)

moderate_strata(S) & oddsratio_exactly_one(S) & large_cases(S) → good_recommendation(S,jk)

large_strata(S) & oddsratio_exactly_one(S) & large_cases(S) → good_recommendation(S,jk)

small_strata(S) & oddsratio_exactly_one(S) & very_large_cases(S) → good_recommendation(S,jk)

gini_cases_medium_balanced(S) & large_difference_prob(S) & large_oddsratio(S) → bad_recommendation(S,mh)

Finally, there are seven rules which lead to a recommendation of an estimator based on a combination of four data characteristics:

large_strata(S) & small_oddsratio(S) & large_cases(S) & small_prob(S) → good_recommendation(S,jk)

moderate_strata(S) & small_oddsratio(S) & small_prob(S) & large_cases(S) → moderate_recommendation(S,jk)

moderate_strata(S) & moderate_oddsratio(S) & small_prob(S) & large_cases(S) → moderate_recommendation(S,jk)

moderate_strata(S) & small_oddsratio(S) & small_prob(S) & very_large_cases(S) → good_recommendation(S,jk)

large_strata(S) & large_oddsratio(S) & large_cases(S) & centered_prob(S) → bad_recommendation(S,mh)

large_strata(S) & small_oddsratio(S) & large_cases(S) & large_difference_prob(S) → good_recommendation(S,w_jk)

large_strata(S) & large_oddsratio(S) & large_cases(S) & small_prob(S) → recommendation(S,jk_ii)

Using the traces of the learning process for metapredicate MP4 and with the help of the experts, we proposed some possible rules, entered them into the system MOBAL, and investigated how many new facts were predicted by each rule. Thus, the following rules were discovered and accepted:

small_strata(S) & large_oddsratio(S) & very_large_cases(S) & centered_prob(S) & small_difference_prob(S) → good_recommendation(S,jk)

large_strata(S) & large_oddsratio(S) & very_large_cases(S) & centered_prob(S) & small_difference_prob(S) → moderate_recommendation(S,jk)

small_strata(S) & large_oddsratio(S) & large_cases(S) & unbalanced_ratio_CoCa(S) & large_difference_prob(S) → bad_recommendation(S,w_jk)

moderate_strata(S) & centered_prob(S) & small_difference_prob(S) & large_oddsratio(S) & large_cases(S) → bad_recommendation(S,jk_ii)

small_strata(S) & large_oddsratio(S) & large_cases(S) & medium_balanced_ratio_CoCa(S) & unknown(small_prob(S)) → bad_recommendation(S,woolf)

The rules listed above do not predict any new facts in the represented knowledge base. Additionally, we discovered a rule with six premises. This rule also does not predict a new fact there:

small_strata(S) & large_oddsratio(S) & large_cases(S) & unbalanced_ratio_CoCa(S) & small_difference_prob(S) & centered_prob(S) → bad_recommendation(S,bl)

3.4.4. Evaluating the rule set

With the help of MOBAL's Rule Restructuring Tool (RRT) [14], we have evaluated the selected rule set. Table 3 depicts the results of the evaluation according to the criteria completeness, correctness, redundancy, rule length, and number of covered instances.

Table 3
Results of the evaluation

Completeness (%): 40
Correctness (%): 93
Redundancy: No
Number of premises: 3.6
Number of variables: 1
Number of constants: 1
Covered instances: 4

3.4.5. Integration of the selected rules into the KBS

To develop the advice tool, we integrate the selected rules into the KBS. The advice tool then analyses the actual case-control data with respect to their characteristics and selects a rule whose premises are covered by these ascertained characteristics. This rule recommends an appropriate estimator for the common odds ratio. For efficiently encoding the rules, we have to determine an order for querying the data characteristics. Additionally, we have to fix an order for the rules, because there are data constellations where more than one rule could fire. If there are many such data situations, it is advisable to revise the rule set; such overlapping rules have been learned because the simulation study does not cover all combinations of characteristics that will occur in practice. The former order is achieved by considering the frequencies of the data characteristics: the most frequent characteristic, i.e. the value of the odds ratio, is queried first; for every category of this data characteristic the frequencies of the remaining properties are determined, and so on. The latter order (of the rules) is determined by the number of positive examples for the rules. Table 4 depicts a cutout of the resulting decision tree(s). We search for an applicable rule (a recommendation) top down in the left tree; if we cannot find any rule in this tree, we start searching in the tree on the right-hand side (Table 4).

Table 4
Cutout of the resulting decision tree(s)
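The query-order heuristic of Section 3.4.5 can be sketched as follows: the data characteristic appearing most frequently in the rule premises is asked first. The premise lists below are a hypothetical excerpt; the full procedure additionally recomputes the frequencies within every branch of the tree.

```python
from collections import Counter

# hypothetical excerpt: each rule reduced to the characteristics it queries
rules = [
    ["oddsratio", "cases", "strata"],
    ["oddsratio", "ratio_CoCa"],
    ["oddsratio", "strata"],
]

def query_order(rule_premises):
    # count how often each characteristic occurs across all rule premises
    freq = Counter(ch for premises in rule_premises for ch in premises)
    # most frequent characteristic first; ties broken alphabetically
    return sorted(freq, key=lambda ch: (-freq[ch], ch))
```

For this excerpt, the odds ratio is queried first (it occurs in every rule), matching the ordering described in the text.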


Fig. 2. First page of the form for the recommendation.

4. Design

CORA is a Windows application. The user interface is composed of forms that are created with the tool Delphi [15]. Delphi simplified the implementation of the graphical interface and additionally allowed us to encode the expertise and all statistical procedures.

In the following sections we give a short description of the design of the advice and the analysis tool as well as of the help system.

4.1. Design of the advice tool

The advice tool gives a recommendation of an estimator for the common odds ratio. This recommendation is based on two data sets, the data to be analysed and the data of the pilot study, where the pilot study could be an external study carried out in advance or, for instance, a random sample of size 10% drawn from the original data set. The data of the pilot study are only used for deriving the recommendation and should not be used for further analyses. To give this recommendation, three steps are necessary:
- the investigation of the two data sets,
- the classification of the investigated characteristics of the data, and
- the choice of an appropriate rule out of the set of rules integrated in CORA.
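The second pilot-study option mentioned above can be sketched as a simple random split: roughly 10% of the records are set aside for deriving the recommendation, so that the pilot data stay separate from the data used in the final analysis. The helper function below is a hypothetical illustration, not part of CORA.

```python
import random

def split_pilot(n_records, fraction=0.10, seed=1):
    """Split record indices 0..n_records-1 into a pilot sample of about
    `fraction` of the records and the remaining analysis set."""
    rng = random.Random(seed)
    k = max(1, round(fraction * n_records))
    pilot = set(rng.sample(range(n_records), k))
    analysis = [i for i in range(n_records) if i not in pilot]
    return sorted(pilot), analysis
```

Keeping the two index sets disjoint reflects the requirement stated above that the pilot data must not be reused in the subsequent analysis.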

All these steps are within the scope of the system; that means, there are procedures to calculate the values of the data characteristics, procedures for their classification, and a procedure that implements the decision tree in the form of production rules by using nested if-then statements. Every if-statement corresponds to a premise of a rule that examines a classified data characteristic. The corresponding then-statement represents the conclusion of the production rule.

While implementing this decision tree, the order of the rules mentioned in Section 3.4.5 has to be taken into account. The search for an appropriate rule then stops as soon as the most preferable one has fired.
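The nested if-then encoding described above can be sketched with two of the selected two-premise rules; the dictionary keys, category names and the preference order shown here are illustrative assumptions, not CORA's actual implementation.

```python
def recommend(chars):
    """chars maps a data characteristic to its classified category,
    e.g. {"strata": "large", "ratio_CoCa": "unbalanced", ...}."""
    if chars.get("ratio_CoCa") == "unbalanced":
        # rule: unbalanced_ratio_CoCa(S) & large_strata(S)
        #       -> good_recommendation(S,jk)
        if chars.get("strata") == "large":
            return ("good_recommendation", "jk")
        # rule: small_oddsratio(S) & unbalanced_ratio_CoCa(S)
        #       -> recommendation(S,jk)
        if chars.get("oddsratio") == "small":
            return ("recommendation", "jk")
    return None  # no rule fired; the search would continue in the second tree
```

Because the rules are checked in a fixed order, the first (most preferable) rule whose premises all hold determines the recommendation, exactly as described for the decision tree.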

The interface for the advice tool consists of a three-page form. On the first page (Fig. 2), the proposed point estimator and a possible bias correction are presented, as well as information about the degree of suitability of the estimator and the number of the rule by which this recommendation is derived.

The second page (Fig. 3) shows the characteristics of the data that the user wants to analyse, namely the number of tables, the mean number of cases, the balance of the number of cases, the mean ratio of cases and controls, and the balance of this ratio. Both the exact values and the classified characteristics are shown.


Fig. 3. Second page of the form for the recommendation.

The characteristics of the pilot data are finally listed on the third page: the estimated common odds ratio, the mean probability for a control being exposed and its balance, as well as the differences between the extreme probabilities.

For all these characteristics, the exact values and their categorized versions are presented. The characteristics occurring as premises in the applied rule are checked. Thus, all information from the production rules is transparent to the user. Further information, e.g. the coverage of the applied rule, is so far not represented on the form.

4.2. Design of the analysis tool

As mentioned before, CORA is not only an intelligent interface but also includes the statistical procedures used: there are two methods to stratify the data, five procedures for analysing homogeneity and independence of risk factor and disease, as well as the above-mentioned methods to estimate the common odds ratio (Fig. 4). Additionally, the user has the possibility to get a general view of the data. The form depicted in Fig. 5 shows the data as stratified 2×2 contingency tables. Repeated calls of this form allow the user to view several tables simultaneously. The second page of this form shows the corresponding estimates for the individual odds ratios, their estimated variances, and the confidence intervals. Two other forms show listings of all zero cells in the data set and of all individual odds ratios. There are two additional forms where defaults regarding the statistical methods can be set.
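One of the estimators offered by the analysis tool, the classical Mantel-Haenszel point estimator of the common odds ratio [3], can be sketched as follows. The cell order per stratum is assumed to be (a, b, c, d) = (exposed cases, unexposed cases, exposed controls, unexposed controls); zero-cell handling and variance estimation are omitted from this sketch.

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel common odds ratio over stratified 2x2 tables:
    sum_i(a_i * d_i / n_i) / sum_i(b_i * c_i / n_i), n_i the stratum total."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den
```

For a single stratum the estimate reduces to the ordinary odds ratio ad/bc of that 2×2 table; with several strata it is a weighted combination that is stable even for sparse tables.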

4.3. Design of the help system

The help system was created using the Windows help compiler. There are two different kinds of topics: on the one hand, there are the statistical and epidemiological topics; on the other hand, the user can get information about CORA and instructions on how to use it. In particular, the use of the advice tool and its foundations are explained in this help system.

Every form of CORA contains a help button that allows the user to jump to a help topic describing all its components. From there, it is possible to reach the relevant statistical topics.

The system also addresses users with limited statistical experience. Hence, the steps and basic concepts of a stratified contingency table analysis are explained. The statistical procedures are described and their finite-sample and asymptotic properties are stated. Thus, these help topics complement the recommendations given by the advice tool.


Fig. 4. Form for the estimation of the common odds ratio.

5. Discussion

Two different aspects were important when developing CORA. On the one hand, we analysed Monte-Carlo studies using AI techniques to obtain rules to be implemented in CORA. On the other hand, these rules for recommending a certain estimator in a given data situation, together with the complete support system, formed the essential part of our statistical analysis system.

In spite of efficiency problems due to the MOBAL system, past experience has shown that with the use of AI techniques we are able to improve the evaluation of Monte-Carlo studies by more specific results. This improvement is not only achieved by machine learning but also by modelling the knowledge and by examining existing rules with the help of MOBAL.

While working with MOBAL, the domain expert is able to recognize his/her own way of evaluating Monte-Carlo studies and thus can follow the steps of modelling the expertise.

An important benefit from using machine learning is that more extensive studies can also be analysed without much additional effort. For an evaluation with conventional techniques, the present Monte-Carlo study with its 240 investigated parameter constellations is already very complex; thus, more extensive studies cannot be carefully analysed with reasonable effort. But especially enlarged studies would be important to improve the quality of the rule set and hence to lead to rules that cover a wider range of potential data situations. Moreover, the modelling of these Monte-Carlo studies has revealed the importance of further examining especially those situations with a poor behaviour of the estimators. This would enable the system to give negative recommendations, that is, warnings about which estimator the user should avoid for analysing the given data.

Fig. 5. Form for the contingency tables.

In addition, the expertise should be completed by rules based on asymptotic properties of the estimators and on characteristics resulting from the calculation of the estimators, and thus being independent of the data structure. Within the scope of this approach it is quite simple to enlarge the rule set according to these aspects.

Other possible expansions of the underlying statistical expertise concern the fact that CORA only supports the decision process of the user with respect to the choice of an appropriate point estimator of the common odds ratio. The presented approach can, however, also be applied to support other choices of statistical methods that have to be made by the user during the analysis, e.g. concerning an appropriate test of homogeneity of the individual odds ratios or a test of independence of risk factor and disease.

In addition to these points, future work should provide such systems with better methods for the graphical representation of data, which would increase the insight into the data.

Of course, the system could be further extended to support the user not only when analysing the data, but also when planning the study and collecting the data.

Acknowledgements

Financial support by the Bennigsen-Foerder-Grant of Nordrhein-Westfalen is gratefully acknowledged.

References

[1] J.B.S. Haldane, The estimation and significance of the logarithm of a ratio of frequencies, Ann. Hum. Genet. 20 (1955) 309-311.

[2] B. Woolf, On estimating the relation between blood group and disease, Ann. Hum. Genet. 19 (1955) 251-253.

[3] N. Mantel, W. Haenszel, Statistical aspects of the analysis of data from retrospective studies of disease, J. Natl. Cancer Inst. 22 (1959) 719-748.

[4] N.E. Breslow, K.Y. Liang, The variance of the Mantel-Haenszel estimator, Biometrics 38 (1982) 943-952.

[5] I. Pigeot, A jackknife estimator of the combined odds ratio, Biometrics 47 (1991) 373-381.

[6] K. Morik, Sloppy modeling, in: K. Morik (Ed.), Knowledge Representation and Organization in Machine Learning, Springer-Verlag, Berlin, 1989, pp. 107-134.

[7] K. Morik, S. Wrobel, J.U. Kietz, W. Emde, Knowledge Acquisition and Machine Learning: Theory, Methods and Applications, Knowledge-Based Systems, Academic Press, London, 1993.

[8] I. Pigeot, A simulation study of estimators of a common odds ratio in several 2×2 tables, J. Stat. Comp. Simul. 38 (1991) 65-82.

[9] I. Pigeot, Jackknifing estimators of a common odds ratio from several 2×2 tables, in: K.H. Jockel, G. Rothe, W. Sendler (Eds.), Bootstrapping and Related Techniques, Springer-Verlag, Berlin, 1992, pp. 203-212.

[10] U. Robers, Entwicklung eines wissensbasierten Assistentensystems zur Analyse von Fall-Kontroll-Studien [Development of a knowledge-based assistant system for the analysis of case-control studies], Master's thesis, University of Dortmund, Department of Computer Science VIII, 1995.

[11] J.U. Kietz, S. Wrobel, Controlling the complexity of learning in logic through syntactic and task-oriented models, in: S. Muggleton (Ed.), Inductive Logic Programming, APIC Series 18, Academic Press, London, 1992, pp. 335-360.

[12] G.D. Plotkin, A note on inductive generalization, in: B. Meltzer, D. Michie (Eds.), Machine Intelligence, Elsevier, New York, 1970, pp. 153-163.


[13] W. Buntine, Generalized subsumption and its applications to induction and redundancy, Artif. Intell. 36 (1988) 149-176.

[14] E. Sommer, MOBAL's theory restructuring tool RRT, ESPRIT Project ILP (6020), ILP Deliverable GMD 2.2, 1995.

[15] Borland GmbH (Ed.), Delphi for Windows 1.0, User Guide, 1995.
