EXECUTIVE SUMMARY
Because data analysis suffers from the problem of missing data, methods are needed that reliably support the analysis. A large body of research over recent decades has been devoted to developing such methods.
The primary objective of this project is to evaluate techniques for analyzing missing data. The evaluation involves several steps:
• Understanding behavior or mechanism of missing data
• Studying and analyzing the available methods
• Choosing few methods to be tested
• Coding each method in MATLAB
• Running test data set on each method
Reweighting has been chosen as the method of design. The reweighting introduced in this project differs slightly from the methods available in the literature. The approaches used in the reweighting method include:
• Minimizing the sum of squared residuals by applying weights
• Regressing the observed data against the weighted data
Reweighting applies Bayesian Decision Theory, Multivariate Density and the Discriminant Function in its development. The result is then tested using Hypothesis Testing to compare the reweighted distribution with the complete-data distribution. The reweighting technique has been successfully implemented. Nevertheless, its robustness across different types of data has not been proved. In addition, a naïve way of assigning weights has limited its actual performance.
It is recommended that a better way of producing weights be implemented. With a more systematic and principled weight assignment, a better result should be expected.
ACKNOWLEDGEMENTS
A million thanks to my supervisor, Dr Robert F Harrison, for his precious ideas and helping hands during the period of the project.
Not to be forgotten, to my family and friends who have given moral support in the battle of completing this masterpiece.
TABLE OF CONTENTS
ABSTRACT 1
EXECUTIVE SUMMARY 2
ACKNOWLEDGEMENTS 3
CHAPTER 1 – INTRODUCTION 6
CHAPTER 2 – ANALYSIS WITH MISSING DATA 7
2.1 MISSING DATA MECHANISM 8
CHAPTER 3 – REQUIREMENTS AND ANALYSIS 11
3.1 DATA SET 11
3.2 MISSING DATA TECHNIQUES 12
3.2.1 DELETION 13
3.2.1.1 LISTWISE DELETION 14
3.2.1.2 PAIRWISE DELETION 14
3.2.1.3 CASEWISE DELETION 14
3.2.2 IMPUTATION 15
3.2.2.1 SIMPLE MEAN IMPUTATION 15
3.2.2.2 REGRESSION MEAN IMPUTATION 16
3.2.2.3 LINEAR REGRESSION 16
3.2.2.4 LOGISTIC REGRESSION 18
3.2.2.5 GENERALIZED LINEAR REGRESSION 19
3.2.2.6 HOT DECK IMPUTATION 19
3.2.2.7 COLD DECK IMPUTATION 20
3.2.3 REWEIGHTING 20
3.2.4 FULL INFORMATION MAXIMUM LIKELIHOOD 21
3.2.5 MULTIPLE IMPUTATION 21
CHAPTER 4 – METHOD OF APPROACH 22
4.1 BAYESIAN DECISION THEORY 22
4.2 MULTIVARIATE DENSITY 23
4.3 DISCRIMINANT FUNCTION 23
4.4 HYPOTHESIS TEST 24
4.5 REWEIGHTING 24
CHAPTER 5 – RELATED WORKS 27
CHAPTER 6 – RESULTS 29
6.1 OVERALL MEAN IMPUTATION 29
6.2 CLASS MEAN IMPUTATION 29
6.3 LINEAR REGRESSION 30
6.4 REWEIGHTING 30
6.5 FURTHER WORKS ON REWEIGHTING 31
CHAPTER 7 – CONCLUSION 45
REFERENCES 46
APPENDIX A
MATLAB CODE FOR OVERALL MEAN IMPUTATION 47
APPENDIX B
MATLAB CODE FOR CLASS MEAN IMPUTATION 50
APPENDIX C
MATLAB CODE FOR LINEAR REGRESSION 53
APPENDIX D
MATLAB CODE FOR REWEIGHTING 56
1 INTRODUCTION
A complete set of data is crucial for statistical analysis. However, it is nearly impossible to obtain "perfect" data sets, as missing values have always been a constraint in analysis. Missing values appear for several reasons: non-respondents in a survey, data lost to intermittent faults in a machine, 'wild' values that are impossible to observe being removed, and costly data being omitted from a data set.
In such cases, we cannot produce reliable, high-confidence estimates, since missing values affect properties of the data set such as its means, variances and percentiles. Depending on the mechanism of the missing data, we can opt either to ignore the missing values or to fill in sensible replacement values.
Research has been carried out on how to analyze a data set that suffers from missing values. Basically, there are three options for treating missing data: listwise deletion (LD), imputation and maximum likelihood. The first two options are categorized as complete-case analysis, while the last is incomplete-case analysis. Each option is described further in Chapter 3.
Which technique is appropriate depends on the mechanism of the data 'missingness'. The mechanisms are missing at random (MAR), missing completely at random (MCAR) and missing not at random (MNAR); they are described in detail in Chapter 2. In the analysis, however, missing cases will be assumed to be MCAR.
The goal of this project is to examine the available techniques and the strengths and weaknesses of each of them. Finally, reweighting will be used as the method of choice. Specific details of the design approach are documented in Chapter 4.
To evaluate the performance of the different methods, simulations will be run. The analysis of the simulated methods is illustrated in Chapter 6. Note that only continuous data are considered in this project.
2 ANALYSIS WITH MISSING DATA
Analyzing missing data has long been a major challenge in statistical analysis. When the amount of missing data is significant, it results in biased estimation and can lead to wrong inferences.
Earlier, data sets with missing values were analyzed by simply ignoring the missing values. By pretending that the missing samples were not present, the affected cases could be dropped, consequently reducing the sample size. Traditionally, deletion methods have been used. These methods were quite successful when a large amount of data was available, as deleting the unobserved data did not cause serious harm to the estimation. Deletion remains the default way of handling missing data in many statistical packages.
In 1987, however, researchers, namely Little & Rubin and Lepkowski, Landis & Stehouwer, started to address better ways of analyzing missing data. The concern arose from the significant amount of information lost by deletion methods. Imputation methods were then suggested to fill in the missing values so that the sample size would not be distorted.
Initially, the simplest mean imputation was favoured among researchers. However, because of its effect on the population distribution, further imputation techniques were studied: cold deck imputation, hot deck imputation, regression imputation and, most recently, similar response pattern imputation.
Extending the idea of single imputation, the multiple imputation method was later proposed by Rubin (1987). Imputation, however, requires complicated computation, especially multiple imputation, which reduces the efficiency of the method. Imputation methods also require supplementary data to be introduced.
Later, because of the complexity of imputation methods, full information maximum likelihood (FIML) became an attractive alternative for handling missing data. Even so, this method has not been practiced successfully: convergence issues have caused failures in its implementation, and convergence failure on binary data has further limited its capability.
2.1 MISSING DATA MECHANISM
Missing data in statistical analysis may appear in several patterns depending on the reasons. A few reasons that contribute to missing data were introduced earlier. First, in a survey, there are total non-response cases (Figure 2.1), in which an entire row of data is empty. Possible causes of total non-response in a survey are:
• The respondent is absent
• The respondent refuses to answer the survey questions, for example because of time constraints
     X1  X2  X3  X4  X5
  1   x   x   x   x   x
  2   x   x   x   x   x
  3   ?   ?   ?   ?   ?
  4   x   x   x   x   x
  5   x   x   x   x   x

Figure 2.1 Total Non-Response
In fields where data collection is costly, for example obtaining data from instrument interfaces, it is very likely that those values will be left out, resulting in a data set with partial non-response on one component. This creates an empty column in the data set, as illustrated in Figure 2.2; it is also known as univariate non-response. In multivariate non-response, several variables are missing for certain cases, as in Figure 2.3.
     X1  X2  X3  X4  X5
  1   x   x   x   ?   x
  2   x   x   x   ?   x
  3   x   x   x   ?   x
  4   x   x   x   ?   x
  5   x   x   x   ?   x

Figure 2.2 Partial Non-Response, Univariate Pattern
     X1  X2  X3  X4  X5
  1   x   x   x   ?   ?
  2   x   x   x   ?   ?
  3   x   x   x   ?   ?
  4   x   x   x   x   x
  5   x   x   x   x   x

Figure 2.3 Partial Non-Response, Multivariate Pattern
There are also partial non-response cases (Figure 2.4) in which only certain items are missing for certain variables. This may be due to the removal of 'wild' values, or to respondents not answering all questions.
     X1  X2  X3  X4  X5
  1   x   ?   x   x   x
  2   x   x   ?   x   x
  3   x   x   x   x   ?
  4   x   ?   x   x   x
  5   ?   x   x   x   ?

Figure 2.4 Partial Non-Response, Haphazard Pattern
The pattern of missing data is monotone when, once a case stops responding, it remains missing for all subsequent variables. This typically occurs with attrition in longitudinal surveys. The monotone non-response pattern is shown in Figure 2.5.
     X1  X2  X3  X4  X5
  1   x   x   x   x   x
  2   x   x   x   x   ?
  3   x   x   x   ?   ?
  4   x   x   ?   ?   ?
  5   x   ?   ?   ?   ?

Figure 2.5 Monotone Non-Response Pattern
There are three mechanisms for the occurrence of missing data, two of which are forms of missing at random and the last of which is not missing at random. Little and Rubin [1] draw the distinction between missing completely at random (MCAR) and missing at random (MAR).
A missing case is categorized as MCAR if the probability of being missing is independent of the value of any variable. Mathematically, this can be expressed as P(Y | y missing) = P(Y | y observed), where Y is the random variable under study. Since the 'missingness' is uncorrelated with any variable, the distribution of Y is not affected by the missing values.
In the case of MAR, P(Y | y missing, Z) = P(Y | y observed, Z), where Z is a set of observed variables. In other words, the probability of Y being missing is not correlated with the value of Y after controlling for the observed variables Z, and the distribution of Y given Z is not altered. It is clear that MCAR is a stronger condition than MAR.
If neither the MAR nor the MCAR condition holds, the 'missingness' is considered MNAR. As the name suggests, the pattern of missingness is not random and is associated with the variable on which the data are missing.
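The difference between MCAR and MNAR can be made concrete with a small simulation. The sketch below is illustrative Python/NumPy rather than the project's MATLAB, and the sample size and missingness probabilities are assumptions, not values from the report: under MCAR the observed mean stays close to the full-sample mean, while under MNAR it is biased.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=10000)

# MCAR: every value has the same 20% chance of being missing,
# independent of y itself.
mcar_mask = rng.random(y.size) < 0.2

# MNAR: large values are more likely to be missing, so the
# "missingness" depends on the (possibly unobserved) value of y.
mnar_mask = rng.random(y.size) < np.where(y > 5.0, 0.35, 0.05)

# Under MCAR the mean of the observed values stays close to the
# full-sample mean; under MNAR it is biased downwards.
mean_full = y.mean()
mean_mcar = y[~mcar_mask].mean()
mean_mnar = y[~mnar_mask].mean()
```

This is why MCAR is the convenient assumption made throughout the project: ignoring MCAR holes leaves the distribution of Y intact, whereas ignoring MNAR holes does not.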
3 REQUIREMENTS AND ANALYSIS
As discussed earlier, there are a number of techniques for treating missing data. How well each works depends on where it is used. For example, in a field where the amount of data is a constraint, deletion methods and full information maximum likelihood might not produce good results. Even if the results are not misleading, the level of confidence in the inferences made is low.
Since decisions are crucial in industries such as health monitoring and plant control, producing highly accurate estimates is a major challenge, because a perfect set of data is very unlikely to be available. Misleading results can hardly be tolerated where lives are at stake and mistakes are costly. As such, the most appropriate technique has to be used to analyze such cases.
A few techniques, namely overall mean imputation, class mean imputation, linear regression and reweighting, will be used, and their results will be compared. How well each method works is determined by its classification result compared with the complete-data result.
3.1 DATA SET
A set of data was created using the MATLAB random number generator. For consistency, the generator was initially set to a specified state.
To reflect a realistic case, the data were generated to be multivariate. For the analysis, three-variable data with 500 cases were generated. The data distribution was assumed to be multivariate Gaussian, with the same mean and variance for each variable. The variance of the data set makes a significant contribution to the analysis later.
For simplicity, the data were partitioned into two classes, Class 1 and Class 2: the first 250 cases are categorized as Class 1 and the remainder as Class 2. Bayesian Decision Theory is applied to determine the decision boundary; the theory is covered in detail in the next chapter.
To create missing cells in the data set, the random generator is again used, giving a missing-completely-at-random mechanism. As we will see later, the percentage of missing data also has to be considered when selecting a technique. Since the missing cells are randomly distributed, the amount of missing data in each class is not necessarily the same.
The missing cells were first assigned the value NaN (Not a Number) and were then manipulated according to the technique applied. Different techniques approach the missing cells differently. Deletion techniques drop any case with missing data on the variable under study, so the sample size becomes smaller. Imputation, on the other hand, fills the NaN cells with values predicted by a variety of techniques. Reweighting alters the data resulting from listwise deletion by assigning weights to the data; the alteration is based on minimizing the squared residuals.
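The data-set construction described above can be sketched as follows. This is illustrative Python/NumPy rather than the project's MATLAB code (see Appendices A–D); the 10% missing rate and the unit mean and variance are assumed values for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed state, as in the report

n_cases, n_vars = 500, 3                # 500 cases, three variables
data = rng.normal(loc=0.0, scale=1.0, size=(n_cases, n_vars))

# The first 250 cases are Class 1, the remainder Class 2.
labels = np.repeat([1, 2], n_cases // 2)

# Knock out roughly 10% of the cells completely at random (MCAR)
# by overwriting them with NaN, as described in the text.
missing = rng.random(data.shape) < 0.10
data_missing = data.copy()
data_missing[missing] = np.nan
```

Because the NaN positions are drawn independently of class, the two classes generally end up with slightly different amounts of missing data, exactly as noted above.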
3.2 MISSING DATA TECHNIQUES
The way missing data are handled determines the quality of our analysis and whether it leads to the desired solution. If the prediction made is wrong or biased, we are prone to trouble, especially in cases like analyzing health data for patients at a hospital, where a wrong prediction might damage reputations and endanger patients. In such critical areas, missing data must therefore be handled with care so that the prediction is as close as possible to the expected solution.
There are several ways to handle missing data. We can opt either to remove the affected cases completely from the data set, or to impute values into the missing holes to obtain a complete data set; both options are known as complete-case analysis. In contrast to complete-case analysis, it is also possible to analyze data with missing values directly, without completing the data set. Incomplete-case analysis is a model-based approach, while complete-case analysis is sampling-based.
The effectiveness of a technique is determined by how far its outcome deviates from the complete-case data: the percentage deviation from the complete-case mean is used as a performance indicator. When applying imputation, if the mean after imputation is close to the complete-case mean, we can expect the imputed values to be close to the actual values, and the classification result using the imputed data set should be similar to that of the complete-data analysis. The same inference applies to the other methods as well. The closer the resulting data are to the complete case, the lower the likelihood that the classification will be misleading.
3.2.1 DELETION
Deletion, or data removal, is one technique for achieving a complete data set. It is the easiest to apply and is most commonly the default technique in statistical software packages. Removing data reduces the amount available for analysis; if the sample is small, the data remaining after elimination become smaller still. Although data deletion appears simple and is less time-consuming, insufficient data carry a higher tendency towards inaccurate estimation, especially when the sample size is small. Thus, this is neither a practical nor an intelligent way of handling data sets with missing values.
Deletion can be done in several ways including listwise deletion, pairwise deletion
and casewise deletion.
3.2.1.1 LISTWISE DELETION
Listwise deletion is a method of data removal in which every case having any missing value is dropped from the analysis. This obviously causes a significant amount of information loss if the pattern of 'missingness' is haphazard.
However, this naïve method of data treatment still has an advantage if the missing values occur at random: it does not suffer from biased estimates, as the whole case is dropped rather than 'wild' substitutions being made for the missing holes.
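A minimal sketch of listwise deletion, in illustrative Python/NumPy (the toy array is an assumption): any row containing a NaN is dropped in its entirety.

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, np.nan, 6.0],
                 [7.0, 8.0, 9.0]])

# Listwise deletion: keep only the cases (rows) with no missing value.
complete = data[~np.isnan(data).any(axis=1)]
```

One missing cell costs the whole case, which is why the information loss grows quickly when missingness is scattered haphazardly across variables.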
3.2.1.2 PAIRWISE DELETION
Pairwise deletion works as follows: if a case has missing values for some variables, it is automatically dropped only from calculations involving those variables. The method is so named because the correlation for each pair of variables is calculated using only the cases with complete data for that pair. This leads to different sample sizes for different pairs of variables, since different cases are deleted for each calculation.
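Pairwise deletion can be sketched as follows (illustrative Python/NumPy; the toy vectors are assumptions). The covariance for this pair rests only on the cases where both variables are observed, so each pair may rest on a different sample size.

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([2.0, np.nan, 3.0, 5.0, 6.0])

# Pairwise deletion: use only the cases where BOTH variables of the
# pair are observed.
both = ~np.isnan(x) & ~np.isnan(y)
n_pair = both.sum()
cov_xy = np.cov(x[both], y[both])[0, 1]
```

Here only three of the five cases survive for this particular pair; a different pair of variables could keep a different subset, which is the source of the inconsistency noted above.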
3.2.1.3 CASEWISE DELETION
In casewise deletion, cases having missing values on the variables of interest are excluded entirely from the analysis, so correlations are calculated using only the cases selected for analysis.
It is possible to end up with nothing if the missing values are randomly distributed across all variables. Casewise deletion decreases the amount of data available for calculation, which can make the analysis misleading, especially for a small sample size.
3.2.2 IMPUTATION
Imputation is often more appropriate when analyzing data with missing values. By imputing values into the missing holes, a complete data set becomes available, so the standard statistical tools can be used in the analysis. Imputation, if properly applied, produces a reliable set of data for analysis. The problems we face are how to impute values that mimic the true, missing values, and whether those values truly represent the blanks.
In the earlier years of imputation, the easiest approach was single-value substitution, or single imputation (SI). In 1987, Rubin introduced the multiple imputation (MI) method, which substitutes a set of likely values for each hole; MI combines the results of several SIs to obtain the final value to be imputed into the missing cell.
There are a few commonly used ways to obtain the substitutions, namely mean imputation, regression imputation, cold deck imputation, hot deck imputation and similar response pattern imputation (SRPI). Each of these techniques is discussed next.
Imputation in general suffers from distribution distortion, as the variance is affected. This means that good estimation of individual values does not guarantee good overall estimation of the parameters under study. This can, however, be mitigated by using class mean imputation or multiple regression imputation.
3.2.2.1 SIMPLE MEAN IMPUTATION
Simple mean imputation substitutes each missing value with the mean of the variable of interest. This is not a recommended way to impute values, as it causes the distribution of the variable to shrink and the variance to be wrongly estimated. It is the easiest form of imputation but does not guarantee the correctness of the inferences made.
Mean imputation can be further classified into two kinds: overall mean imputation and class mean imputation. Overall mean imputation substitutes the missing values with the overall data mean, whereas class mean imputation replaces them with the class mean.
Class mean imputation is an improved version of overall mean imputation, aimed at reducing the shrinkage of the data variance. Class mean imputation also promotes an unbiased estimate of the mean.
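The two variants can be contrasted in a short sketch (illustrative Python/NumPy; the toy values and class labels are assumptions):

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan, 10.0, np.nan, 12.0])
labels = np.array([1,   1,   1,      2,    2,      2])

# Overall mean imputation: one mean fills every hole.
overall = values.copy()
overall[np.isnan(overall)] = np.nanmean(values)

# Class mean imputation: each hole gets the mean of its own class,
# which shrinks the distribution less than the overall mean does.
by_class = values.copy()
for c in (1, 2):
    in_c = labels == c
    by_class[in_c & np.isnan(by_class)] = np.nanmean(values[in_c])
```

The Class 1 hole receives the overall mean of 6.25 under the first scheme but the far more plausible class mean of 1.5 under the second, illustrating why class mean imputation distorts the per-class distributions less.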
3.2.2.2 REGRESSION MEAN IMPUTATION
Regression mean imputation is a better way to impute values than simple mean imputation, since it uses information about the joint distribution of the variables of interest. Three kinds of regression can be performed: linear regression, logistic regression and generalized linear regression. Linear regression is suitable for continuous variables, while logistic regression works for binary outcome data. In the case of multiple unordered categories, generalized linear regression is an attractive choice.
The first step in regression imputation is to regress the incomplete variables on an auxiliary variable, that is, a variable for which information is available prior to sampling. In this manner, the information in the joint distribution between the auxiliary variables and the variables of interest is taken into account. A precise regression model leads to unbiased estimates of the mean. Nevertheless, this method can still yield misleading inferences through inaccurate regression coefficients, which result from the small variability of the imputations.
3.2.2.3 LINEAR REGRESSION
Linear regression fits two or more variables to a linear equation that minimizes the sum of squared residuals. The best fit gives the smallest sum of squared residuals, corresponding to a value of r (defined below) closest to 1, meaning that the line fits most of the data. The missing values are then predicted from the line. Consider the case of two variables which satisfy the linear equation:
    Y = a + bX    (3.1)

where Y is the variable of interest, or dependent variable, and X is the auxiliary, or independent, variable. The intercept a of the linear equation is the predicted value of Y at X = 0. From equation (3.1), a can be calculated as:

    a = Ȳ − bX̄    (3.2)

The regression coefficient b determines the association between the variables X and Y. A positive b indicates a positive association between X and Y, that is, increasing X causes an increase in Y; a negative association, on the other hand, implies that an increase in X results in a decrease in Y. When b equals zero there is no linear association between X and Y. (It is the correlation coefficient r, introduced below, that is bounded between −1 and 1.) b is calculated using the formula below:
    b = Σxy / Σx² = Cxy / Sx²    (3.3)

where Cxy is the covariance of X and Y, and Sx² is the variance of X.
Once the linear regression equation has been obtained, we need to analyze how well the data fit the line. For this purpose r, which measures the degree of linear association, is introduced. r is mathematically expressed as:

    r = Sxy / √(Sxx · Syy)    (3.4)

where Sxy is the covariance of X and Y, Sxx is the variance of X and Syy is the variance of Y. A perfect linear fit occurs when |r| = 1, whereas r = 0 means there is no linear relationship between the dependent and independent variables.
The idea of linear regression can be extended to multiple linear regression by introducing more independent variables to be regressed against the dependent variable, further reducing the sum of squared residuals. The best model is determined using all the regressions performed.
To construct a multiple linear regression, equation (3.1) is expanded to include more independent variables:

    Y = a + b1·X1 + b2·X2 + … + bn·Xn    (3.5)

The coefficients b1, b2, …, bn are calculated similarly to the single coefficient b shown earlier.
Figures 3.1 and 3.2 illustrate linear regression and multiple linear regression respectively.

Figure 3.1 Linear Regression

Figure 3.2 Multiple Linear Regressions
The linear regression discussed above has a limitation: it can only be used to analyze a single dependent variable. To overcome this limitation, generalized linear regression can be used; its details are addressed in Section 3.2.2.5.
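Equations (3.1)–(3.3) translate directly into a small imputation sketch (illustrative Python/NumPy; the toy data are assumptions): fit a and b on the observed pairs, then predict the hole from the line.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # auxiliary variable (complete)
y = np.array([2.1, 3.9, np.nan, 8.2, 9.9])     # variable of interest, one hole

obs = ~np.isnan(y)
# b = Cxy / Sx^2 as in (3.3), then a = Ybar - b*Xbar as in (3.2).
b = np.cov(x[obs], y[obs])[0, 1] / np.var(x[obs], ddof=1)
a = y[obs].mean() - b * x[obs].mean()

# Predict the missing value from the fitted line (3.1).
y_imputed = y.copy()
y_imputed[~obs] = a + b * x[~obs]
```

The hole at x = 3 is filled with a + 3b, which here lands at the mean of the observed y values because the toy data are almost perfectly linear.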
3.2.2.4 LOGISTIC REGRESSION
Instead of linear regression, logistic regression is used for data with dichotomous outcomes. Logistic regression gives better flexibility than linear regression, as it exhibits the S shape of the sigmoid function:

    p = exp(Y) / (1 + exp(Y))    (3.6)

The p generated from equation (3.6) is the probability of the predicted value of the variable and hence takes a value between 0 and 1. Substituting equation (3.5) into (3.6) gives:
    p = exp(a + b1·X1 + … + bn·Xn) / (1 + exp(a + b1·X1 + … + bn·Xn))    (3.7)

Again, the values of b1, b2, …, bn are calculated in the same way as for linear regression. From the logistic regression model obtained, one can predict values to fill the missing holes. Since logistic regression works for binary, or dichotomous, outcomes, the predicted value is the probability that the dependent variable Y equals 1. Figure 3.3 illustrates the logistic regression model.
Figure 3.3 Logistic Regression
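The prediction step can be sketched as follows (illustrative Python/NumPy; the coefficients a, b1, b2 are hypothetical, since the report gives no fitted values). The linear predictor (3.5) is pushed through the sigmoid (3.6)/(3.7) to give a probability between 0 and 1.

```python
import numpy as np

def sigmoid(y):
    # Equation (3.6): p = exp(Y) / (1 + exp(Y)), rewritten in the
    # numerically stabler equivalent form 1 / (1 + exp(-Y)).
    return 1.0 / (1.0 + np.exp(-y))

# Hypothetical fitted coefficients for two predictors.
a, b = -1.0, np.array([0.8, 0.5])
x = np.array([2.0, 1.0])

# Probability that the dichotomous outcome equals 1, as in (3.7).
p = sigmoid(a + b @ x)
```

Whatever the linear predictor evaluates to, the sigmoid maps it into (0, 1), which is what allows the output to be read as a probability.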
3.2.2.5 GENERALIZED LINEAR REGRESSION
The generalized linear model approach seeks to overcome the limitation of multiple linear regression by allowing several dependent variables to be analyzed in parallel. The procedure for producing the regression coefficients is similar to that discussed in the linear regression section; the only difference is that the dependent variable Y is now a matrix instead of a vector, and the regression coefficients likewise form a matrix.
This added ability, compared with multiple linear regression, enables the combination or transformation of multiple linear dependent variables. Whereas multiple linear regression is a univariate method, generalized linear regression is a multivariate method that uses the correlation information of the dependent variables.
3.2.2.6 HOT DECK IMPUTATION
In contrast to mean imputation, hot deck imputation does not distort the distribution of the sample, because different observed values, rather than the population mean, are substituted into the missing holes. The idea of the hot deck method is to draw a prediction from the current observed set of values (the donor) of the same variable that is most similar to the case with missing data (the client). The values to be imputed are drawn from the donor using methods such as nearest-neighbour imputation, similar response pattern imputation (SRPI), longitudinal imputation and cross-sectional imputation. Each update of an empty cell also continuously updates the sample.
The following procedure describes the steps for applying hot deck imputation:
1. Determine the initial deck or donor to be used.
2. Select sample cases.
3. Categorize sample into subclasses.
4. Amend the hot deck values to reflect subclasses by choosing records with complete
observations.
5. Substitute missing values with chosen values from donor and update the hot deck
value.
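A nearest-neighbour variant of the procedure above can be sketched as follows (illustrative Python/NumPy; the donor pool and client case are assumptions): the client's hole is filled by copying the corresponding value from the most similar fully observed donor.

```python
import numpy as np

# Donor pool: fully observed cases.  Client: a case missing variable 2.
donors = np.array([[1.0, 2.0, 3.0],
                   [1.1, 2.2, 2.9],
                   [5.0, 6.0, 7.0]])
client = np.array([1.05, np.nan, 3.0])

# Nearest-neighbour hot deck: find the donor closest to the client on
# the observed variables, then copy the donor's value into the hole.
obs = ~np.isnan(client)
dist = np.linalg.norm(donors[:, obs] - client[obs], axis=1)
nearest = donors[np.argmin(dist)]

client_filled = client.copy()
client_filled[~obs] = nearest[~obs]
```

Because the imputed value is a genuinely observed one rather than a mean, the sample distribution is distorted far less than under mean imputation.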
3.2.2.7 COLD DECK IMPUTATION
Cold deck imputation, on the other hand, uses previous data from similar cases or surveys to obtain the values for imputation; an example is using the previous year's data to fill the missing gaps. Since the deck data come from history, the deck is not updated each time a missing value is filled, so the deck values are always static. All the procedures described for hot deck imputation apply except the fifth.
3.2.3 REWEIGHTING
The essence of reweighting is similar to that of regression: to produce a data set that minimizes the sum of squared residuals. The introduction of a weight column in the data set helps to achieve this aim.
The rule of the weights introduced is to attenuate high-value residuals while keeping the lower residuals unchanged. Once reweighting has been completed, the original data set is regressed against the new set, applying the same regression procedure as before.
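The report does not specify the exact weight rule, so the sketch below uses one possible choice (a Huber-style cut-off, which is an assumption) purely to illustrate the principle of attenuating large residuals while leaving small ones unchanged (illustrative Python/NumPy):

```python
import numpy as np

# Residuals from an initial (listwise-deletion) regression fit.
residuals = np.array([0.1, -0.2, 3.0, 0.05, -4.0])

# One possible weight rule (an assumption, not the report's scheme):
# leave small residuals alone, attenuate large ones.
c = 1.0
w = np.where(np.abs(residuals) <= c, 1.0, c / np.abs(residuals))

# The weights feed a weighted least-squares refit, which minimises
# sum(w * r**2) instead of the plain sum of squared residuals.
weighted_sse = np.sum(w * residuals ** 2)
plain_sse = np.sum(residuals ** 2)
```

The two large residuals are down-weighted to 1/3 and 1/4 of their full influence, so they no longer dominate the refit the way they dominate the plain sum of squares.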
3.2.4 FULL INFORMATION MAXIMUM LIKELIHOOD (FIML)
All the methods described so far share one goal: to obtain a complete set of data for analysis. The difference lies only in the approach: Section 3.2.1 opts for deleting cases with missing values, whereas the methods of Section 3.2.2 impute values to fill the gaps.
The FIML approach instead maximizes the likelihood function of a model given the observed data, under the assumption that the data distribution is multivariate normal. An attractive advantage of FIML is that it does not lead to biased estimation regardless of the number of missing values; in return, however, it demands a relatively large amount of data. Apart from this drawback, FIML is impressive in that it still accepts data sets that do not fully exhibit the criteria of a multivariate normal distribution.
3.2.5 MULTIPLE IMPUTATIONS
The idea of multiple imputation comes from the fact that there are several possible values that one could impute. By combining all the imputations performed into the final value, we account for the uncertainty present in the prediction of the missing values. Typically, multiple imputation repeats a single imputation between two and five times to produce the completed data sets; on completion, the best inferences for the missing values can be drawn.
Some of the methods introduced earlier, such as mean imputation and regression imputation, may be performed within multiple imputation. However, SRPI and multiple imputation might not be an appropriate combination, as SRPI would draw the same value in each imputation.
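A minimal sketch of the idea (illustrative Python/NumPy; the draw rule of "observed mean plus noise at the observed standard deviation" and m = 5 are assumptions): impute the hole m times, analyse each completed data set, then pool the m estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
obs = y[~np.isnan(y)]

# Multiple imputation: draw m plausible values for the hole, analyse
# each completed data set, then pool the per-set estimates.
m = 5
pooled = []
for _ in range(m):
    y_i = y.copy()
    y_i[np.isnan(y_i)] = obs.mean() + rng.normal(0.0, obs.std(ddof=1))
    pooled.append(y_i.mean())

mi_estimate = np.mean(pooled)
```

Unlike a single imputation, the spread of the m per-set estimates also carries the between-imputation uncertainty, which is exactly what SRPI (always drawing the same value) would fail to provide.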
4 METHOD OF APPROACH
This project concentrated mainly on continuous data analysis. Throughout the project, a few imputation techniques and reweighting were evaluated to demonstrate how well they perform in the presence of different percentages of missing data.
The analysis is either a one-stage or a two-stage process: a one-stage process involves only the imputation procedure, while a two-stage process has an additional preprocessing stage. The analyses of overall mean imputation and class mean imputation are one-stage processes, while regression imputation and reweighting are two-stage. Each method was tested with different values of the data variance to observe its degree of robustness, and the missing-data percentage was varied for the same purpose.
Prior to a thorough discussion of the methods, Sections 4.1, 4.2 and 4.3 cover the elementary knowledge used in the development of this project: Bayesian Decision Theory, Multivariate Density and the Discriminant Function.
4.1 BAYESIAN DECISION THEORY
Bayesian decision theory is one of the methods applied to deal with pattern classification problems. It makes use of Bayes' theorem to arrive at a decision. The essence of Bayes' theorem is:
'Given prior knowledge that an event will occur and the likelihood of the event occurring, the conditional probability of the event occurring given a set of data can be predicted.'
Mathematically, the expression is:

P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}, \qquad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} \qquad (4.1)
Consider a simple example: predicting the absence of a student from a class on a particular day. It is known that the student will be absent if it is raining, with probability a, and the probability that it is raining on that day is b. With this information, we have to predict the probability that the student will be absent on that day. In this case, we have to make a decision between two possibilities: absent or present. The decision will be 'absent' on that day if the posterior of absence satisfies P(absent | day) > P(present | day). Otherwise, the decision will be 'present'.
It is very likely that the decision made will sometimes be wrong. This occurs when:

P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}

Such error has to be minimized in order to avoid wrong predictions. To minimize it, we decide \omega_1 when P(\omega_1 \mid x) > P(\omega_2 \mid x), and vice versa. The error probability hence becomes:

P(\text{error} \mid x) = \min[P(\omega_1 \mid x),\, P(\omega_2 \mid x)] \qquad (4.2)
Since prior knowledge of the data set is needed to apply Bayesian decision theory, equal prior probabilities of 0.5 have been used in the analysis.
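The two-class decision rule with equal priors of 0.5 can be sketched as follows (the likelihood values are illustrative assumptions):

```python
def posterior(likelihoods, priors):
    """Bayes' theorem of equation (4.1): posterior = likelihood * prior / evidence,
    where the evidence p(x) normalises over all classes."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Two classes with equal priors of 0.5, as used in this project.
post = posterior([0.3, 0.1], [0.5, 0.5])
decision = 1 if post[0] > post[1] else 2   # decide w1 when P(w1|x) > P(w2|x)
```

Because the evidence is the same for both classes, only the products likelihood × prior need to be compared when making the decision.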
4.2 MULTIVARIATE DENSITY
As mentioned earlier, throughout this project the data are assumed to be random multivariate Gaussian with dimension d. For simplicity, d = 3 is used in the analysis. The multivariate normal density can thus be expressed as:

p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)^t \Sigma^{-1} (x - \mu)\right] \qquad (4.3)

where x is the d-component column vector, \mu is the d-component mean vector, \Sigma is the d-by-d covariance matrix, |\Sigma| is the determinant of the covariance matrix and \Sigma^{-1} is its inverse.
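A minimal sketch of evaluating equation (4.3), assuming NumPy is available (the zero mean and identity covariance are purely illustrative):

```python
import numpy as np

def mvn_density(x, mu, sigma):
    """Multivariate normal density of equation (4.3)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / (((2 * np.pi) ** (d / 2)) * np.sqrt(np.linalg.det(sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

# d = 3 as in the analysis; zero mean and identity covariance for illustration.
p_peak = mvn_density(np.zeros(3), np.zeros(3), np.eye(3))   # (2*pi)**(-3/2)
```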
4.3 DISCRIMINANT FUNCTION
The discriminant function is one way to perform classification. For a dichotomizer case, we can easily classify the data with a single discriminant function g(x), where g(x) = g_1(x) - g_2(x). Using the discriminant function, the decision is \omega_1 if g_1 > g_2, and vice versa.
g_i(x) can be defined as P(\omega_i \mid x), that is, the probability of class \omega_i given x. Using Bayes' theorem,

g_i(x) = P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{\sum_j p(x \mid \omega_j)\, P(\omega_j)}.

We can now represent
g(x) as g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x). Since a monotonic transformation does not change the sign of the decision, an equivalent discriminant is obtained by taking natural logarithms of the ratio of the two posteriors:

g(x) = \ln \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)} \qquad (4.4)
This equation will be used to determine the class of a test value. The test is done on both the complete and the altered data so that the outcomes can be compared. Substituting the normal density of equation (4.3) into equation (4.4), the per-class discriminant becomes:

g_i(x) = -\frac{1}{2} x^t \Sigma_i^{-1} x + \mu_i^t \Sigma_i^{-1} x - \frac{1}{2} \mu_i^t \Sigma_i^{-1} \mu_i - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i) \qquad (4.5)

Equation (4.5) is used as the discriminant function in the analysis. A test vector x of dimension d is used to verify that the classification with predicted values in place of the missing holes matches the classification obtained with the complete data. Given a test vector x, g_1(x) > g_2(x) means the test value belongs to class 1, and vice versa.
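A sketch of classifying a test vector with equation (4.5), assuming NumPy; the class means, covariances and test vector are illustrative, not the project's data:

```python
import numpy as np

def g(x, mu, sigma, prior):
    """Per-class quadratic discriminant of equation (4.5)."""
    inv = np.linalg.inv(sigma)
    return (-0.5 * x @ inv @ x + mu @ inv @ x - 0.5 * mu @ inv @ mu
            - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior))

# Equal priors of 0.5; two well-separated classes for illustration.
mu1, mu2 = np.array([0.3, 0.3, 0.3]), np.array([3.0, 3.0, 3.0])
s1 = s2 = np.eye(3)
x = np.array([0.4, 0.2, 0.3])
label = 1 if g(x, mu1, s1, 0.5) > g(x, mu2, s2, 0.5) else 2
```

With x close to the class 1 mean, g_1(x) > g_2(x) and the test vector is assigned to class 1.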
4.4 HYPOTHESIS TEST
Hypothesis testing is an essential tool in statistical analysis. It can be used to compare the means of two distributions. In order to begin hypothesis testing, a hypothesis has to be defined.
The null hypothesis is the statement that is assumed to hold unless the data provide enough evidence to reject it. In this case, the null hypothesis is that the new distribution has the same mean as the distribution of the complete data set.
The ttest2 command in MATLAB is used to carry out the hypothesis testing in this project. The test returns a reject/accept decision h, a significance level (p-value) and a confidence interval, which help us in making inferences.
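In Python, an analogue of MATLAB's ttest2 is scipy.stats.ttest_ind; a sketch with synthetic stand-in samples (the shift of 5 is an illustrative assumption chosen to force rejection):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
complete = rng.normal(1.5, 1.0, 200)   # stand-in for a complete-data variable
shifted = complete + 5.0               # a sample whose mean clearly differs

# Two-sample t-test: the null hypothesis is that the samples have equal means.
t_stat, p_value = stats.ttest_ind(complete, shifted)
h = 1 if p_value < 0.05 else 0         # h = 1: reject equal means at the 5% level
```

A small p-value rejects the hypothesis of equal means, mirroring the h, significance-level and confidence-interval outputs reported in the result tables.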
4.5 REWEIGHTING
Reweighting was the main method of interest in this project; the other methods used in the simulations served for comparison. The major part of the work done for this project relates to this method. Even though its effectiveness has not been well proven, its approach is attractive: if well developed, it might become a useful technique for analyzing missing data.
Reweighting comprises the production of a new data set followed by a regression. As it is a two-stage process, preprocessing takes place first to remove NaNs from the data set, which means that the analysis is performed only on the observed data. With the first stage completed, a set of rules is then established to define the weight assignments for the data set.
The purpose of the weight assignments is to weight the data while keeping the distribution very similar to the original. With those weights, the sum of squared residuals is reduced, indicating that the data are closer to the true values.
A residual is defined as the deviation of an observed value from the mean of the data. The total residual over all observed samples should be equal to zero, since the numbers were generated randomly with a specified mean and variance. A minimum sum of squared errors guarantees a good fit to the mean line.
Residuals with significant values were penalized by assigning lower weights, while the lower residuals were kept unchanged, that is, given a weight of one. The rules used in this analysis were the following:
• Weight = 0 for missing data and for 0.8 < ratio of residual ≤ 1.0
• Weight = 0.25 for 0.6 < ratio of residual ≤ 0.8
• Weight = 0.5 for 0.4 < ratio of residual ≤ 0.6
• Weight = 0.75 for 0.2 < ratio of residual ≤ 0.4
• Weight = 1 for 0 ≤ ratio of residual ≤ 0.2
The ratio of residual is simply the ratio of the residual to the standard deviation. There was no specific calculation behind the selection of the weights above. The most important guidelines for the weight selection were that applying the weights must minimize the sum of squared residuals and that the distribution must remain unchanged. It is believed that a more thorough weight computation would show better performance.
Multiplying each sample's squared residual by its weight gives a new set of squared residuals. Taking the square root of each gives a new set of residuals, from which a new set of data is calculated. Next, the preprocessed data are regressed against this new data set, producing another new data set, which is the result of the reweighting and regression manipulations. Subsequently, the analysis continues by computing the means and covariance matrices for each class.
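The weighting stage above can be sketched as follows. This is a minimal illustration that applies the weight bands to the residuals of the observed data and rebuilds each point from its weighted residual; it assumes that ratios above 1.0 fall into the zero-weight band (the rules in the text stop at 1.0), and it omits the final regression stage:

```python
import statistics

def reweight(sample):
    """Apply the weight bands to the residuals of the observed data and
    rebuild each point from its weighted residual."""
    observed = [x for x in sample if x is not None]   # preprocessing: drop NaNs
    mu = statistics.mean(observed)
    sd = statistics.stdev(observed)
    new_data = []
    for x in observed:
        r = x - mu                        # residual
        ratio = min(abs(r) / sd, 1.0)     # ratio of residual to std. deviation
        if ratio > 0.8:
            w = 0.0
        elif ratio > 0.6:
            w = 0.25
        elif ratio > 0.4:
            w = 0.5
        elif ratio > 0.2:
            w = 0.75
        else:
            w = 1.0
        # weighted squared residual -> new residual (sign restored) -> new point
        new_r = (w * r * r) ** 0.5 * (1.0 if r >= 0 else -1.0)
        new_data.append(mu + new_r)
    return new_data

weighted = reweight([1.0, 2.0, None, 3.0, 2.5, 1.5])
```

By construction, the sum of squared residuals of the weighted data never exceeds that of the observed data, which is the property the weight rules are meant to guarantee.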
The techniques' performance was assessed by adding a test value to verify any misclassification. Because the code prompts the user to enter test values, testing additional values during the assessment involved little hassle.
5 RELATED WORKS
The reweighting method has appeared in several publications, for example by Karla Nobrega and David Haziza. The idea they propose is basically similar to the idea of reweighting presented here. The concepts are:
1. Only consider the observed data.
2. Adjust the sampling weights of the observed data to compensate for the deleted cases.
The weight assignment presented in this project is simple guesswork, whereas the assignment suggested by Nobrega and Haziza is more principled. In their account, to begin the reweighting procedure a model first needs to be estimated to reflect the unobserved data. The weight is then computed using the following formula:
w_i^* = \frac{w_i}{\hat{p}_i}

Hence the estimated population total \hat{Y}^* will be:

\hat{Y}^* = \sum_{i \in s_r} w_i^* y_i = \sum_{i \in s_r} \frac{w_i}{\hat{p}_i}\, y_i

where \hat{p}_i is the estimated response probability for unit i, w_i is the initial weight of the datum and y_i is the observed value. \hat{p}_i \approx p_i guarantees an approximately unbiased estimate of the population total Y.
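A minimal sketch of this inverse-probability adjustment (the weights, probabilities and observations below are illustrative assumptions):

```python
def ipw_total(y, w, p_hat):
    """Estimated population total: Y* = sum over respondents of (w_i / p_i) * y_i."""
    return sum(wi / pi * yi for yi, wi, pi in zip(y, w, p_hat))

# Three respondents, initial weight 2, estimated response probability 0.8:
# each respondent also represents the similar units that did not respond.
total = ipw_total([10.0, 12.0, 8.0], [2.0, 2.0, 2.0], [0.8, 0.8, 0.8])   # 75.0
```

Dividing each weight by the response probability inflates the respondents' contributions so that they stand in for the deleted cases, rather than creating artificial data.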
In the case of the reweighting method presented in this project, as stated earlier, no calculation is done to obtain an appropriate weight. With such a simple, intuitive assignment, expectations of its performance should not be high.
Another major difference is that the reweighting in this project also applies the weights to the residuals of the observed data. The weighted residuals are then used to predict a new set of data, which serves as the regression model from which the regression coefficients are found. The original observed data are finally regressed with the weighted new data.
As far as reweighting is concerned, the techniques introduced by Nobrega and Haziza are among the established methods of reweighting. They agree that the reweighting method has an advantage over methods such as imputation in the sense that it does not
promote the creation of artificial data. Besides its simplicity, the availability of a number of software packages to compute the estimates reduces the time spent estimating the population model.
However, reweighting cannot escape the curse of dimensionality. If a large number of variables is present and the missing-data mechanism is item nonresponse, a large number of adjusted weights has to be worked out.
Another account of reweighting was found in the Newborn Lung Project at the University of Wisconsin, USA. The method presented was similar to Nobrega's and Haziza's: the weight is simply the inverse of the probability of observing the data. In that project, the performance of reweighting is measured by calculating the deviation of the reweighted data from the complete-data case. It was found that the reweighted data did not perform well in terms of the mean deviation, but did in terms of the variance deviation. Overall, reweighting was not the method of choice in that study.
6 RESULTS
6.1 Overall Mean Imputation
The performance of overall mean imputation in the presence of 20% missing values is presented in Table 6.1. The performance is measured by how much the mean deviates from the actual complete-data case for varying values of the variance.
From the table, it can be concluded that in the presence of 20% missing data, the mean of the imputed data deviated by between 2.5% and 6.5% compared with the complete case. This is considered an acceptable range, as no misclassification occurred.
Increasing the missing values to 40%, however, produced poor performance, as indicated in Table 6.2. From that table, it can be concluded that the deviation from the complete case was too severe, leading to misclassification at all values of variance tested. The percentage deviation of the mean with 40% missing ranged from 12% to 24%.
At this point, we should expect that with a higher percentage of missing data the performance of the overall mean imputation method will be poor as well (refer to Table 6.3).
From the above, overall mean imputation generally does not guarantee reliable imputed values. This is critical especially when the data are not well separated, that is, when the means of class 1 and class 2 are close to each other. As mentioned earlier, the resulting distribution shrinks because the mean value is used to replace the missing ones; the more mean substitution takes place, the more severe the variance shrinkage.
The last three rows of the table show the result of the hypothesis test between the new data and the original data. A value of 1 for h indicates that the null hypothesis is rejected at the stated significance level. When h = 0, a test value chosen within the specified confidence interval should give the same analysis result for the complete case as for the data imputed with the overall mean technique.
6.2 Class Mean Imputation
As discussed in the previous chapter, class mean imputation is a better option for mean imputation. Even though it still results in a decrease in variance, the decrease is much less severe because the class membership is taken into account. This is indicated by the percentage error in the class means in Tables 6.4, 6.5 and 6.6.
From those tables, it can be concluded that the deviation between the data with imputed values and the complete-case data lies in the range of 0.5% to 3.5%. This indicates that the method produces reliable predictions to fill in the missing gaps. However, misclassification still occurred at missing percentages of 40% and 50%. This would not happen if class 1 and class 2 were well discriminated.
6.3 Linear Regression
The performance of the linear regression method is shown in Tables 6.7 to 6.9. It shows no misclassification even in the presence of a large amount of missing data.
In terms of the percentage of mean deviation, for 20% missing data the deviation ranged from 0.5% to 10%. For 40% missing data the deviation lay between 10% and 18%, while for 50% missing data the interval was between 19% and 32%. This is quite an interesting finding: it shows that the percentage of mean deviation is not a good indicator of a missing-data technique's performance. Even when the percentage deviation is considerable, misclassification does not occur.
6.4 Reweighting
The results of the evaluation of the reweighting method are illustrated in Tables 6.10 to 6.12. They strengthen the conclusion drawn for linear regression that the percentage of mean deviation does not serve as a measure of a technique's performance.
The situation for reweighting is far more complex to understand. In Tables 6.11 and 6.12, the misclassification observed at low variance contradicts the expectation mentioned earlier. It is noticeable that at lower variance the percentage of mean deviation is high, and that it reduces significantly when the variance of the data is increased. Theoretically, misclassification should tend to occur at higher variance. At this stage, this must be left as an unsolved puzzle.
With 20% missing data, the percentage deviation varies between 0.5% and 9% for data variances greater than 1. At those variances there is also only a very slight increase in the percentage deviation when the missing data rise to 40% and 50%. A percentage between 25% and 58% is observed for a variance equal to 1 at all of 20%, 40% and 50% missing data.
As the tables of results show, this method does not seem to be the best. However, given more technical care in the weight assignments, it could perform at least on a par with linear regression.
6.5 Further works on Reweighting
Given the unexpected findings on reweighting discussed above, it should be beneficial to work on it further. One obvious item to rework is the weight assignment. Since in this project the weights were assigned intuitively, the method can be improved by introducing a more technical and systematic weight calculation; with such an approach, the results might turn out to be more impressive. A thorough study of the statistics literature might help here.
Another area that has not been covered is the performance of this method on binary data. This would be a critical area to evaluate, as it would show how robust the reweighting method is to different types of data.
METHOD: OVERALL MEAN IMPUTATION

| | σ=1 V1 | σ=1 V2 | σ=1 V3 | σ=5 V1 | σ=5 V2 | σ=5 V3 | σ=10 V1 | σ=10 V2 | σ=10 V3 | σ=15 V1 | σ=15 V2 | σ=15 V3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Complete data class 1: mean | 0.31452 | 0.30575 | 0.326 | 1.5726 | 1.5288 | 1.63 | 3.1452 | 3.0575 | 3.26 | 4.7178 | 4.5863 | 4.89 |
| Complete data class 1: std. dev. | 0.2699 | 0.26247 | 0.25607 | 1.4571 | 1.4757 | 1.3957 | 2.9142 | 2.9514 | 2.7915 | 4.3714 | 4.4271 | 4.1872 |
| Complete data class 2: mean | 0.3247 | 0.30282 | 0.33435 | 1.6235 | 1.5141 | 1.6717 | 3.247 | 3.0282 | 3.3435 | 4.8705 | 4.5423 | 5.0152 |
| Complete data class 2: std. dev. | 0.2699 | 0.26247 | 0.25607 | 1.4194 | 1.36 | 1.4975 | 2.8388 | 2.72 | 2.9951 | 4.2582 | 4.0799 | 4.4926 |
| Test value | 0.3 | 0.3 | 0.3 | 1.5 | 1.5 | 1.5 | 3 | 3 | 3 | 4.5 | 4.5 | 4.5 |
| Imputed data class 1: mean | 0.3014 | 0.29114 | 0.32762 | 1.507 | 1.4557 | 1.6381 | 3.014 | 2.9114 | 3.2762 | 4.521 | 4.367 | 4.9143 |
| Imputed data class 1: std. dev. | 0.26053 | 0.26533 | 0.25373 | 1.3026 | 1.3267 | 1.2686 | 2.6053 | 2.6533 | 2.5373 | 3.9079 | 3.98 | 3.8059 |
| Imputed data class 1: % mean error | 4.17% | 4.78% | 0.50% | 4.17% | 4.78% | 0.50% | 4.17% | 4.78% | 0.50% | 4.17% | 4.78% | 0.50% |
| Imputed data class 2: mean | 0.3124 | 0.29128 | 0.31772 | 1.562 | 1.4564 | 1.5886 | 3.124 | 2.9128 | 3.1772 | 4.6861 | 4.3693 | 4.7659 |
| Imputed data class 2: std. dev. | 0.2579 | 0.25414 | 0.25699 | 1.2895 | 1.2707 | 1.2849 | 2.579 | 2.5414 | 2.5699 | 3.8686 | 3.8122 | 3.8548 |
| Imputed data class 2: % mean error | 3.79% | 3.81% | 4.97% | 3.79% | 3.81% | 4.97% | 3.79% | 3.81% | 4.97% | 3.79% | 3.81% | 4.97% |
| h | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| p-value | 0.769 | 0.4428 | 0.6646 | 0.769 | 0.4428 | 0.6646 | 0.769 | 0.4428 | 0.6646 | 0.769 | 0.4438 | 0.6466 |
| Confidence interval | [-0.0411, Inf] | [-0.0465, 0.0203] | [-0.0415, 0.0265] | [-0.2057, Inf] | [-0.2324, 0.1017] | [-0.2073, 0.1323] | [-0.4660, 0.2118] | [-0.4649, 0.2034] | [-0.4146, 0.2645] | [-0.6171, Inf] | [-0.6973, 0.3051] | [-0.6219, 0.3968] |

Classification with complete data: C2 at all σ; classification with imputed data: C2 at all σ.
Null hypothesis (tested by the h, p-value and confidence interval rows): the mean-imputed distribution is equal to the complete-case distribution.

Table 6.1 Missing data rate 20% with Overall Mean Imputation Method
METHOD: OVERALL MEAN IMPUTATION

| | σ=1 V1 | σ=1 V2 | σ=1 V3 | σ=5 V1 | σ=5 V2 | σ=5 V3 | σ=10 V1 | σ=10 V2 | σ=10 V3 | σ=15 V1 | σ=15 V2 | σ=15 V3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Complete data class 1: mean | 0.31452 | 0.30575 | 0.326 | 1.5726 | 1.5288 | 1.63 | 3.1452 | 3.0575 | 3.26 | 4.7178 | 4.5863 | 4.89 |
| Complete data class 1: std. dev. | 0.2699 | 0.26247 | 0.25607 | 1.4571 | 1.4757 | 1.3957 | 2.9142 | 2.9514 | 2.7915 | 4.3714 | 4.4271 | 4.1872 |
| Complete data class 2: mean | 0.3247 | 0.30282 | 0.33435 | 1.6235 | 1.5141 | 1.6717 | 3.247 | 3.0282 | 3.3435 | 4.8705 | 4.5423 | 5.0152 |
| Complete data class 2: std. dev. | 0.2699 | 0.26247 | 0.25607 | 1.4194 | 1.36 | 1.4975 | 2.8388 | 2.72 | 2.9951 | 4.2582 | 4.0799 | 4.4926 |
| Test value | 0.2 | 0.2 | 0.2 | 1.5 | 1.5 | 1.5 | 3 | 3 | 3 | 4.5 | 4.5 | 4.5 |
| Imputed data class 1: mean | 0.26143 | 0.2527 | 0.26845 | 1.3071 | 1.2635 | 1.3423 | 2.6405 | 2.345 | 2.8482 | 3.9214 | 3.7905 | 4.0268 |
| Imputed data class 1: std. dev. | 0.22707 | 0.23694 | 0.22343 | 1.1353 | 1.1847 | 1.1171 | 2.2692 | 2.2563 | 2.3075 | 3.406 | 3.5541 | 3.3514 |
| Imputed data class 1: % mean error | 16.88% | 17.35% | 17.65% | 16.88% | 17.35% | 17.65% | 16.05% | 23.30% | 12.63% | 16.88% | 17.35% | 17.65% |
| Imputed data class 2: mean | 0.28286 | 0.25514 | 0.26502 | 1.4143 | 1.2757 | 1.3251 | 2.79 | 2.4657 | 2.7664 | 4.2429 | 3.8271 | 3.9752 |
| Imputed data class 2: std. dev. | 0.23593 | 0.22891 | 0.23154 | 1.1797 | 1.1445 | 1.1577 | 2.3168 | 2.2204 | 2.3343 | 3.539 | 3.4336 | 3.4732 |
| Imputed data class 2: % mean error | 12.89% | 15.75% | 20.74% | 12.89% | 15.75% | 20.73% | 14.07% | 18.58% | 17.26% | 12.89% | 15.75% | 20.74% |
| h | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| p-value | 0.0046 | 0.0021 | 1.53E-04 | 0.0046 | 0.0021 | 1.53E-04 | 0.0046 | 0.0021 | 1.53E-04 | 0.0046 | 0.0021 | 1.53E-04 |
| Confidence interval | [-0.4472, 0.4641] | [-0.4082, 0.4652] | [-0.3011, 0.6170] | [-0.3996, -0.0733] | [-0.4110, -0.0912] | [-0.4775, -0.1524] | [-0.7992, -0.1466] | [-0.8220, 0.1824] | [-0.9551, -0.3048] | [-1.1989, -0.2198] | [-1.2330, -0.2376] | [-1.4326, -0.4571] |

Classification with complete data: C2 at all σ; classification with imputed data: C1 at all σ (misclassification).
Null hypothesis (tested by the h, p-value and confidence interval rows): the mean-imputed distribution is equal to the complete-case distribution.

Table 6.2 Missing data rate at 40% with Overall Mean Imputation Method
METHOD: OVERALL MEAN IMPUTATION

| | σ=1 V1 | σ=1 V2 | σ=1 V3 | σ=5 V1 | σ=5 V2 | σ=5 V3 | σ=10 V1 | σ=10 V2 | σ=10 V3 | σ=15 V1 | σ=15 V2 | σ=15 V3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Complete data class 1: mean | 0.31452 | 0.30575 | 0.326 | 1.5726 | 1.5288 | 1.63 | 3.1452 | 3.0575 | 3.26 | 4.7178 | 4.5863 | 4.89 |
| Complete data class 1: std. dev. | 0.2699 | 0.26247 | 0.25607 | 1.4571 | 1.4757 | 1.3957 | 2.9142 | 2.9514 | 2.7915 | 4.3714 | 4.4271 | 4.1872 |
| Complete data class 2: mean | 0.3247 | 0.30282 | 0.33435 | 1.6235 | 1.5141 | 1.6717 | 3.247 | 3.0282 | 3.3435 | 4.8705 | 4.5423 | 5.0152 |
| Complete data class 2: std. dev. | 0.2699 | 0.26247 | 0.25607 | 1.4194 | 1.36 | 1.4975 | 2.8388 | 2.72 | 2.9951 | 4.2582 | 4.0799 | 4.4926 |
| Test value | 0.3 | 0.3 | 0.3 | 1.5 | 1.5 | 1.5 | 3 | 3 | 3 | 4.5 | 4.5 | 4.5 |
| Imputed data class 1: mean | 0.2342 | 0.21332 | 0.23577 | 1.171 | 1.0666 | 1.1788 | 2.342 | 2.1332 | 2.3577 | 3.5131 | 3.1998 | 3.5365 |
| Imputed data class 1: std. dev. | 0.21487 | 0.21467 | 0.20597 | 1.0743 | 1.0733 | 1.0298 | 2.1487 | 2.1467 | 2.0597 | 3.223 | 3.22 | 3.0895 |
| Imputed data class 1: % mean error | 25.54% | 30.23% | 27.68% | 25.54% | 30.23% | 27.68% | 25.54% | 30.23% | 27.68% | 25.54% | 30.23% | 27.68% |
| Imputed data class 2: mean | 0.2545 | 0.23198 | 0.2408 | 1.2725 | 1.1599 | 1.204 | 2.545 | 2.3198 | 2.408 | 3.8174 | 3.4797 | 3.612 |
| Imputed data class 2: std. dev. | 0.22456 | 0.21503 | 0.21737 | 1.1228 | 1.0752 | 1.0868 | 2.2456 | 2.1503 | 2.1737 | 3.3684 | 3.2255 | 3.2605 |
| Imputed data class 2: % mean error | 21.62% | 23.39% | 27.98% | 21.62% | 23.39% | 27.98% | 21.62% | 23.39% | 27.98% | 21.62% | 23.39% | 27.98% |
| h | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| p-value | 4.47E-06 | 3.48E-07 | 1.78E-08 | 4.47E-06 | 3.48E-07 | 1.78E-08 | 4.47E-06 | 3.48E-07 | 1.78E-08 | 4.47E-06 | 3.48E-07 | 1.78E-08 |
| Confidence interval | [-0.1073, -0.0432] | [-0.1129, -0.0504] | [-0.1236, -0.0601] | [-0.5364, -0.2162] | [-0.5643, -0.2520] | [-0.6182, -0.3007] | [-1.0727, -0.4325] | [-1.1286, -0.5041] | [-1.2364, -0.6014] | [-1.6091, -0.6487] | [-1.6930, -0.7561] | [-1.8546, -0.9021] |

Classification with complete data: C2 at all σ; classification with imputed data: C1 at all σ (misclassification).
Null hypothesis (tested by the h, p-value and confidence interval rows): the mean-imputed distribution is equal to the complete-case distribution.

Table 6.3 Missing data rate at 50% with Overall Mean Imputation
METHOD: CLASS MEAN IMPUTATION

| | σ=1 V1 | σ=1 V2 | σ=1 V3 | σ=5 V1 | σ=5 V2 | σ=5 V3 | σ=10 V1 | σ=10 V2 | σ=10 V3 | σ=15 V1 | σ=15 V2 | σ=15 V3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Complete data class 1: mean | 0.31452 | 0.30575 | 0.326 | 1.5726 | 1.5288 | 1.63 | 3.1452 | 3.0575 | 3.26 | 4.7178 | 4.5863 | 4.89 |
| Complete data class 1: std. dev. | 0.29142 | 0.29514 | 0.27915 | 1.4571 | 1.4757 | 1.3957 | 2.9142 | 2.9514 | 2.7915 | 4.3714 | 4.4271 | 4.1872 |
| Complete data class 2: mean | 0.3247 | 0.30282 | 0.33435 | 1.6235 | 1.5141 | 1.6717 | 3.247 | 3.0282 | 3.3435 | 4.8705 | 4.5423 | 5.0152 |
| Complete data class 2: std. dev. | 0.28388 | 0.272 | 0.29951 | 1.4194 | 1.36 | 1.4975 | 2.8388 | 2.72 | 2.9951 | 4.2582 | 4.0799 | 4.4926 |
| Test value | 0.2 | 0.2 | 0.2 | 1.5 | 1.5 | 1.5 | 3 | 3 | 3 | 4.5 | 4.5 | 4.5 |
| Imputed data class 1: mean | 0.3014 | 0.29114 | 0.32762 | 1.507 | 1.4557 | 1.6381 | 3.014 | 2.9114 | 3.2762 | 3.7382 | 4.0809 | 3.6809 |
| Imputed data class 1: std. dev. | 0.26053 | 0.26533 | 0.25373 | 1.3026 | 1.3267 | 1.2686 | 2.6053 | 2.6533 | 2.5373 | 3.8788 | 3.9744 | 3.62 |
| Imputed data class 1: % mean error | 3.24% | 0.96% | 2.56% | 3.24% | 0.96% | 2.56% | 3.24% | 0.96% | 2.56% | 3.24% | 0.96% | 2.56% |
| Imputed data class 2: mean | 0.3124 | 0.29128 | 0.31772 | 1.562 | 1.4564 | 1.5886 | 3.124 | 2.9128 | 3.1772 | 4.6628 | 4.2808 | 4.7409 |
| Imputed data class 2: std. dev. | 0.2579 | 0.25414 | 0.25699 | 1.2895 | 1.2707 | 1.2849 | 2.579 | 2.5414 | 2.5699 | 4.1036 | 3.7482 | 3.7074 |
| Imputed data class 2: % mean error | 4.17% | 3.81% | 4.97% | 3.79% | 3.81% | 4.97% | 3.79% | 3.81% | 4.97% | 4.26% | 5.76% | 5.47% |
| h | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| p-value | 0.769 | 0.4428 | 0.6646 | 0.769 | 0.4428 | 0.6646 | 0.769 | 0.4428 | 0.6646 | 0.0296 | 0.1201 | 0.09 |
| Confidence interval | [-0.0411, Inf] | [-0.0465, 0.0203] | [-0.0415, 0.0265] | [-0.2057, Inf] | [-0.2324, 0.1017] | [-0.2073, 0.1323] | [-0.4660, 0.2118] | [-0.4649, 0.2034] | [-0.4146, 0.2645] | [-1.1698, -0.0611] | [-1.004, 0.1161] | [-0.9964, 0.0723] |

Classification markers (complete data / imputed data): C1, C2, C2, C2, C2, C2, C1, C1.
Null hypothesis (tested by the h, p-value and confidence interval rows): the mean-imputed distribution is equal to the complete-case distribution.

Table 6.4 Missing data rate at 20% with Class Mean Imputation
METHOD: CLASS MEAN IMPUTATION

| | σ=1 V1 | σ=1 V2 | σ=1 V3 | σ=5 V1 | σ=5 V2 | σ=5 V3 | σ=10 V1 | σ=10 V2 | σ=10 V3 | σ=15 V1 | σ=15 V2 | σ=15 V3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Complete data class 1: mean | 0.31452 | 0.30575 | 0.326 | 1.5726 | 1.5288 | 1.63 | 3.1452 | 3.0575 | 3.26 | 4.7178 | 4.5863 | 4.89 |
| Complete data class 1: std. dev. | 0.29142 | 0.29514 | 0.27915 | 1.4571 | 1.4757 | 1.3957 | 2.9142 | 2.9514 | 2.7915 | 4.3714 | 4.4271 | 4.1872 |
| Complete data class 2: mean | 0.3247 | 0.30282 | 0.33435 | 1.6235 | 1.5141 | 1.6717 | 3.247 | 3.0282 | 3.3435 | 4.8705 | 4.5423 | 5.0152 |
| Complete data class 2: std. dev. | 0.28388 | 0.272 | 0.29951 | 1.4194 | 1.36 | 1.4975 | 2.8388 | 2.72 | 2.9951 | 4.2582 | 4.0799 | 4.4926 |
| Test value | 0.2 | 0.2 | 0.2 | 1.5 | 1.5 | 1.5 | 3 | 3 | 3 | 4.5 | 4.5 | 4.5 |
| Imputed data class 1: mean | 0.30926 | 0.29102 | 0.31783 | 1.5463 | 1.4551 | 1.5892 | 3.0926 | 2.9102 | 3.1783 | 4.6389 | 4.3654 | 4.7675 |
| Imputed data class 1: std. dev. | 0.21706 | 0.21749 | 0.21384 | 1.0853 | 1.0875 | 1.0692 | 2.1706 | 2.1749 | 2.1384 | 3.2558 | 3.2624 | 3.2076 |
| Imputed data class 1: % mean error | 1.67% | 4.82% | 2.51% | 1.67% | 4.82% | 2.50% | 1.67% | 4.82% | 2.51% | 1.67% | 4.82% | 2.51% |
| Imputed data class 2: mean | 0.33823 | 0.30274 | 0.3267 | 1.6911 | 1.5137 | 1.6335 | 3.3823 | 3.0274 | 3.267 | 5.0734 | 4.5411 | 4.9004 |
| Imputed data class 2: std. dev. | 0.24127 | 0.21162 | 0.22037 | 1.2063 | 1.0581 | 1.1019 | 2.4127 | 2.1162 | 2.2037 | 3.619 | 3.1743 | 3.3056 |
| Imputed data class 2: % mean error | 4.17% | 0.03% | 2.29% | 4.16% | 0.03% | 2.29% | 4.17% | 0.03% | 2.29% | 4.17% | 0.03% | 2.29% |
| h | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| p-value | 0.8017 | 0.6416 | 0.6248 | 0.8017 | 0.6416 | 0.6248 | 0.8017 | 0.6416 | 0.6248 | 0.8017 | 0.6416 | 0.6248 |
| Confidence interval | [-0.0282, 0.0364] | [-0.0386, 0.0238] | [-0.0396, 0.0238] | [-0.1408, 0.1821] | [-0.1930, 0.1190] | [-0.1982, 0.1191] | [-0.2816, 0.3642] | [-0.3860, 0.2379] | [-0.3964, 0.2382] | [-0.4224, 0.5464] | [-0.5790, 0.3569] | [-0.5946, 0.3573] |

Classification with complete data: C2 at all σ; classification with imputed data: C1 at all σ (misclassification).
Null hypothesis (tested by the h, p-value and confidence interval rows): the mean-imputed distribution is equal to the complete-case distribution.

Table 6.5 Missing data rate at 40% with Class Mean Imputation
METHOD: CLASS MEAN IMPUTATION

| | σ=1 V1 | σ=1 V2 | σ=1 V3 | σ=5 V1 | σ=5 V2 | σ=5 V3 | σ=10 V1 | σ=10 V2 | σ=10 V3 | σ=15 V1 | σ=15 V2 | σ=15 V3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Complete data class 1: mean | 0.31452 | 0.30575 | 0.326 | 1.5726 | 1.5288 | 1.63 | 3.1452 | 3.0575 | 3.26 | 4.7178 | 4.5863 | 4.89 |
| Complete data class 1: std. dev. | 0.29142 | 0.29514 | 0.27915 | 1.4571 | 1.4757 | 1.3957 | 2.9142 | 2.9514 | 2.7915 | 4.3714 | 4.4271 | 4.1872 |
| Complete data class 2: mean | 0.3247 | 0.30282 | 0.33435 | 1.6235 | 1.5141 | 1.6717 | 3.247 | 3.0282 | 3.3435 | 4.8705 | 4.5423 | 5.0152 |
| Complete data class 2: std. dev. | 0.28388 | 0.272 | 0.29951 | 1.4194 | 1.36 | 1.4975 | 2.8388 | 2.72 | 2.9951 | 4.2582 | 4.0799 | 4.4926 |
| Test value | 0.3 | 0.3 | 0.3 | 1.5 | 1.5 | 1.5 | 3 | 3 | 3 | 4.5 | 4.5 | 4.5 |
| Imputed data class 1: mean | 0.31305 | 0.29513 | 0.31466 | 1.5652 | 1.4756 | 1.5733 | 3.1305 | 2.9513 | 3.1466 | 4.6957 | 4.4269 | 4.7199 |
| Imputed data class 1: std. dev. | 0.20159 | 0.20402 | 0.19295 | 1.008 | 1.0201 | 0.96475 | 2.0159 | 2.0402 | 1.9295 | 3.0239 | 3.0603 | 2.8942 |
| Imputed data class 1: % mean error | 0.47% | 3.47% | 3.48% | 3.24% | 0.96% | 2.56% | 0.47% | 3.47% | 3.48% | 0.47% | 3.48% | 3.48% |
| Imputed data class 2: mean | 0.33216 | 0.30608 | 0.33346 | 1.6608 | 1.5304 | 1.6673 | 3.3216 | 3.0608 | 3.3346 | 4.9824 | 4.5912 | 5.0019 |
| Imputed data class 2: std. dev. | 0.2167 | 0.19987 | 0.21145 | 1.0835 | 0.99934 | 1.0572 | 2.167 | 1.9987 | 2.1145 | 3.2505 | 2.998 | 3.1717 |
| Imputed data class 2: % mean error | 2.30% | 1.08% | 0.27% | 2.30% | 1.08% | 0.26% | 2.30% | 1.08% | 0.27% | 2.30% | 1.08% | 0.27% |
| h | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| p-value | 0.8507 | 0.8131 | 0.6986 | 0.8507 | 0.8131 | 0.6986 | 0.8507 | 0.8131 | 0.6986 | 0.8507 | 0.8131 | 0.6986 |
| Confidence interval | [-0.0282, 0.0342] | [-0.0342, 0.0269] | [-0.0371, 0.0249] | [-0.1410, 0.1710] | [-0.1711, 0.1343] | [-0.1855, 0.1243] | [-0.2821, 0.3420] | [-0.3422, 0.2686] | [-0.3710, 0.2487] | [-0.4231, 0.5129] | [-0.5133, 0.4029] | [-0.5565, 0.3730] |

Classification with complete data: C2 at all σ; classification with imputed data: C1 at all σ (misclassification).
Null hypothesis (tested by the h, p-value and confidence interval rows): the mean-imputed distribution is equal to the complete-case distribution.

Table 6.6 Missing data rate at 50% with Class Mean Imputation
METHOD: LINEAR REGRESSION IMPUTATION

| | σ=1 V1 | σ=1 V2 | σ=1 V3 | σ=5 V1 | σ=5 V2 | σ=5 V3 | σ=10 V1 | σ=10 V2 | σ=10 V3 | σ=15 V1 | σ=15 V2 | σ=15 V3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Complete data class 1: mean | 0.31452 | 0.30575 | 0.326 | 1.5726 | 1.5288 | 1.63 | 3.1452 | 3.0575 | 3.26 | 4.7178 | 4.5863 | 4.89 |
| Complete data class 1: std. dev. | 0.2699 | 0.26247 | 0.25607 | 1.4571 | 1.4757 | 1.3957 | 2.9142 | 2.9514 | 2.7915 | 4.3714 | 4.4271 | 4.1872 |
| Complete data class 2: mean | 0.3247 | 0.30282 | 0.33435 | 1.6235 | 1.5141 | 1.6717 | 3.247 | 3.0282 | 3.3435 | 4.8705 | 4.5423 | 5.0152 |
| Complete data class 2: std. dev. | 0.2699 | 0.26247 | 0.25607 | 1.4194 | 1.36 | 1.4975 | 2.8388 | 2.72 | 2.9951 | 4.2582 | 4.0799 | 4.4926 |
| Test value | 0.3 | 0.3 | 0.3 | 1.5 | 1.5 | 1.5 | 3 | 3 | 3 | 4.5 | 4.5 | 4.5 |
| Imputed data class 1: mean | 0.31791 | 0.30411 | 0.31442 | 1.5895 | 1.5204 | 1.5721 | 3.1791 | 3.0408 | 3.1443 | 4.7686 | 4.561 | 4.7164 |
| Imputed data class 1: std. dev. | 0.26725 | 0.26865 | 0.25623 | 1.3362 | 1.3433 | 1.2811 | 2.6724 | 2.6865 | 2.5622 | 4.0086 | 4.0298 | 3.8433 |
| Imputed data class 1: % mean error | 1.08% | 0.54% | 3.55% | 1.07% | 0.55% | 3.55% | 1.08% | 0.55% | 3.55% | 1.08% | 0.55% | 3.55% |
| Imputed data class 2: mean | 0.33496 | 0.27327 | 0.31317 | 1.675 | 1.3665 | 1.566 | 3.3503 | 2.7332 | 3.1321 | 5.0256 | 4.1 | 4.6983 |
| Imputed data class 2: std. dev. | 0.26557 | 0.23173 | 0.25669 | 1.3278 | 1.1586 | 1.2834 | 2.6554 | 2.3172 | 2.5667 | 3.9831 | 3.4757 | 3.85 |
| Imputed data class 2: % mean error | 3.16% | 9.76% | 6.33% | 3.17% | 9.75% | 6.32% | 3.18% | 9.74% | 6.32% | 3.18% | 9.74% | 6.32% |
| h | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| p-value | 0.2197 | 0.1336 | 0.1381 | 0.3265 | 0.2207 | 0.2117 | 0.4262 | 0.3092 | 0.2836 | 0.5139 | 0.3914 | 0.3493 |
| Confidence interval | [-0.0553, 0.0127] | [-0.0589, 0.0078] | [-0.0606, 0.0084] | [-0.2547, 0.0848] | [-0.2707, 0.0626] | [-0.2815, 0.0625] | [-0.4766, 0.2015] | [-0.5055, 0.1603] | [-0.5311, 0.1557] | [-0.6772, 0.3390] | [-0.7172, 0.2811] | [-0.7600, 0.2690] |

Classification with complete data: C1 at all σ; classification with imputed data: C1 at all σ (no misclassification).
Null hypothesis (tested by the h, p-value and confidence interval rows): the mean-imputed distribution is equal to the complete-case distribution.

Table 6.7 Missing data rate at 20% with linear regression
39
METHOD : LINEAR REGRESSION IMPUTATION (σ = 1, 5, 10, 15; test values 0.3, 1.5, 3.0, 4.5)

Percentage of mean error: 10.53-12.60% for Class 1 and 14.70-18.05% for Class 2.
Null hypothesis: h = 1 in all twelve t-tests, with significance levels between 1.66E-04 and 0.038; every confidence interval excludes zero, so the imputed distribution differs significantly from the complete-case distribution.
Classification: both the complete data and the imputed data are assigned to Class 2 at every σ.

Table 6.8 Missing data rate at 40% with linear regression
METHOD : LINEAR REGRESSION IMPUTATION (σ = 1, 5, 10, 15; test values 0.2, 1.5, 3.0, 4.5)

Percentage of mean error: 19.58-27.79% for Class 1 and 19.41-31.34% for Class 2.
Null hypothesis: h = 1 in all twelve t-tests, with significance levels between 1.69E-08 and 1.50E-04; every confidence interval excludes zero.
Classification: both the complete data and the imputed data are assigned to Class 2 at every σ.

Table 6.9 Missing data rate at 50% with linear regression
METHOD : REWEIGHTING (σ = 1, 5, 10, 15; test values 0.3, 1.5, 3.0, 4.5)

Percentage of mean error: 27.10-57.32% at σ = 1, but only 0.11-8.89% at σ = 5, 10 and 15.
Null hypothesis: h = 1 for all three variables at σ = 1 (significance levels 4.87E-08 to 4.11E-04); h = 0 at σ = 5, 10 and 15 (significance levels 0.62-0.91), where every confidence interval contains zero.
Classification: Class 2 in six of the eight recorded outcomes and Class 1 in the remaining two.

Table 6.10 Missing data rate at 20% with Reweighting
METHOD : REWEIGHTING (σ = 1, 5, 10, 15; test values 0.3, 1.5, 3.0, 4.5)

Percentage of mean error: 28.12-50.25% at σ = 1, but only 0.44-11.94% at σ = 5, 10 and 15.
Null hypothesis: h = 1 for all three variables at σ = 1 (significance levels 3.80E-14 to 1.63E-04); h = 0 at σ = 5, 10 and 15 (significance levels 0.17-0.99), where every confidence interval contains zero.
Classification: Class 2 in seven of the eight recorded outcomes and Class 1 in one.

Table 6.11 Missing data rate at 40% with Reweighting
METHOD : REWEIGHTING (σ = 1, 5, 10, 15; test values 0.3, 1.5, 3.0, 4.5)

Percentage of mean error: 27.10-57.32% at σ = 1, but only 0.36-13.06% at σ = 5, 10 and 15.
Null hypothesis: h = 1 for all three variables at σ = 1 (significance levels 7.57E-11 to 4.06E-04); h = 0 at σ = 5, 10 and 15 (significance levels 0.32-0.85), where every confidence interval contains zero.
Classification: Class 2 in seven of the eight recorded outcomes and Class 1 in one.

Table 6.12 Missing data rate at 50% with Reweighting
Method              Linear Regression   Overall Mean Imp.   Class Mean Imp.    Reweighting
Sigma               1   5   10  15      1   5   10  15      1   5   10  15     1   5   10  15
Misclassification at
  20% missing       N   N   N   N       N   N   N   N       N   N   N   N      N   N   N   N
  40% missing       Y   Y   Y   Y       Y   Y   Y   Y       Y   N   N   N      N   N   N   N
  50% missing       Y   Y   Y   Y       Y   Y   Y   Y       Y   N   N   N      N   N   N   N
(Y = misclassification occurred; N = no misclassification)

Table 6.13 Methods Comparison
7 CONCLUSIONS
In this project, various techniques for treating missing data have been studied in
detail, and several of them were used in the evaluation. The goal is to find the most
reliable method for a given set of cases. Beforehand, the mechanism of missingness
plays an important role in selecting the method to be used. Three known mechanisms
have been discussed: Missing Completely At Random (MCAR), Missing At Random (MAR)
and Missing Not At Random (MNAR). Knowing their properties indicates which method
is suitable.
Deletion is clearly the least favorable method, as it leads to biased solutions.
However, it may be considered when sufficient data remain to represent the sample
population. Deletion is widely used as the default in many software packages because
of its simplicity and fast computation.
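The deletion described above is listwise (complete-case) deletion: any record containing a missing entry is discarded. A minimal sketch, in Python/NumPy rather than the Matlab used in the appendices, with NaN marking the missing cells:

```python
import numpy as np

# Toy data: NaN marks a missing cell; the middle row is incomplete.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, np.nan, 6.0],
                 [7.0, 8.0, 9.0]])

# Listwise deletion keeps only the rows with no missing entry.
complete = data[~np.isnan(data).any(axis=1)]
print(complete.shape)  # (2, 3): the incomplete row is dropped
```

The same one-liner scales to any number of variables; the cost is the loss of every partially observed record.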
Imputation methods should be chosen when the complete cases remaining in the sample
are not sufficient to support our decisions. In contrast to deletion, imputation
predicts values for the missing cells, so that sufficient data are available to
support the analysis. With the various imputation methods available, one can select
the most appropriate method for the missingness mechanism at hand. Some imputation
methods, such as hot deck imputation, appear to be reliable, but again this depends
on the missingness mechanism.
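As a concrete illustration, the simplest of these methods, unconditional mean imputation, fills each missing cell with the mean of the observed values in its column. A Python/NumPy sketch with invented data:

```python
import numpy as np

data = np.array([[1.0, 2.0],
                 [np.nan, 4.0],
                 [3.0, np.nan]])

# Column means computed over the observed entries only.
col_means = np.nanmean(data, axis=0)                # [2.0, 3.0]
# Every missing cell is replaced by its column mean.
imputed = np.where(np.isnan(data), col_means, data)
```

Mean imputation restores a complete data matrix but understates the variance of the imputed variables.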
With multiple imputation, confidence in the predicted values is higher, and
misleading inferences are therefore less likely. However, multiple imputation adds
complexity to the analysis; if time is a constraint, it should not be the method of
choice.
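The procedure can be sketched as: impute the data m times with random draws rather than a single fixed value, analyse each completed data set, and pool the m estimates. A Python/NumPy illustration; the normal model for the draws and m = 20 are assumptions made for this example, not part of the project:

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.array([1.2, np.nan, 0.8, np.nan, 1.0])
observed = y[~np.isnan(y)]
n_missing = int(np.isnan(y).sum())

m = 20                      # number of imputed data sets (illustrative choice)
estimates = []
for _ in range(m):
    completed = y.copy()
    # Draw fill-in values from a normal fitted to the observed data,
    # so each completed data set differs slightly from the others.
    completed[np.isnan(y)] = rng.normal(observed.mean(),
                                        observed.std(ddof=1), n_missing)
    estimates.append(completed.mean())

pooled = float(np.mean(estimates))   # pooled point estimate across imputations
```

The spread of the m estimates around the pooled value is what gives multiple imputation its extra measure of confidence: it quantifies the uncertainty introduced by the missing cells themselves.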
Reweighting uses the observed data for making inferences. The missing values are
taken into account by assigning weights to the observed data; they are not ignored,
but are instead represented by the observed cases. Combining reweighting with
regression presents little difficulty, since plenty of prediction software is
available.
APPENDIX A – OVERALL MEAN IMPUTATION
rand('seed',1);
x=rand(500,3)*sqrt(10)+5;                 % complete random data
[m,n]=size(x);
d=(abs(x-(5*ones(m,n)))).^2;
mu=mean(d)                                % mean of the data set
Mu=(mu'*ones(1,m))';                      % vector for mu
sigma=cov(d);                             % covariance for the data set

% CLASS 1
%==============================================
c1=d(1:m/2,:);                            % data for class 1
muc1=mean(c1)                             % mean for class 1
sigmac1=cov(c1);                          % covariance for c1
Muc1=(muc1'*ones(1,m/2))';
varc1=std(c1);
prc1=0.5;                                 % prior probability of class 1

% CLASS 2
%=====================================================
c2=d((m/2)+1:m,:);                        % data for class 2
muc2=mean(c2)                             % mean for class 2
sigmac2=cov(c2);                          % covariance for c2
Muc2=(muc2'*ones(1,m/2))';
varc2=std(c2);
prc2=1-prc1;                              % prior probability of class 2

%==========================================================================
% CREATING MISSING DATA
mis=250;
idx1=unique(floor(rand(mis,1)*499+1));    % missing-data rows for variable 1
idx1=idx1(randperm(length(idx1)));
idx1=idx1(1:100);
idx2=unique(floor(rand(mis,1)*499+1));    % missing-data rows for variable 2
idx2=idx2(randperm(length(idx2)));
idx2=idx2(1:100);
idx3=unique(floor(rand(mis,1)*499+1));    % missing-data rows for variable 3
idx3=idx3(randperm(length(idx3)));
idx3=idx3(1:100);
dm=d;
dm(idx1,1)=0;
dm(idx2,2)=0;
dm(idx3,3)=0;
dmiss=dm;

%======================================================================
% OVERALL MEAN IMPUTATION
mudmiss=mean(dmiss);
dmiss(idx1,1)=mudmiss(1);
dmiss(idx2,2)=mudmiss(2);
dmiss(idx3,3)=mudmiss(3);
doverall_mean=dmiss;
mu_doverall_mean=mean(doverall_mean);
stdv_doverall_mean=std(doverall_mean);
sigma_doverall_mean=cov(doverall_mean);

%%% CLASS 1
c1_doverall_mean=doverall_mean(1:(m/2),:);
muc1_doverall_mean=mean(c1_doverall_mean);
var_c1_doverall_mean=std(c1_doverall_mean);
sigmac1_doverall_mean=cov(c1_doverall_mean);

%%% CLASS 2
c2_doverall_mean=doverall_mean((m/2)+1:m,:);
muc2_doverall_mean=mean(c2_doverall_mean);
var_c2_doverall_mean=std(c2_doverall_mean);
sigmac2_doverall_mean=cov(c2_doverall_mean);

% T TEST
% Hypothesis testing
% Null hypothesis : there is no difference from the complete-case result
% h=0 : cannot reject the null hypothesis at the alpha level of significance
% h=1 : reject the null hypothesis at the significance level
[h1,sig1,ci1]=ttest2(doverall_mean(:,1),d(:,1),0.05)
[h2,sig2,ci2]=ttest2(doverall_mean(:,2),d(:,2),0.05)
[h3,sig3,ci3]=ttest2(doverall_mean(:,3),d(:,3),0.05)

% TEST
test=input(' enter test values ');

% Discriminant functions for the complete data
Acc1=(-0.5.*(test)*inv(sigmac1)*(test)');
Bcc1=test*inv(sigmac1)*muc1';
Ccc1=-0.5*muc1*inv(sigmac1)*muc1';
Dcc1=log(det(sigmac1));
g1=Acc1+Bcc1+Ccc1-Dcc1+log(prc1);
Acc2=(-0.5.*(test)*inv(sigmac2)*(test)');
Bcc2=test*inv(sigmac2)*muc2';
Ccc2=-0.5*muc2*inv(sigmac2)*muc2';
Dcc2=log(det(sigmac2));
g2=Acc2+Bcc2+Ccc2-Dcc2+log(prc2);
res=g1-g2;
if res>0
    disp('complete case = c1')
else
    disp('complete case = c2')
end

% Discriminant functions for the mean-imputed data
Ac1=(-0.5.*(test)*inv(sigmac1_doverall_mean)*(test)');
Bc1=test*inv(sigmac1_doverall_mean)*muc1_doverall_mean';
Cc1=-0.5*muc1_doverall_mean*inv(sigmac1_doverall_mean)*muc1_doverall_mean';
Dc1=log(det(sigmac1_doverall_mean));
g1mdt=Ac1+Bc1+Cc1-Dc1+log(prc1);
Ac2=(-0.5.*(test)*inv(sigmac2_doverall_mean)*(test)');
Bc2=test*inv(sigmac2_doverall_mean)*muc2_doverall_mean';
Cc2=-0.5*muc2_doverall_mean*inv(sigmac2_doverall_mean)*muc2_doverall_mean';
Dc2=log(det(sigmac2_doverall_mean));
g2mdt=Ac2+Bc2+Cc2-Dc2+log(prc2);
resmdt=g1mdt-g2mdt;
if resmdt>0
    disp('overall mean imputation = c1')
else
    disp('overall mean imputation = c2')
end
APPENDIX B – CLASS MEAN IMPUTATION
rand('seed',1);
x=rand(400,3)*sqrt(15)+5;                 % complete random data
xreg=rand(500,3)+sqrt(20)+5;
[m,n]=size(x);
d=(abs(x-(5*ones(m,n)))).^2;
mu=mean(d)                                % mean of the data set
Mu=(mu'*ones(1,m))';                      % vector for mu
sigma=cov(d);                             % covariance for the data set

% CLASS 1
%==============================================
c1=d(1:m/2,:);                            % data for class 1
muc1=mean(c1)                             % mean for class 1
sigmac1=cov(c1);                          % covariance for c1
Muc1=(muc1'*ones(1,m/2))';
varc1=std(c1);
pc1=exp(-0.5.*(c1-Muc1)*inv(sigmac1).*(c1-Muc1))/(2*pi)^(0.5*m/2)*(det(sigmac1)^0.5); % p(x|c1)
prc1=0.5;                                 % prior probability of class 1

% CLASS 2
%=====================================================
c2=d((m/2)+1:m,:);                        % data for class 2
muc2=mean(c2)                             % mean for class 2
sigmac2=cov(c2);                          % covariance for c2
Muc2=(muc2'*ones(1,m/2))';
varc2=std(c2);
pc2=exp(-0.5.*(c2-Muc2)*inv(sigmac2).*(c2-Muc2))/(2*pi)^(0.5*m/2)*(det(sigmac2)^0.5); % p(x|c2)
prc2=1-prc1;                              % prior probability of class 2

%==========================================================================
% CREATING MISSING DATA
mis=200;
idx1=unique(floor(rand(mis,1)*(m-1)+1));  % missing-data rows for variable 1
idx1=idx1(randperm(length(idx1)));
idx1=sort(idx1(1:100));
idx2=unique(floor(rand(mis,1)*(m-1)+1));  % missing-data rows for variable 2
idx2=idx2(randperm(length(idx2)));
idx2=sort(idx2(1:100));
idx3=unique(floor(rand(mis,1)*(m-1)+1));  % missing-data rows for variable 3
idx3=idx3(randperm(length(idx3)));
idx3=sort(idx3(1:100));
dm=d;
dm(idx1,1)=0;
dm(idx2,2)=0;
dm(idx3,3)=0;
dmiss=dm;

%======================================================================
% CLASS MEAN IMPUTATION
%%% CLASS 1
c1_dmiss_class_imp=dmiss(1:m/2,:);
c1idx1=max(find(idx1<=m/2));              % number of missing rows in class 1, variable 1
c1idx2=max(find(idx2<=m/2));              % number of missing rows in class 1, variable 2
c1idx3=max(find(idx3<=m/2));              % number of missing rows in class 1, variable 3
idx11=idx1(1:c1idx1);
idx12=idx2(1:c1idx2);
idx13=idx3(1:c1idx3);
c1_dmiss_class_imp(idx11,1)=muc1(1);      % class-1 mean imputation, variable 1
c1_dmiss_class_imp(idx12,2)=muc1(2);      % class-1 mean imputation, variable 2
c1_dmiss_class_imp(idx13,3)=muc1(3);      % class-1 mean imputation, variable 3
muc1_dmiss_class_imp=mean(c1_dmiss_class_imp);
varc1_dmiss_class_imp=std(c1_dmiss_class_imp);
sigmac1_dmiss_class_imp=cov(c1_dmiss_class_imp);

%%% CLASS 2
idx21=idx1(c1idx1+1:100);
idx22=idx2(c1idx2+1:100);
idx23=idx3(c1idx3+1:100);
dmiss(idx21,1)=muc2(1);                   % class-2 mean imputation, variable 1
dmiss(idx22,2)=muc2(2);                   % class-2 mean imputation, variable 2
dmiss(idx23,3)=muc2(3);                   % class-2 mean imputation, variable 3
c2_dmiss_class_imp=dmiss((m/2)+1:m,1:n);
muc2_dmiss_class_imp=mean(c2_dmiss_class_imp);
var_c2_dmiss_class_imp=std(c2_dmiss_class_imp);
sigmac2_dmiss_class_imp=cov(c2_dmiss_class_imp);

% WHOLE DATA AFTER CLASS MEAN IMPUTATION
dmiss_class_imp=[c1_dmiss_class_imp; c2_dmiss_class_imp];

% T TEST
% Hypothesis testing
% Null hypothesis : there is no difference from the complete-case result
% h=0 : cannot reject the null hypothesis at the alpha level of significance
% h=1 : reject the null hypothesis at the significance level
[h1,sig1,ci1]=ttest2(dmiss_class_imp(:,1),d(:,1),0.01)
[h2,sig2,ci2]=ttest2(dmiss_class_imp(:,2),d(:,2),0.01)
[h3,sig3,ci3]=ttest2(dmiss_class_imp(:,3),d(:,3),0.01)

% TEST
%=====================================
test=input(' enter test values ');
% g1 = ln(p(x|c1)) + ln(P(c1))
Acc1=(-0.5.*(test)*inv(sigmac1)*(test)');
Bcc1=test*inv(sigmac1)*muc1';
Ccc1=-0.5*muc1*inv(sigmac1)*muc1';
Dcc1=log(det(sigmac1));
g1=Acc1+Bcc1+Ccc1-Dcc1+log(prc1);
Acc2=(-0.5.*(test)*inv(sigmac2)*(test)');
Bcc2=test*inv(sigmac2)*muc2';
Ccc2=-0.5*muc2*inv(sigmac2)*muc2';
Dcc2=log(det(sigmac2));
g2=Acc2+Bcc2+Ccc2-Dcc2+log(prc2);
res=g1-g2;
if res>0
    disp('complete case = c1')
else
    disp('complete case = c2')
end
%=====================================
Ac1=(-0.5.*(test)*inv(sigmac1_dmiss_class_imp)*(test)');
Bc1=test*inv(sigmac1_dmiss_class_imp)*muc1_dmiss_class_imp';
Cc1=-0.5*muc1_dmiss_class_imp*inv(sigmac1_dmiss_class_imp)*muc1_dmiss_class_imp';
Dc1=log(det(sigmac1_dmiss_class_imp));
g1imp=Ac1+Bc1+Cc1-Dc1+log(prc1);
Ac2=(-0.5.*(test)*inv(sigmac2_dmiss_class_imp)*(test)');
Bc2=test*inv(sigmac2_dmiss_class_imp)*muc2_dmiss_class_imp';
Cc2=-0.5*muc2_dmiss_class_imp*inv(sigmac2_dmiss_class_imp)*muc2_dmiss_class_imp';
Dc2=log(det(sigmac2_dmiss_class_imp));
g2imp=Ac2+Bc2+Cc2-Dc2+log(prc2);
resimp=g1imp-g2imp;
if resimp>0
    disp('class mean imputation = c1')
else
    disp('class mean imputation = c2')
end
APPENDIX C – REGRESSION IMPUTATION

rand('seed',1);
x=rand(500,3)*sqrt(1)+5;                  % complete random data
xaux=rand(400,3)*sqrt(15)+5;              % auxiliary data for the regression
[m,n]=size(x);
d=(abs(x-(5*ones(m,n)))).^2;
mu=mean(d)                                % mean of the data set
Mu=(mu'*ones(1,m))';                      % vector for mu
sigma=cov(d);                             % covariance for the data set
var=std(d);

% CLASS 1
%==============================================
c1=d(1:m/2,:);                            % data for class 1
muc1=mean(c1)                             % mean for class 1
sigmac1=cov(c1);                          % covariance for c1
varc1=std(c1);
Muc1=(muc1'*ones(1,m/2))';
pc1=exp(-0.5.*(c1-Muc1)*inv(sigmac1).*(c1-Muc1))/(2*pi)^(0.5*m/2)*(det(sigmac1)^0.5); % p(x|c1)
prc1=0.5;                                 % prior probability of class 1

% CLASS 2
%=====================================================
c2=d((m/2)+1:m,:);                        % data for class 2
muc2=mean(c2)                             % mean for class 2
sigmac2=cov(c2);                          % covariance for c2
varc2=std(c2);
Muc2=(muc2'*ones(1,m/2))';
pc2=exp(-0.5.*(c2-Muc2)*inv(sigmac2).*(c2-Muc2))/(2*pi)^(0.5*m/2)*(det(sigmac2)^0.5); % p(x|c2)
prc2=1-prc1;                              % prior probability of class 2

%==========================================================================
% INCOMPLETE DATA
mis=200;
idx1=unique(floor(rand(mis,1)*499+1));    % missing-data rows for variable 1
idx1=idx1(randperm(length(idx1)));
idx1=idx1(1:100);
idx2=unique(floor(rand(mis,1)*499+1));    % missing-data rows for variable 2
idx2=idx2(randperm(length(idx2)));
idx2=idx2(1:100);
idx3=unique(floor(rand(mis,1)*499+1));    % missing-data rows for variable 3
idx3=idx3(randperm(length(idx3)));
idx3=idx3(1:100);
dm=d;
dm(idx1,1)=0;                             % inserting missing data into variable 1
dm(idx2,2)=0;                             % inserting missing data into variable 2
dm(idx3,3)=0;                             % inserting missing data into variable 3

%================================================================
% PREPROCESS DATA
dobs=dm(dm>0);                            % observed values, stacked column by column
p=length(dobs);
dmm1=dobs(1:p/3);                         % observed values of variable 1
dmm2=dobs((p/3)+1:2*p/3);                 % observed values of variable 2
dmm3=dobs((2*p/3)+1:p);                   % observed values of variable 3
dmm=[dmm1 dmm2 dmm3];
sigmam=cov(dmm);                          % covariance with missing data removed
varm=std(dmm);                            % standard deviation with missing data removed
mum=mean(dmm);                            % mean with missing data removed

% CLASS 1 WITH MISSING DATA
c1m=dm(1:m/2,:);                          % data for class 1
muc1m=mean(c1m)                           % new mean for class 1
sigmac1m=cov(c1m);                        % covariance for c1
varc1m=std(c1m);                          % standard deviation for c1m
Muc1m=(muc1m'*ones(1,m/2))';
prc1m=prc1;                               % prior probability of class 1

% CLASS 2 WITH MISSING DATA
c2m=dm((m/2)+1:m,:);                      % data for class 2
muc2m=mean(c2m)                           % new mean for class 2
sigmac2m=cov(c2m);                        % covariance for c2
varc2m=std(c2m);
Muc2m=(muc2m'*ones(1,m/2))';
prc2m=1-prc1m;                            % prior probability of class 2

% REGRESSION
% Y = Xb
bd1=regress(dmm1,xaux);
bd2=regress(dmm2,xaux);
bd3=regress(dmm3,xaux);
dreg1=x*bd1;
dreg2=x*bd2;
dreg3=x*bd3;
dreg=[dreg1 dreg2 dreg3];
dm(idx1,1)=dreg1(idx1);                   % fill missing cells with regression predictions
dm(idx2,2)=dreg2(idx2);
dm(idx3,3)=dreg3(idx3);
dnew=dm;
mudnew=mean(dnew);
sigmadnew=cov(dnew);
c1reg=dnew(1:m/2,:);
muc1reg=mean(c1reg);
sigmac1reg=cov(c1reg);
varc1reg=std(c1reg);
c2reg=dnew((m/2)+1:m,:);
muc2reg=mean(c2reg);
sigmac2reg=cov(c2reg);
varc2reg=std(c2reg);

% T TEST
% Hypothesis testing
% h=0 : cannot reject the null hypothesis at the alpha level of significance
% h=1 : reject the null hypothesis at the significance level
[h1,sig1,ci1]=ttest2(dnew(:,1),d(:,1),0.05)
[h2,sig2,ci2]=ttest2(dnew(:,2),d(:,2),0.05)
[h3,sig3,ci3]=ttest2(dnew(:,3),d(:,3),0.05)

% TEST
test=input(' enter test values ');
% g1 = ln(p(x|c1)) + ln(P(c1))
Acc1=(-0.5.*(test)*inv(sigmac1)*(test)');
Bcc1=test*inv(sigmac1)*muc1';
Ccc1=-0.5*muc1*inv(sigmac1)*muc1';
Dcc1=log(det(sigmac1));
g1=Acc1+Bcc1+Ccc1-Dcc1+log(prc1);
Acc2=(-0.5.*(test)*inv(sigmac2)*(test)');
Bcc2=test*inv(sigmac2)*muc2';
Ccc2=-0.5*muc2*inv(sigmac2)*muc2';
Dcc2=log(det(sigmac2));
g2=Acc2+Bcc2+Ccc2-Dcc2+log(prc2);
res=g1-g2;
if res>0
    disp('complete case = c1')
else
    disp('complete case = c2')
end
%=====================================
Ac1=(-0.5.*(test)*inv(sigmac1reg)*(test)');
Bc1=test*inv(sigmac1reg)*muc1reg';
Cc1=-0.5*muc1reg*inv(sigmac1reg)*muc1reg';
Dc1=log(det(sigmac1reg));
g1reg=Ac1+Bc1+Cc1-Dc1+log(prc1m);
Ac2=(-0.5.*(test)*inv(sigmac2reg)*(test)');
Bc2=test*inv(sigmac2reg)*muc2reg';
Cc2=-0.5*muc2reg*inv(sigmac2reg)*muc2reg';
Dc2=log(det(sigmac2reg));
g2reg=Ac2+Bc2+Cc2-Dc2+log(prc2m);
resreg=g1reg-g2reg;
if resreg>0
    disp('regression = c1')
else
    disp('regression = c2')
end
APPENDIX D – REWEIGHTING

rand('seed',1);
x=rand(500,3)*sqrt(1)+5;                  % complete random data
xreg=rand(400,3)*sqrt(1)+5;
[m,n]=size(x);
d=(abs(x-(5*ones(m,n)))).^2;
rd=x-5*ones(m,n);
mu=mean(d)                                % mean of the data set
Mu=(mu'*ones(1,m))';                      % vector for mu
sigma=cov(d);                             % covariance for the data set
var=std(d);

% CLASS 1
%==============================================
c1=d(1:m/2,:);                            % data for class 1
muc1=mean(c1)                             % mean for class 1
sigmac1=cov(c1);                          % covariance for c1
varc1=std(c1);
Muc1=(muc1'*ones(1,m/2))';
pc1=exp(-0.5.*(c1-Muc1)*inv(sigmac1).*(c1-Muc1))/(2*pi)^(0.5*m/2)*(det(sigmac1)^0.5); % p(x|c1)
prc1=0.5;                                 % prior probability of class 1

% CLASS 2
%=====================================================
c2=d((m/2)+1:m,:);                        % data for class 2
muc2=mean(c2)                             % mean for class 2
sigmac2=cov(c2);                          % covariance for c2
varc2=std(c2);
Muc2=(muc2'*ones(1,m/2))';
pc2=exp(-0.5.*(c2-Muc2)*inv(sigmac2).*(c2-Muc2))/(2*pi)^(0.5*m/2)*(det(sigmac2)^0.5); % p(x|c2)
prc2=1-prc1;                              % prior probability of class 2

%==========================================================================
% INCOMPLETE DATA
mis=200;
idx1=unique(floor(rand(mis,1)*499+1));    % missing-data rows for variable 1
idx1=idx1(randperm(length(idx1)));
idx11=idx1(1:50);
idx12=idx1(51:100);
idx2=unique(floor(rand(mis,1)*499+1));    % missing-data rows for variable 2
idx2=idx2(randperm(length(idx2)));
idx21=idx2(1:50);
idx22=idx2(51:100);
idx3=unique(floor(rand(mis,1)*499+1));    % missing-data rows for variable 3
idx3=idx3(randperm(length(idx3)));
idx3=idx3(1:100);
idx31=idx3(1:50);
idx32=idx3(51:100);
dm=d;
dm(idx11,1)=0;                            % inserting missing data into variable 1
dm(idx12,1)=0;
dm(idx21,2)=0;                            % inserting missing data into variable 2
dm(idx22,2)=0;
dm(idx31,3)=0;                            % inserting missing data into variable 3
dm(idx32,3)=0;

%================================================================
% PREPROCESS DATA
dm1=dm(1:m,1);
dm2=dm(1:m,2);
dm3=dm(1:m,3);
dm1=dm1(dm1>0);                           % observed values of variable 1
dm2=dm2(dm2>0);                           % observed values of variable 2
dm3=dm3(dm3>0);                           % observed values of variable 3
dmm=[dm1 dm2 dm3];
rddmm=sqrt(dmm);

%========================
% REWEIGHTING
% Weights are assigned by binning the root distances into five bands.
weight=rddmm;
[v11,w11]=find(rddmm(:,1)<=0.2);
[v12,w12]=find(rddmm(:,1)>0.2&rddmm(:,1)<=0.4);
[v13,w13]=find(rddmm(:,1)>0.4&rddmm(:,1)<=0.6);
[v14,w14]=find(rddmm(:,1)>0.6&rddmm(:,1)<=0.8);
[v15,w15]=find(rddmm(:,1)>0.8&rddmm(:,1)<=1.0);
weight(v11,1)=1;
weight(v12,1)=0.75;
weight(v13,1)=0.6;
weight(v14,1)=0.45;
weight(v15,1)=0.15;
%==================================
[v21,w21]=find(rddmm(:,2)<=0.2);
[v22,w22]=find(rddmm(:,2)>0.2&rddmm(:,2)<=0.4);
[v23,w23]=find(rddmm(:,2)>0.4&rddmm(:,2)<=0.6);
[v24,w24]=find(rddmm(:,2)>0.6&rddmm(:,2)<=0.8);
[v25,w25]=find(rddmm(:,2)>0.8&rddmm(:,2)<=1.0);
weight(v21,2)=1;
weight(v22,2)=0.75;
weight(v23,2)=0.6;
weight(v24,2)=0.45;
weight(v25,2)=0.15;
%====================================
[v31,w31]=find(rddmm(:,3)<=0.2);
[v32,w32]=find(rddmm(:,3)>0.2&rddmm(:,3)<=0.4);
[v33,w33]=find(rddmm(:,3)>0.4&rddmm(:,3)<=0.6);
[v34,w34]=find(rddmm(:,3)>0.6&rddmm(:,3)<=0.8);
[v35,w35]=find(rddmm(:,3)>0.8&rddmm(:,3)<=1.0);
weight(v31,3)=1;
weight(v32,3)=0.75;
weight(v33,3)=0.6;
weight(v34,3)=0.45;
weight(v35,3)=0.15;
newdmm=(weight.*1);
newx=5*ones(400,3)+newdmm;
rdnew=newx-5*ones(400,n);
sqrdnew=rdnew.^2;

% CLASS 1
c1rew=sqrdnew(1:200,:);
muc1rew=mean(c1rew);
sigmac1rew=cov(c1rew);
varc1rew=std(c1rew);

% CLASS 2
c2rew=sqrdnew(201:400,:);
muc2rew=mean(c2rew);
sigmac2rew=cov(c2rew);
varc2rew=std(c2rew);

% T TEST
% Hypothesis testing
% h=0 : cannot reject the null hypothesis at the alpha level of significance
% h=1 : reject the null hypothesis at the significance level
[h1,sig1,ci1]=ttest2(sqrdnew(:,1),d(:,1),0.05)
[h2,sig2,ci2]=ttest2(sqrdnew(:,2),d(:,2),0.05)
[h3,sig3,ci3]=ttest2(sqrdnew(:,3),d(:,3),0.05)

% TEST
test=input(' enter test values ');
% g1 = ln(p(x|c1)) + ln(P(c1))
Acc1=(-0.5.*(test)*inv(sigmac1)*(test)');
Bcc1=test*inv(sigmac1)*muc1';
Ccc1=-0.5*muc1*inv(sigmac1)*muc1';
Dcc1=log(det(sigmac1));
g1=Acc1+Bcc1+Ccc1-Dcc1+log(prc1);
Acc2=(-0.5.*(test)*inv(sigmac2)*(test)');
Bcc2=test*inv(sigmac2)*muc2';
Ccc2=-0.5*muc2*inv(sigmac2)*muc2';
Dcc2=log(det(sigmac2));
g2=Acc2+Bcc2+Ccc2-Dcc2+log(prc2);
res=g1-g2;
if res>0
    disp('complete case = c1')
else
    disp('complete case = c2')
end
%=====================================
Ac1=(-0.5.*(test)*inv(sigmac1rew)*(test)');
Bc1=test*inv(sigmac1rew)*muc1rew';
Cc1=-0.5*muc1rew*inv(sigmac1rew)*muc1rew';
Dc1=log(det(sigmac1rew));
g1rew=Ac1+Bc1+Cc1-Dc1+log(prc1);
Ac2=(-0.5.*(test)*inv(sigmac2rew)*(test)');
Bc2=test*inv(sigmac2rew)*muc2rew';
Cc2=-0.5*muc2rew*inv(sigmac2rew)*muc2rew';
Dc2=log(det(sigmac2rew));
g2rew=Ac2+Bc2+Cc2-Dc2+log(prc2);
resrew=g1rew-g2rew;
if resrew>0
    disp('reweighting = c1')
else
    disp('reweighting = c2')
end