Evaluating data quality issues from an industrial data set

Gernot Liebchen, Bheki Twala, Mark Stephens, Martin Shepperd, Michelle Cartwright
Gernot.Liebchen@Brunel.ac.uk
What is it all about?
• Motivations
• Dataset – the origin & quality issues
• Noise & cleaning methods
• The Experiment
• Issues & conclusion
• Future Work
Motivations
• A previous investigation compared 3 noise handling methods (robust algorithms [pruning], filtering, and polishing)
• Predictive accuracy was highest with polishing, followed by pruning, and only then by filtering
• But suspicions about these results were raised (at EASE)
Suspicions about previous investigation
• The dataset contained missing values which were imputed (artificially created) during the building of the model (a decision tree)
• Polishing alters the data (what impact can that have?)
• The methods were evaluated using the predictions of another decision tree – can the findings be supported by a metrics specialist?
Why do we bother?
• Good quality data is important for good quality predictions and assessments
• How can we hope for good quality results if the quality of the input data is not good?
• The data is used for a variety of different purposes – especially analysis and estimation support
The Dataset
• Given a large dataset provided by EDS
• The original dataset contains more than 10,000 cases with 22 attributes
• Contains information about software projects carried out since the beginning of the 1990s
• Some attributes are more administrative (e.g. Project Name, Project ID) and will not have any impact on software productivity
Suspicions
• The data might contain noise
• This was confirmed by a preliminary analysis of the data, which also indicated the existence of outliers
How could it occur? (in the case of the dataset)
• Input errors (some teams might be more meticulous than others), or the person approving the data might not be as meticulous
• Misunderstood standards
• The input tool might not provide range checking (or only limited checking)
• A “Service Excellence” dashboard at headquarters
• Local management pressure
Suspicious Data Example
• Start Date: 01/08/2002 vs. 01/06/2002
• Finish Date: 24/02/2004 vs. 09/02/2004
• Name: *******Rel 24 vs. *******Rel 24
• FP Count: 1522 vs. 1522
• Effort: 38182.75 vs. 33461.5
• Country: IRELAND vs. UK
• Industry Sector: Government vs. Government
• Project Type: Enhance. vs. Enhance.
• Etc.
• But there were also examples with extremely high/low FP counts per hour (1 FP for 6916.25 hours; 52 FP in 4 hours; 1746 FP in 468 hours) – a simple ratio check can flag these, as sketched below
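As a minimal sketch of such a ratio check in Python: the three ratios are the ones quoted above, but the column names, example frame, and plausibility bounds are illustrative assumptions, not taken from the EDS data.

```python
import pandas as pd

# Column names and rows are illustrative; the three cases are the
# extreme productivity examples quoted on the slide.
df = pd.DataFrame({
    "fp_count":     [1,       52,  1746],
    "effort_hours": [6916.25, 4.0, 468.0],
})
df["hours_per_fp"] = df["effort_hours"] / df["fp_count"]

# Flag implausibly low or high effort per function point;
# the bounds are arbitrary guesses, not from the study.
LOW, HIGH = 0.5, 100.0
df["suspicious"] = (df["hours_per_fp"] < LOW) | (df["hours_per_fp"] > HIGH)
print(df)
```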
What imperfections could occur?
• Noise – random errors
• Outliers – exceptional “true” cases
• Missing data
• From now on, noise and outliers will both be called noise, because both are unwanted
Noise detection can be
• Distance based (e.g. visualisation methods; Cook's, Mahalanobis, and Euclidean distance; distance clustering) – see the sketch below
• Distribution based (e.g. neural networks, forward search algorithms, and robust tree modelling)
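A rough sketch of the distance-based route: flag cases whose Mahalanobis distance from the centroid is extreme. The data here is synthetic (the EDS set is not public), and the chi-square cut-off is one common convention, not the study's method.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2

# Synthetic numeric project attributes (e.g. size, effort);
# a few extreme cases are planted deliberately.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X[:5] *= 10

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# Squared Mahalanobis distance of each case from the centroid.
d2 = np.array([mahalanobis(x, mean, cov_inv) ** 2 for x in X])

# Under approximate normality d2 follows a chi-square distribution
# with p degrees of freedom; flag cases beyond the 97.5th percentile.
threshold = chi2.ppf(0.975, df=X.shape[1])
print(np.where(d2 > threshold)[0])
```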
What to do with noise?
• First, detection: we used decision trees – usually a pattern detection tool in data mining – here used to categorise the data in a training set, with cases tested in a test set (a sketch follows below)
• Then cleaning, with 3 basic options: polishing, filtering, pruning
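A minimal sketch of that detection step, assuming scikit-learn: train a tree on the other folds and treat test-fold cases the tree misclassifies as candidate noise. The cross-validated setup is an assumption; the original study's exact protocol may differ.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def flag_noise(X, y, n_splits=10):
    """Return indices of cases misclassified by a tree trained on
    the remaining folds -- treated here as candidate noise."""
    noisy = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[train_idx], y[train_idx])
        wrong = tree.predict(X[test_idx]) != y[test_idx]
        noisy.extend(test_idx[wrong])
    return np.sort(np.array(noisy))
```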
Polishing/Filtering/Pruning
• Polishing – identifying the noise and correcting it
• Filtering – identifying the noise and eliminating it
• Pruning – avoiding overfitting (trying to ignore leverage effects); the instances which lead to overfitting can be seen as noise and are taken out
• (a rough illustration of all three follows below)
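Given a list of flagged indices (e.g. from the detection sketch above), the three options differ mainly in what is done next. This is a rough illustration under those assumptions, not the study's implementation.

```python
import numpy as np

def filter_cases(X, y, noisy_idx):
    """Filtering: drop the flagged instances entirely."""
    keep = np.setdiff1d(np.arange(len(y)), noisy_idx)
    return X[keep], y[keep]

def polish_cases(X, y, noisy_idx, model):
    """Polishing: keep the instances but replace the suspect
    target value with a fitted model's prediction."""
    y = y.copy()
    y[noisy_idx] = model.predict(X[noisy_idx])
    return X, y

# Pruning needs no separate data step: the tree itself is simplified
# (e.g. cost-complexity pruning via DecisionTreeClassifier(ccp_alpha=...)
# in scikit-learn), so noisy instances no longer drive extra branches.
```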
What did we do? & How did we do it?
• Compared the results of filtering and pruning and discussed the implications of pruning
• Reduced the dataset to eliminate cases with missing values (avoiding missing value imputation)
• Produced lists of “noisy” instances and their polished counterparts
• Passed them on to Mark (as a metrics specialist)
Results
• Filtering produced a list of 226 cases from 436 (36% in the noise list / 21% in the cleaned set)
• Pruning produced a list of 191 cases from 436 (33% in the noise list / 25% in the cleaned set)
• Both lists were inspected, and both contain a large number of possible true cases as well as unrealistic cases (in terms of productivity)
Results 2
• By just inspecting the historical data it was not possible to judge which method performed better
• The decision tree as a noise detector does not detect unrealistic instances, only outliers in the dataset; this can only be overcome with domain knowledge
So what about polishing?
• Polishing does not necessarily alter size or effort, so we are still left with unrealistic instances
• It merely makes them fit the regression model
• Is this acceptable from the point of view of the data owner? It depends on the application of the results. What if unrealistic cases impact the model?
Issues/Conclusions
• In order to build the models we had to categorise the dependent variable into 3 categories (<=1042, <=2985.5, >2985.5), BUT these categories appeared too coarse for our evaluation of the predictions (see the sketch below)
• If we know there are unrealistic cases, we should really take them out before we apply the cleaning methods (to avoid the inclusion of these cases in the building of the model)
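For concreteness, the categorisation step could look like the following. The cut points are the ones quoted on the slide; treating effort as the dependent variable, and the sample values, are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Cut points from the slide; 'effort' as the dependent variable
# and the sample rows are illustrative assumptions.
df = pd.DataFrame({"effort": [800.0, 2000.0, 5000.0]})
df["effort_class"] = pd.cut(df["effort"],
                            bins=[-np.inf, 1042, 2985.5, np.inf],
                            labels=["<=1042", "<=2985.5", ">2985.5"])
print(df)
```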
Where to go from here?
• Rerun the experiment without the “unrealistic cases”
• Simulate a dataset from a known model, induce noise and missing values, and evaluate the methods with full knowledge of the real underlying model (a rough sketch below)
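A rough sketch of that simulation idea; the linear generating model, corruption factors, and noise/missingness rates are all arbitrary choices here, not part of the planned study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Known underlying model: effort grows linearly with size (assumed form).
size = rng.uniform(10, 2000, n)
effort = 5.0 * size + rng.normal(0, 50, n)

# Induce noise: corrupt 10% of effort values by a random factor.
noisy = rng.choice(n, size=n // 10, replace=False)
effort[noisy] *= rng.uniform(0.1, 10, len(noisy))

# Induce missingness: blank out 5% of size values.
missing = rng.choice(n, size=n // 20, replace=False)
size[missing] = np.nan

# Because the generating model is known, each cleaning method can be
# scored directly on how well it recovers the true size-effort relation.
```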
What was it all about?
• Motivations
• Dataset – the origin & quality issues
• Noise & cleaning methods
• The Experiment
• Issues & conclusion
• Future Work
Any Questions?