Evaluating data quality issues from an industrial data set

Gernot Liebchen, Bheki Twala, Mark Stephens, Martin Shepperd, Michelle Cartwright
Gernot.Liebchen@Brunel.ac.uk
What is it all about?
• Motivations
• Dataset – the origin & quality issues
• Noise & cleaning methods
• The Experiment
• Issues & conclusion
• Future Work
Motivations
• A previous investigation compared 3 noise handling methods (robust algorithms [pruning], filtering, and polishing)
• Predictive accuracy was highest with polishing, followed by pruning, and only then by filtering
• But suspicions about these results were raised (at EASE)
Suspicions about previous investigation
• The dataset contained missing values which were imputed (artificially created) during the building of the model (a decision tree)
• Polishing alters the data (what impact can that have?)
• The methods were evaluated using the predictions of another decision tree – can the findings be supported by a metrics specialist?
Why do we bother?
• Good quality data is important for good quality predictions and assessments
• How can we hope for good quality results if the quality of the input data is not good?
• The data is used for a variety of different purposes – especially analysis and estimation support
The Dataset
• Given a large dataset provided by EDS
• The original dataset contains more than 10,000 cases with 22 attributes
• Contains information about software projects carried out since the beginning of the 1990s
• Some attributes are more administrative (e.g. Project Name, Project ID) and will not have any impact on software productivity
Suspicions
• The data might contain noise
• This was confirmed by a preliminary analysis of the data, which also indicated the existence of outliers
How could it occur? (in the case of the dataset)
• Input errors (some teams might be more meticulous than others), or the person approving the data might not be as meticulous
• Misunderstood standards
• The input tool might not provide range checking (or only limited checking)
• A “Service Excellence” dashboard at headquarters
• Local management pressure
Suspicious Data Example
• Start Date: 01/08/2002 vs. 01/06/2002
• Finish Date: 24/02/2004 vs. 09/02/2004
• Name: *******Rel 24 vs. *******Rel 24
• FP Count: 1522 vs. 1522
• Effort: 38182.75 vs. 33461.5
• Country: IRELAND vs. UK
• Industry Sector: Government vs. Government
• Project Type: Enhance. vs. Enhance.
• Etc.
• But there were also examples with extremely high/low FP counts per hour (1 FP for 6916.25 hours; 52 FP in 4 hours; 1746 FP in 468 hours) – a simple ratio check can flag these, as sketched below
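As a minimal sketch of such a ratio check in Python: the three ratios are the ones quoted above, but the column names, example frame, and plausibility bounds are illustrative assumptions, not taken from the EDS data.

```python
import pandas as pd

# Column names and rows are illustrative; the three cases are the
# extreme productivity examples quoted on the slide.
df = pd.DataFrame({
    "fp_count":     [1,       52,  1746],
    "effort_hours": [6916.25, 4.0, 468.0],
})
df["hours_per_fp"] = df["effort_hours"] / df["fp_count"]

# Flag implausibly low or high effort per function point;
# the bounds are arbitrary guesses, not from the study.
LOW, HIGH = 0.5, 100.0
df["suspicious"] = (df["hours_per_fp"] < LOW) | (df["hours_per_fp"] > HIGH)
print(df)
```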
What imperfections could occur?
• Noise – random errors
• Outliers – exceptional “true” cases
• Missing data
• From now on, noise and outliers will both be called noise, because both are unwanted
Noise detection can be
• Distance based (e.g. visualisation methods; Cook's, Mahalanobis, and Euclidean distance; distance clustering) – see the sketch below
• Distribution based (e.g. neural networks, forward search algorithms, and robust tree modelling)
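A rough sketch of the distance-based route: flag cases whose Mahalanobis distance from the centroid is extreme. The data here is synthetic (the EDS set is not public), and the chi-square cut-off is one common convention, not the study's method.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2

# Synthetic numeric project attributes (e.g. size, effort);
# a few extreme cases are planted deliberately.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X[:5] *= 10

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# Squared Mahalanobis distance of each case from the centroid.
d2 = np.array([mahalanobis(x, mean, cov_inv) ** 2 for x in X])

# Under approximate normality d2 follows a chi-square distribution
# with p degrees of freedom; flag cases beyond the 97.5th percentile.
threshold = chi2.ppf(0.975, df=X.shape[1])
print(np.where(d2 > threshold)[0])
```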
What to do with noise?
• First, detection: we used decision trees – usually a pattern detection tool in data mining – here used to categorise the data in a training set, with cases tested in a test set (a sketch follows below)
• Then cleaning, with 3 basic options: polishing, filtering, pruning
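A minimal sketch of that detection step, assuming scikit-learn: train a tree on the other folds and treat test-fold cases the tree misclassifies as candidate noise. The cross-validated setup is an assumption; the original study's exact protocol may differ.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def flag_noise(X, y, n_splits=10):
    """Return indices of cases misclassified by a tree trained on
    the remaining folds -- treated here as candidate noise."""
    noisy = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[train_idx], y[train_idx])
        wrong = tree.predict(X[test_idx]) != y[test_idx]
        noisy.extend(test_idx[wrong])
    return np.sort(np.array(noisy))
```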
Polishing/Filtering/Pruning
• Polishing – identifying the noise and correcting it
• Filtering – identifying the noise and eliminating it
• Pruning – avoiding overfitting (trying to ignore leverage effects); the instances which lead to overfitting can be seen as noise and are taken out
• (a rough illustration of all three follows below)
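Given a list of flagged indices (e.g. from the detection sketch above), the three options differ mainly in what is done next. This is a rough illustration under those assumptions, not the study's implementation.

```python
import numpy as np

def filter_cases(X, y, noisy_idx):
    """Filtering: drop the flagged instances entirely."""
    keep = np.setdiff1d(np.arange(len(y)), noisy_idx)
    return X[keep], y[keep]

def polish_cases(X, y, noisy_idx, model):
    """Polishing: keep the instances but replace the suspect
    target value with a fitted model's prediction."""
    y = y.copy()
    y[noisy_idx] = model.predict(X[noisy_idx])
    return X, y

# Pruning needs no separate data step: the tree itself is simplified
# (e.g. cost-complexity pruning via DecisionTreeClassifier(ccp_alpha=...)
# in scikit-learn), so noisy instances no longer drive extra branches.
```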
What did we do? & How did we do it?
• Compared the results of filtering and pruning and discussed the implications of pruning
• Reduced the dataset to eliminate cases with missing values (avoiding missing value imputation)
• Produced lists of “noisy” instances and their polished counterparts
• Passed them on to Mark (as a metrics specialist)
Results
• Filtering produced a list of 226 cases from 436 (36% in the noise list / 21% in the cleaned set)
• Pruning produced a list of 191 cases from 436 (33% in the noise list / 25% in the cleaned set)
• Both lists were inspected, and both contain a large number of possible true cases as well as unrealistic cases (in terms of productivity)
Results 2
• By just inspecting the historical data it was not possible to judge which method performed better
• The decision tree as a noise detector does not detect unrealistic instances, only outliers in the dataset; this can only be overcome with domain knowledge
So what about polishing?
• Polishing does not necessarily alter size or effort, so we are still left with unrealistic instances
• It merely makes them fit the regression model
• Is this acceptable from the point of view of the data owner? It depends on the application of the results. What if unrealistic cases impact the model?
Issues/Conclusions
• In order to build the models we had to categorise the dependent variable into 3 categories (<=1042, <=2985.5, >2985.5), BUT these categories appeared too coarse for our evaluation of the predictions (see the sketch below)
• If we know there are unrealistic cases, we should really take them out before we apply the cleaning methods (to avoid the inclusion of these cases in the building of the model)
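For concreteness, the categorisation step could look like the following. The cut points are the ones quoted on the slide; treating effort as the dependent variable, and the sample values, are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Cut points from the slide; 'effort' as the dependent variable
# and the sample rows are illustrative assumptions.
df = pd.DataFrame({"effort": [800.0, 2000.0, 5000.0]})
df["effort_class"] = pd.cut(df["effort"],
                            bins=[-np.inf, 1042, 2985.5, np.inf],
                            labels=["<=1042", "<=2985.5", ">2985.5"])
print(df)
```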
Where to go from here?
• Rerun the experiment without the “unrealistic cases”
• Simulate a dataset from a known model, induce noise and missing values, and evaluate the methods with full knowledge of the real underlying model (a rough sketch below)
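A rough sketch of that simulation idea; the linear generating model, corruption factors, and noise/missingness rates are all arbitrary choices here, not part of the planned study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Known underlying model: effort grows linearly with size (assumed form).
size = rng.uniform(10, 2000, n)
effort = 5.0 * size + rng.normal(0, 50, n)

# Induce noise: corrupt 10% of effort values by a random factor.
noisy = rng.choice(n, size=n // 10, replace=False)
effort[noisy] *= rng.uniform(0.1, 10, len(noisy))

# Induce missingness: blank out 5% of size values.
missing = rng.choice(n, size=n // 20, replace=False)
size[missing] = np.nan

# Because the generating model is known, each cleaning method can be
# scored directly on how well it recovers the true size-effort relation.
```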
What was it all about?
• Motivations
• Dataset – the origin & quality issues
• Noise & cleaning methods
• The Experiment
• Issues & conclusion
• Future Work
Any Questions?