Detecting Data Errors: Where are we and what needs to be done?

Presentation By: Sitong Che and Srikar Pyda

Written By: Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang

{abedjan,ddong,raulcf,stonebraker}@csail.mit.edu {x4chu,ilyas}@uwaterloo.ca {mouzzani,ntang}@qf.org.qa [email protected]

Introduction

• A multitude of data-cleaning tools exist to detect and potentially repair errors
• It's better to think of data-cleaning solutions as being tailored to detecting particular categories of errors rather than detecting all potential errors
• Data cleaning is important for enterprises because data-centric approaches are becoming critical for innovation in business and science
• Different types of errors often exist in the same dataset
  • Cleaning therefore requires multiple tools in order to detect and repair the variety of errors

Overview: Pragmatic Question

Are these tools robust enough to capture most errors in real-world datasets?

What is the best strategy to holistically run multiple tools to optimize the detection effort?

Data Cleaning Solution Categories

• Rule-based detection algorithms: Users specify a collection of data-cleaning rules and the tool finds any violations within the dataset
  • NADEEF
  • "Not null" constraints
  • Multi-attribute functional dependencies (FDs)
  • User-defined functions
• Pattern enforcement and transformation tools: Pattern enforcement tools discover both semantic and syntactic patterns in the data and use them to detect errors. Transformation tools can be used to change the data representation and expose additional patterns within the dataset.
  • Syntactic: OPENREFINE and TRIFACTA
  • Semantic: KATARA
• Quantitative error detection: These algorithms expose outliers and other statistical glitches within the data.
• Record linkage and de-duplication algorithms: De-duplication tools detect duplicate data records which refer to the same entity. Conflicting values between such records can then be found, further indicating errors.
  • TAMR
• Are these categories sufficient? Do they overlap?
  • The authors admit that their categorization does not perfectly partition errors
  • The authors are attempting to categorize detectable errors from real-world datasets

Data-Cleaning Challenges

• Synthetic data and errors: Most cleaning algorithms are evaluated on data, either synthetic or real-world, with synthetically injected errors
  • There is both a lack of real datasets with appropriate ground truth and a lack of a widely accepted benchmark of data-cleaning quality
• Combination of error types and tools: Real-world data often has multiple kinds of errors.
  • An error can often be found by more than one tool
  • Conflicting duplicate records and integrity constraints
  • Overlap?
• Human involvement: Enterprises require budgets for human effort; having an ideal ordering for the application of data-cleaning tools is key to minimizing human intervention.
  • Verify detected errors
  • Specify cleaning rules
  • Provide feedback for machine learning

Overview: Methodology

• Collection of real-world data with either full or partial ground truth
  • Represents the kinds of dirty data found in practice
  • The authors can judge performance directly because the ground truth is known
• Interested in automatic error discovery as opposed to automatic repair, because auto-repair is not practical.
• Report results in terms of precision and recall with respect to the ground truth.
• Upper-bound recall: an estimate of the maximum recall of a tool if it had been configured by an oracle
  • "Perfect configuration" of data-cleaning rules to enable optimal error detection
  • Use ground truth to estimate upper-bound recall: classify the remaining undetected errors by type
  • Any error whose type can be cleaned by a tool is counted towards that tool's upper-bound recall.

Evaluation Questions

• What is the precision and recall of each tool? How prevalent are the errors which each data-cleaning tool is designed to detect?
• How many errors in the datasets are detectable by applying all tools combined?
• Since human-in-the-loop is a well-accepted paradigm, how many false positives are there? These drain the human effort budget and can cause a cleaning effort to fail.
• Is there a strategy to minimize human effort by leveraging the interaction among tools?

Main Findings

• Conclusion 1: There is no single dominant tool because data-cleaning algorithms are generally tailored towards particular types of errors
  • A holistic "composite" strategy must be used because each data-cleaning tool is individually designed to detect selected kinds of errors
• Conclusion 2: Assessing the overlap of the errors detected by the different data-cleaning tools makes it possible to order their application so as to minimize false positives (and thus user engagement)
  • The ordering strategy must be specific to the dataset because of variance in structural properties and patterns
• Conclusion 3: The percentage of errors that can be found by the combined ordered application of all tools is significantly less than 100%.
  • Additional errors are discussed later in the experiments; researchers need to develop new ways of finding these "unknown categories" of data errors (ones which can be spotted by humans but not by the current cleaning tools)

Data Errors and Data Sets

• Data error: Given a dataset, a data error is a cell value which is different from its given ground truth
• Outlier errors: Cell values which deviate from the distribution of values in a column of a table.
• Duplicate errors: Distinct database entries/records which refer to the same real-world entity.
  • If the two entries' attribute values do not match, that could indicate an error.
• Rule violation errors: Cell values that violate any kind of integrity constraint
  • Not Null & Uniqueness constraints
• Pattern violations: Values that violate syntactic and semantic constraints
  • Alignment, formatting, misspelling, and semantic data types

Data Sets Overview

• The four error types are generally prevalent across all datasets
  • The Animal dataset does not have outliers
  • MIT VPF and BlackOak are the only datasets with duplicates
• The ratio of erroneous cells in each dataset ranges from 0.1% to 34%
• Structural properties:
  • Rayyan Bib has the highest percentage of errors while Animal has the lowest
  • Merck has the greatest number of attributes
  • BlackOak has the greatest number of entries

MIT-VPF

• MIT Office of the Vice President for Finance's (VPF) procurement database, which contains information about vendors and individuals that supply MIT with products and supplies
• Structural details:
  • Executing a purchase order: a new entry with details about the contracting party is added to the vendor master dataset whenever MIT buys a product
  • Identification information (name, address, phone number, business codes)
• The ongoing process of adding entries creates a distinctive problem of duplicates and other data errors (theory)
  • Inconsistent formatting (address, phone number, company names)
  • Contact information may change over time
• Ground truth: Employees of VPF manually curated a random sample of 13,603 records (half of the dataset) and marked erroneous fields (empirics)
  • Address and company names: missing street numbers, wrong capitalization, and attribute values in the wrong column

Merck

• The Merck dataset describes IT services and software systems within the company that are partially managed by third parties; it is used for optimization (downsizing of services)
• Structural details: Each system is characterized by location, number of end users, and level of technical support
  • Greatest number of attributes (68), but very sparse
• Ground truth: Merck provided the custom cleaning script they used to produce a cleaned version of the dataset
  • Applies various data transformations that normalize columns and allow for uniform value representation
• The authors used the script to formulate rules and transformations for the cleaning tools
  • The script contains many hidden function calls that implicitly change data values

Animal

• The Animal dataset was provided by scientists at UC Berkeley and concerns the effects of firewood cutting on small terrestrial vertebrates
• Structural properties:
  • Each entry contains information about the time and location of capture of an animal, in addition to its properties: tag number, sex, weight, species, and age group
  • Each record was first transcribed on paper and then manually entered into spreadsheets (data from 1993 to 2012)
• Ground truth: The scientists manually cleaned the dataset and identified several hundred erroneous cells
• Errors:
  • Shifted fields
  • Wrong numeric values

Rayyan-Bib

• Rayyan is a system built at QCRI to assist scientists in the production of systematic reviews
  • Systematic reviews are literature reviews which identify and synthesize all research evidence related to a specific research question
• Structural properties:
  • Users consolidate search results into long lists of references to studies which they feel are relevant to answering the question
  • Searching multiple databases using multiple queries
  • Users can manually manipulate citations, so the data is prone to error
  • Entries have many attributes: article_title, journal_title, journal_abbreviation, etc.
• Ground truth: The authors manually checked a sample of 1,000 references from Rayyan's database
  • Many missing values and inconsistencies in the data
  • journal_title and journal_abbreviation are often switched
  • Author names are sometimes found in journal_title

BlackOak

• BlackOak Analytics is a company which provides entity resolution solutions
• Structural properties: Provided an anonymized address dataset and a dirty version which they use for evaluation
  • Ground truth is given because it is a synthetic dataset
  • Errors are randomly distributed
• Errors:
  • Spelling of values
  • Formatting of values
  • Completeness
  • Field separation
• The authors included this dataset specifically to analyze the difference in error detection performance between real-world and synthetic datasets

Data Cleaning Tools

• Selected data-cleaning tools which together cover all four error types
  • Multiple tools sometimes focus on different subtypes of a given error type
• Iterative fine-tuning process for each tool
  • Compare detected errors with the ground truth in order to adjust the tool configuration or rules and improve performance
• Detectable errors are counted towards the recall upper bound

Strategy 1: Outlier Detection

• Detect data values which do not follow the statistical distribution of the overall data
• Tool 1: DBOOST
  • Distinctive feature: decomposes compound data types (e.g. dates) into their constituent pieces (month, year, day)
  • Attributes which are wrapped in more complex data can then be analyzed individually for outliers
  • Histograms create a distribution of the data without any a priori assumption by counting the occurrences of unique data values
  • Gaussian and GMM assume that each value was drawn from a normal distribution with a given mean and standard deviation, or from a multivariate Gaussian distribution, respectively.
• Optimal parameter configuration:
  • Number of bins & their widths for histograms
  • Mean & standard deviation for Gaussian and GMM
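A minimal sketch of the two simplest models on this slide, assuming a single column held in a Python list. The function names and the thresholds (`n_sigma`, `min_ratio`) are illustrative assumptions and not DBOOST's actual interface.

```python
from collections import Counter
from statistics import mean, pstdev


def gaussian_outliers(values, n_sigma=3.0):
    """Return indices of values more than n_sigma standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a constant column has no outliers under this model
    return [i for i, v in enumerate(values) if abs(v - mu) > n_sigma * sigma]


def histogram_outliers(values, min_ratio=0.05):
    """Return indices of values whose relative frequency is below min_ratio."""
    counts = Counter(values)
    total = len(values)
    return [i for i, v in enumerate(values) if counts[v] / total < min_ratio]
```

The Gaussian check suits numeric columns such as weight, while the histogram check also works on categorical columns such as sex or species, since it only counts occurrences.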

Strategy 2: Rule-based Error Detection

• Rely on data-quality rules, expressed as integrity constraints, to detect errors
  • Functional dependencies
  • Denial constraints
• Violation: a collection of cells that do not conform to a given integrity constraint
  • At least one cell involved in the violation must be changed to resolve it
• Tool 2: DC-Clean
  • Focuses on denial constraints (DCs)
  • The authors design a collection of DCs to capture the semantics of the data
  • "If there are two captures of the same animal indicated by the same tag number, then the first capture must be marked as original"
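The quoted rule for the Animal dataset can be checked with a few lines of code. The sketch below is only an illustration: the column names (`tag`, `date`, `status`) and the status code `"O"` for "original" are hypothetical, and it flags a single cell per violation rather than the full set of cells involved, which is a simplification of how a denial-constraint engine reports violations.

```python
from collections import defaultdict


def first_capture_violations(records):
    """Flag the status cell of the earliest capture per tag when it is not 'O'.

    records: list of dicts with hypothetical keys 'tag', 'date' (ISO string), 'status'.
    Returns a list of (row_index, column_name) cells that violate the rule.
    """
    by_tag = defaultdict(list)
    for idx, rec in enumerate(records):
        by_tag[rec["tag"]].append(idx)

    violations = []
    for idxs in by_tag.values():
        if len(idxs) < 2:
            continue  # the rule only constrains animals captured at least twice
        first = min(idxs, key=lambda i: records[i]["date"])
        if records[first]["status"] != "O":
            violations.append((first, "status"))
    return violations
```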

Strategy 3: Pattern-based Detection

• Tool 3: OPENREFINE: Open-source data wrangling tool
• Tool 4: TRIFACTA: Community version of a commercial data wrangling tool
  • OPENREFINE and TRIFACTA focus on syntactic patterns: they provide exploration techniques to discover data inconsistencies
• Tool 5: KATARA: Semantic pattern discovery and detection tool
  • Focuses on semantic patterns matched against a knowledge base
• ETL (Extract, Transform, Load) tools: pull data out of one database and place it in another
  • Tool 6: PENTAHO
  • Tool 7: KNIME

Tool 3: OPENREFINE

• OPENREFINE is an open-source wrangling tool that can digest data in multiple formats and facilitates data exploration
• Faceting operation: Lets users look at different kinds of aggregated data; resembles a grouping operation
  • The user specifies one column for faceting and OPENREFINE generates a widget that shows all distinct values & their number of occurrences
• Filtering operation:
  • The user can specify an expression on multiple columns and OPENREFINE generates the widget based on the values of the expression
  • The user can then select one or more values in the widget and OPENREFINE filters out the rows which do not contain the selected values
• Data cleaning uses an editing operation
  • Edits one cell at a time
  • If you edit a text facet, all cells consistent with that facet will update
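The faceting and filtering operations behave much like a group-by followed by a selection. The following sketch mimics them on a list of row dictionaries; the function names are made up for illustration and are not part of OPENREFINE.

```python
from collections import Counter


def text_facet(rows, column):
    """Count the distinct values of one column, most frequent first."""
    return Counter(row[column] for row in rows).most_common()


def filter_by_facet(rows, column, selected):
    """Keep only the rows whose value in `column` is among the selected facet values."""
    selected = set(selected)
    return [row for row in rows if row[column] in selected]
```

Rare facet values such as a differently capitalized state code tend to stand out in the counts, and selecting them narrows the table to the suspicious rows.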

Tool 4: TRIFACTA

• TRIFACTA is the commercial descendant of Data Wrangler: it predicts and applies various syntactic data transformations for data preparation and cleaning.
  • Can apply business & standardization rules through available transformation scripts
• Applies a frequency analysis to each column to identify the most and least frequent values
  • Shows attribute values that deviate strongly from the value distribution in the specific attribute
  • Maps each attribute to its most prominent data type and identifies values that do not match

Tool 5: KATARA

• KATARA relies on external knowledge bases, such as Yago, to detect & correct errors which violate a semantic pattern
  • Identifies the type of a column and the relationship between two columns in the dataset using a knowledge base
  • The type of column A in a table might correspond to Country in the knowledge base Yago, and the relationship between columns A and B might correspond to the predicate HasCapital in Yago
• Based on the discovered types and relationships, KATARA validates values using the knowledge base and human experts
  • Example: A value of "California" in column A will be marked as an error because it is not a country in Yago
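To make the Country / HasCapital example concrete, here is a toy sketch in which two small Python dictionaries stand in for the knowledge base. Their contents and the function name are assumptions for illustration only; the sketch also omits the human-expert step and, when a pair violates the relationship, simply blames column B even though either cell could be wrong.

```python
# Toy stand-in for a knowledge base such as Yago (contents are illustrative only).
KB_TYPES = {"Country": {"Italy", "France", "Qatar"}}
KB_RELATIONS = {"HasCapital": {("Italy", "Rome"), ("France", "Paris"), ("Qatar", "Doha")}}


def semantic_violations(rows, col_a, col_b, col_type, relation):
    """Return (row_index, column) cells that disagree with the toy knowledge base."""
    errors = []
    for i, row in enumerate(rows):
        if row[col_a] not in KB_TYPES[col_type]:
            errors.append((i, col_a))  # value is not of the column's discovered type
        elif (row[col_a], row[col_b]) not in KB_RELATIONS[relation]:
            errors.append((i, col_b))  # pair violates the discovered relationship
    return errors
```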

Tool 6: PENTAHO

• PENTAHO provides a graphical interface where data wrangling can be implemented as a directed graph of ETL (Extract, Transform, Load) operations
  • Any data manipulation or rule validation can be added as a node in the ETL pipeline
• Executes multiple ETL workflows to clean/curate data, BUT rules/procedures must be specified by the user
• Provides routines for string transformation and single-column constraint validation

Tool 7: KNIME

• KNIME focuses on workflow authoring and encapsulating data processing tasks, such as curation and machine-learning-based functionality, in composable nodes
• Although KNIME, like PENTAHO, executes multiple ETL workflows to clean/curate data, rules/procedures must be specified by the user
  • Users must know exactly what kinds of rules and patterns need to be verified
• Unlike OPENREFINE & TRIFACTA, PENTAHO and KNIME do not provide ways to automatically display outliers and detect type mismatches

Strategy 4: Duplicate Detection

• If two records refer to the same real-world entity but have differing attribute values, there is a strong chance that one of the two values is an error
• Tool 8: TAMR (commercial descendant of the Data Tamer system)
  • TAMR is a tool with industrial-strength data integration algorithms for record linkage and schema mapping
  • Premised on machine learning models that learn duplicate features
    • Expert sourcing
    • Similarity metrics
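Once a duplicate detector has grouped records into clusters, turning those clusters into candidate cell errors is straightforward. The sketch below assumes the clusters are already given as lists of row indices; how they are produced (the learned models, expert sourcing, and similarity metrics above) is outside its scope, and the function name is hypothetical.

```python
def conflicting_cells(records, clusters):
    """For each cluster of duplicate row indices, flag attributes whose values disagree.

    records:  list of dicts sharing the same keys.
    clusters: list of lists of row indices believed to describe the same entity.
    Returns (row_index, attribute) pairs; at least one cell per conflict is wrong.
    """
    flagged = []
    for cluster in clusters:
        for attr in records[cluster[0]].keys():
            values = {records[i][attr] for i in cluster}
            if len(values) > 1:
                flagged.extend((i, attr) for i in cluster)
    return flagged
```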

Combination of Multiple Tools

• Problem: How does a user properly combine multiple independent data-cleaning tools?
• Option 1: Run all tools and apply a union or min-k strategy
• Option 2: Have users manually check a sample of detected errors, which can be used to guide the prioritization of data-cleaning operations

Option 1: Union All and Min-k

• Union all
  • Takes the union of the errors emitted by all tools
• Min-k
  • Keeps the errors detected by at least k tools and excludes those detected by fewer than k tools
  • No need to keep cleaning the dataset with new techniques if the maximum performance for error detection has been reached
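Both strategies reduce to simple set operations. A minimal sketch, assuming each tool's output is available as a set of flagged cells keyed by a tool name (the names and data layout are illustrative):

```python
from collections import Counter


def union_all(detections):
    """detections: dict mapping tool name -> set of flagged cells."""
    return set().union(*detections.values())


def min_k(detections, k):
    """Cells flagged by at least k tools."""
    votes = Counter(cell for cells in detections.values() for cell in cells)
    return {cell for cell, n in votes.items() if n >= k}
```

With k = 1, min-k returns the same cells as union all; raising k trades recall for precision.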

Ordering Based on Precision

• Problems with Option 1 (exhaustive union)
  • Expensive because it requires massive amounts of human effort to validate a large number of candidate errors
  • BlackOak dataset: A user would have to verify the 982,000 cells identified as possibly erroneous to discover 382,928 actual errors.
  • Results from tools that detect errors poorly on this particular dataset should not be evaluated
• Alternative: a sampling-based method to select the order in which data-cleaning strategies are applied to the dataset

Ordering Based on Precision

• Cost model: Although the performance of a tool can be measured by both precision and recall in detecting errors, precision is the better proxy for a data-cleaning tool's error detection performance
  • Recall can only be computed if all of the errors in the data are known (full ground truth); this is nearly impossible when error detection strategies are run on new datasets
  • Precision is easy to estimate
• Assume C is the cost of having a human check a detected error and V is the value of identifying a real error
  • Value must be higher than cost!
  • P * V > (P + N) * C, where P is the number of correctly detected errors and N is the number of erroneously detected errors (false positives)
  • Dividing both sides by (P + N) * V gives P / (P + N) > C / V
  • Set threshold: σ = C / V
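The cost model translates directly into a small decision function. This is a sketch under the slide's assumptions (a constant cost C per check and a constant value V per real error); the function and parameter names are illustrative.

```python
def precision_threshold(cost_per_check, value_per_error):
    """sigma = C / V."""
    return cost_per_check / value_per_error


def worth_running(true_positives, false_positives, cost_per_check, value_per_error):
    """True when P * V > (P + N) * C, i.e. when precision exceeds sigma."""
    detected = true_positives + false_positives
    if detected == 0:
        return False  # a tool that flags nothing yields no value
    precision = true_positives / detected
    return precision > precision_threshold(cost_per_check, value_per_error)
```

For the BlackOak union mentioned earlier (382,928 real errors among 982,000 flagged cells, a precision of about 0.39), validating everything pays off only if a real error is worth more than roughly 2.6 times the cost of one check.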

Ordering Based on Precision

• Any tool with a precision below σ should not be run, because the cost of checking is greater than the value of accurately identifying a data error
• The ratio is domain dependent (unknown in most cases); it is natural to have large V values for highly valuable data
  • If V is very large, the threshold is correspondingly small and all data-cleaning tools will be considered, which boosts recall
  • If C is high and dominates the ratio, the threshold is large and only very precise tools are validated, which saves cost at the expense of recall
• The authors estimate the precision of a data-cleaning tool on a given dataset by checking a random sample of its detected errors.
• Why not run all the tools with a precision higher than the threshold and evaluate the union of all their detected error sets?
  • Tools are not independent, and their sets of detected errors may, and often do, overlap
  • Some tools may have an extremely high precision, but all of the errors they detect may be covered by tools that have even higher precision values

Ordering Based on Precision

• Maximum entropy-based order selection: Following the Maximum Entropy principle, the authors design an algorithm which assesses the estimated precision of each data-cleaning tool
  • Estimates overlap between tool results
  • Picking the tool with the highest precision (fraction of detected errors that are true errors) reduces entropy the most, because high entropy corresponds to uncertainty

Ordering Based on Precision: Algorithm

1. Run all data-cleaning tools on the entire dataset and collect their detected errors
2. Estimate the precision of each tool by having a human expert verify a random sample of its detected errors
3. Pick the tool with the highest estimated precision among all data-cleaning tools not yet considered, in order to maximize entropy, and verify those of its detected errors on the complete dataset which have not been verified before
4. Since errors validated in Step 3 may also have been detected by other tools, update the precision of the other tools and go back to Step 3 to pick the next tool, as long as a tool exists with an estimated precision > σ

Empirics: Regardless of each tool's individual performance, the proposed order reduces the cost of manual verification with only a marginal reduction of recall.
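The four steps can be sketched as a greedy loop. This is an illustration under stated assumptions rather than the authors' implementation: the human expert is replaced by an `is_error` callback (for example, a lookup in the ground truth), and Step 4 re-estimates a tool's precision from all of its cells verified so far, which is one plausible reading of "update the precision". All names and the default `sample_frac` are hypothetical.

```python
import random


def ordered_detection(detections, is_error, sigma, sample_frac=0.05, seed=0):
    """Greedy precision-ordered validation.

    detections: dict tool -> set of flagged cells (output of Step 1).
    is_error:   callable(cell) -> bool, standing in for the human expert.
    Returns (confirmed_errors, number_of_cells_checked).
    """
    rng = random.Random(seed)
    verified = {}  # cell -> bool, every cell the "expert" has looked at

    def check(cell):
        if cell not in verified:
            verified[cell] = is_error(cell)
        return verified[cell]

    # Step 2: estimate precision from a random sample of each tool's detections.
    estimate = {}
    for tool, cells in detections.items():
        if not cells:
            continue
        size = max(1, int(len(cells) * sample_frac))
        sample = rng.sample(sorted(cells), size)
        estimate[tool] = sum(check(c) for c in sample) / size

    remaining = set(estimate)
    while remaining:
        # Step 3: pick the most precise tool not yet considered.
        best = max(remaining, key=lambda t: estimate[t])
        if estimate[best] <= sigma:
            break  # no remaining tool is worth validating
        remaining.remove(best)
        for cell in detections[best]:
            check(cell)
        # Step 4: re-estimate the other tools from everything verified so far.
        for tool in remaining:
            seen = [c for c in detections[tool] if c in verified]
            if seen:
                estimate[tool] = sum(verified[c] for c in seen) / len(seen)

    confirmed = {c for c, ok in verified.items() if ok}
    return confirmed, len(verified)
```

The second return value is the number of cells a human would have had to inspect, which is the cost the ordering is meant to reduce compared with validating the full union.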

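Evaluation Metrics

• D: dataset
• G: purely cleaned dataset
• E = diff(G, D): the set of erroneous cells
• T(D): the set of cells marked as errors by tool T
• Precision: P = |T(D) ∩ E| / |T(D)|
• Recall: R = |T(D) ∩ E| / |E|
• Aggregated F: F = 2(R * P) / (R + P)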
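These metrics are easy to compute once T(D) and E are represented as sets of cells. A minimal sketch, assuming the usual set-based definitions (precision is the fraction of flagged cells that are real errors, recall the fraction of real errors that were flagged); the function name is illustrative.

```python
def evaluate(detected, errors):
    """detected: set T(D) of flagged cells; errors: set E of truly erroneous cells.

    Returns (precision, recall, f_measure), using 0.0 where a ratio is undefined.
    """
    true_positives = len(detected & errors)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(errors) if errors else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```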

Usage of Tools

• DBOOST: applied three algorithms (Gaussian, histogram, GMM), with the parameters that yield the highest F
• DC-Clean: existing rules + manually constructed FD rules based on obvious n-to-1 relationships
• OpenRefine: facet mechanism + formatting and single-column rules
• TRIFACTA: outlier detection and type verification + formatting and single-column rules
• KATARA: manually constructed & existing knowledge bases
• PENTAHO & KNIME: model each transformation and validation routine as a workflow node in the ETL process
• TAMR: iterate training until the precision and recall become stable

Data quality rules defined on each dataset

User Involvement

• Set rules
• Perform data exploration using OpenRefine and TRIFACTA
• Validate the detected errors
• Go through the remaining errors and try to categorize them

Individual Effectiveness

• DBOOST: useless for Animal, good for BlackOak
• DC-Clean: good for Animal and Merck, bad for MIT VPF
• OpenRefine: bad for Animal, top 2 for the others
• TRIFACTA: bad recall for Animal, top 2 for the others
• KATARA: good for BlackOak, bad for MIT VPF
• PENTAHO & KNIME: good in general
• TAMR: found all duplicates for MIT VPF and most of the duplicates for BlackOak

Tool Combination Effectiveness

• Union All: High recall but low precision (lots of FP)

Min-K

• Require at least k algorithms to agree on an error
• k = 1 is equivalent to union all
• As k increases, precision increases and recall decreases
• Main problem: how to pick k

Ordering based on Benefit and User Validation

• Randomly sample 5% of the detected errors for each tool and compare them with the ground truth for precision estimation.
• Run tools in precision order (dynamically update the precision estimates and drop tools that do not pass the threshold)
• Baseline: simple union
• Threshold: σ from 0.1 to 0.5 (for precision)
• As the threshold increases, precision increases and FPs decrease significantly, while TPs decrease only a little, so recall decreases only a little

Ordering Strategy results

Recall Upper-bound

• extra rules found by manually going through remaining errors

Domain Specific Tools

• For MIT VPF and BlackOak: ADDRESSCLEANER
• Applied to a sample of 1,000
• Found 2 & 13 new errors. Recall: 0.93-0.95; 0.999-0.999

Enrichment

• Manually add more attributes to the original dataset (only those that did not introduce additional duplicate rows)
• DC-Clean & TAMR

Conclusion

• There is no single dominant tool for the various datasets and diversified types of errors. Single tools achieved on average 47% precision and 36% recall, showing that a combination of tools is needed to cover all the errors.
• Picking the right order in applying the tools can improve the precision and help reduce the cost of validation by humans.
• Domain-specific tools can achieve high precision and recall compared to general-purpose tools, achieving on average 71% precision and 64% recall, but are limited to certain domains
• Rule-based systems and duplicate detection benefited from data enrichment. In our experiments, we achieved an improvement of up to 10% more precision and 7% more recall
