Detecting Data Errors: Where are we and what needs to be done?* Presentation By: Sitong Che and Srikar Pyda Written By: Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang {abedjan, ddong, raulcf, stonebraker}@csail.mit.edu {x4chu, ilyas}@uwaterloo.ca {mouzzani, ntang}@qf.org.qa [email protected]

Introduction

• A multitude of data-cleaning tools exist to detect and potentially repair errors
• It's better to think of data-cleaning solutions as being tailored to detecting particular categories of errors rather than detecting all potential errors
• Data cleaning is important for enterprises because data-centric approaches are becoming critical for innovation in business and science
• Different types of errors often exist in the same data-set
  • Cleaning requires multiple tools in order to detect and repair the variety of errors

Overview: Pragmatic Question

• Are these tools robust enough to capture most errors in real-world data-sets?
• What is the best strategy to holistically run multiple tools to optimize the detection effort?

Data Cleaning Solution Categories

• Rule-based detection algorithms: Users specify a collection of data-cleaning rules and the tool finds any violations within the data-set
  • NADEEF
  • "Not null" constraints
  • Multi-attribute functional dependencies (FDs)
  • User-defined functions
• Pattern enforcement and transformation tools: Pattern enforcement tools discover both semantic and syntactic patterns in the data and use them to detect errors. Transformation tools can be used to change the data representation and expose additional patterns within the data-set.
  • Syntactic: OPENREFINE and TRIFACTA
  • Semantic: KATARA
• Quantitative error detection: These algorithms expose outliers and other statistical glitches within the data.
• Record linkage and de-duplication algorithms: De-duplication tools detect duplicate data records which refer to the same entity. Conflicting values between the duplicates can then be found, further indicating errors.
  • TAMR
• Are these categories sufficient? Do they overlap?
  • The authors admit that their categorization does not perfectly partition errors
  • The authors are attempting to categorize detectable errors from real-world data-sets

Data-Cleaning Challenges

• Synthetic data and errors: Most cleaning algorithms are evaluated on data, either synthetic or real-world, with synthetically injected errors
  • There is both a lack of real data-sets with appropriate ground truth and a lack of a widely accepted benchmark of data-cleaning quality
• Combination of error types and tools: Real-world data often has multiple kinds of errors.
  • An error can often be found by more than one tool
  • Conflicting duplicate records and integrity constraints
  • Overlap??
• Human involvement: Enterprises must budget for human effort, so an ideal ordering for the application of data-cleaning tools is key to minimizing human intervention.
  • Verify detected errors
  • Specify cleaning rules
  • Provide feedback for machine learning

Overview: Methodology

• Collection of real-world data with either full or partial ground truth
  • Represents the kinds of dirty data found in practice
  • The authors can judge performance directly because the ground truth is known
• Interested in automatic error discovery as opposed to automatic repair, because auto-repair is rarely practical
• Results are reported as precision and recall with respect to the ground truth.
• Upper-bound recall: an estimate of the maximum recall of a tool if it had been configured by an oracle
  • "Perfect configuration" of data-cleaning rules to enable optimal error detection
  • Use the ground truth to estimate upper-bound recall: classify the remaining undetected errors by type
  • Any error whose type can be cleaned by a tool should be counted towards its recall.
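
The upper-bound recall idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the error-type labels, and the cell identifiers are hypothetical, and the authors' actual classification was done manually.

```python
def upper_bound_recall(error_types, detected, detectable_types):
    """Estimate the recall a tool could reach with an oracle configuration.

    error_types      -- dict mapping each ground-truth error cell to a type label
    detected         -- set of cells the tool actually flagged
    detectable_types -- type labels the tool could catch in principle (assumption)
    """
    if not error_types:
        return 0.0
    # An error counts if the tool found it, or if its type is one the tool
    # could have found under a perfect configuration.
    covered = {
        cell for cell, kind in error_types.items()
        if cell in detected or kind in detectable_types
    }
    return len(covered) / len(error_types)


errors = {
    (1, "weight"): "outlier",
    (2, "zip"): "pattern",
    (3, "name"): "duplicate",
    (4, "city"): "unknown",
}
print(upper_bound_recall(errors, {(1, "weight")}, {"outlier", "pattern"}))
```

Here the tool found one of four errors (recall 0.25), but a second error has a type the tool could handle, so the upper bound is 0.5.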

Evaluation Questions

• What is the precision and recall of each tool? How prevalent are the errors which each data-cleaning tool is designed to detect?
• How many errors in the data-sets are detectable by applying all tools combined?
• Since human-in-the-loop is a well-accepted paradigm, how many false positives are there? These drain the human-effort budget and can cause a cleaning effort to fail.
• Is there a strategy to minimize human effort by leveraging the interaction among tools?

Main Findings

• Conclusion 1: There is no single dominant tool because data-cleaning algorithms are generally tailored towards particular types of errors
  • A holistic "composite" strategy must be used because each data-cleaning tool is individually designed to detect selected kinds of errors
• Conclusion 2: Assessing the overlap of the errors detected by the different data-cleaning tools makes it possible to order their application so as to minimize false positives (and hence user engagement)
  • The ordering strategy must be specific to the data-set because of variance in structural properties and patterns
• Conclusion 3: The percentage of errors that can be found by the combined ordered application of all tools is significantly less than 100%.
  • Additional errors are discussed later in the experiments. Researchers need to develop new ways of finding these "unknown categories" of data errors (ones which can be spotted by humans but not by the current cleaning tools)

Data Errors and Data Sets

• Data error: Given a data-set, a data error is a cell value which is different from its given ground truth
• Outlier errors: Cell values which deviate from the distribution of values in a column of a table.
• Duplicate errors: Distinct database entries/records which refer to the same real-world entity.
  • If the two entries' attribute values do not match, that could indicate an error.
• Rule violation errors: Cell values that violate any kind of integrity constraint
  • Not-null and uniqueness constraints
• Pattern violations: Values that violate syntactic and semantic constraints
  • Alignment, formatting, misspelling, and semantic data types

Data Sets Overview

• The four error types are generally prevalent across all data-sets
  • The Animal data-set does not have outliers
  • MIT VPF and BlackOak are the only data-sets with duplicates
• The ratio of erroneous cells in each data-set ranges from 0.1% to 34%
• Structural properties:
  • Rayyan Bib has the highest percentage of errors while Animal has the lowest
  • Merck has the greatest number of attributes
  • BlackOak has the greatest number of entries

MIT-VPF

• The MIT Office of the Vice President for Finance's (VPF) procurement database, which contains information about vendors and individuals that supply MIT with products and supplies
• Structural details:
  • Execute purchase order: whenever MIT buys a product, a new entry with details about the contracting party is added to the vendor master data-set
  • Identification information (name, address, phone number, business codes)
• The ongoing process of adding entries creates a unique problem of duplicates and other data errors (theory)
  • Inconsistent formatting (address, phone number, company names)
  • Contact information may change over time
• Ground truth: Employees of VPF manually curated a random sample of 13,603 records (half of the data-set) and marked erroneous fields (empirics)
  • Addresses and company names: missing street numbers, wrong capitalization, and attribute values in the wrong column

Merck

• The Merck data-set describes IT services and software systems within the company that are partially managed by third parties; it is used to optimize the downsizing of services
• Structural details: Each system is characterized by location, number of end users, and level of technical support
  • Greatest number of attributes (68), but very sparse
• Ground truth: Merck provided the custom cleaning script they used to produce a cleaned version of the data-set
  • Applies various data transformations that normalize columns and allow for uniform value representation
• The authors used the script to formulate rules and transformations for the cleaning tools
  • Many functions are called implicitly and change data values

Animal

• The Animal data-set was provided by scientists at UC Berkeley studying the effects of firewood cutting on small terrestrial vertebrates
• Structural properties:
  • Each entry contains information about the time and location of the capture of an animal, in addition to its properties: tag number, sex, weight, species, and age group
  • Each record was first transcribed on paper and then manually entered into spreadsheets (data from 1993 to 2012)
• Ground truth: The scientists manually cleaned the data-set and identified several hundred erroneous cells
• Errors:
  • Shifted fields
  • Wrong numeric values

Rayyan-Bib

• Rayyan is a system built at QCRI to assist scientists in the production of systematic reviews
  • Literature reviews which identify and synthesize all research evidence related to a nuanced research question
• Structural properties:
  • Users consolidate search results into long lists of references to studies which they feel are relevant to answering the question
  • Searching multiple databases using multiple queries
  • Users can manually manipulate citations, so the data is prone to error
  • Entries have many attributes: article_title, journal_title, journal_abbreviation, etc.
• Ground truth: The authors manually checked a sample of 1,000 references from Rayyan's database
  • Many missing values and inconsistencies in the data
  • journal_title and journal_abbreviation are often switched
  • Author names are sometimes found in journal_title

BlackOak

• BlackOak Analytics is a company which provides entity resolution solutions
• Structural properties: They provided an anonymized address data-set and a dirty version of it which they use for evaluation
  • Ground truth is given because it is a synthetic data-set
  • Errors are randomly distributed
• Errors:
  • Spelling of values
  • Formatting of values
  • Completeness
  • Field separation
• The authors included this data-set specifically to analyze the difference in error detection performance between real-world and synthetic data-sets

Data Cleaning Tools

• Selected data-cleaning tools which cover all four error types
• Multiple tools sometimes focus on different subtypes of a given error type
• Iterative fine-tuning process for each tool
  • Compare detected errors with the ground truth and adjust the tool configuration or rules to improve performance
• Detectable errors are counted towards the recall upper bound

Strategy 1: Outlier Detection

• Detect data values which do not follow the statistical distribution of the overall data
• Tool 1: DBOOST
  • Unique: Decomposes compound data types (e.g. dates) into their constituent pieces (month, day, year)
  • Attributes which are wrapped in more complex data can then be analyzed individually for outliers
  • Histograms create a distribution of the data without any a priori assumption by counting the occurrences of unique data values
  • Gaussian and GMM assume that each value was drawn from a normal distribution with a given mean and standard deviation, or from a multivariate Gaussian distribution, respectively.
• Optimal parameter configuration:
  • Number of bins and their widths for histograms
  • Mean and standard deviation for Gaussian and GMM
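
To make the histogram and Gaussian ideas concrete, here is a minimal sketch using only the Python standard library. It is not DBOOST's implementation or API: the function names, the `n_sigma` cut-off, and the `min_share` cut-off are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def gaussian_outliers(values, n_sigma=3.0):
    """Indices of numeric values more than n_sigma standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # a constant column has no outliers under this model
    return [i for i, v in enumerate(values) if abs(v - mu) > n_sigma * sigma]


def histogram_outliers(values, min_share=0.05):
    """Indices of values whose distinct-value frequency falls below min_share."""
    counts = Counter(values)
    total = len(values)
    return [i for i, v in enumerate(values) if counts[v] / total < min_share]


weights = [10, 11, 9, 10, 12, 10, 11, 9, 10, 500]
sexes = ["M", "F"] * 10 + ["Mm"]
print(gaussian_outliers(weights, n_sigma=2.0))  # flags the 500 entry
print(histogram_outliers(sexes))                # flags the rare "Mm" entry
```

The histogram variant needs no distributional assumption, which is why it also works on categorical columns; the Gaussian variant only makes sense for numeric ones.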

Strategy 2: Rule-based Error Detection

• Rely on data-quality rules, expressed as integrity constraints, to detect errors
  • Functional dependencies
  • Denial constraints
• Violation: A collection of cells that do not conform to a given integrity constraint
  • At least one cell involved in the violation must be changed to resolve it
• Tool 2: DC-Clean
  • Focuses on denial constraints (DCs)
  • The authors design a collection of DCs to capture the semantics of the data
  • Example: "if there are two captures of the same animal indicated by the same tag number, then the first capture must be marked as original"
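
As an illustration of rule-based detection, the sketch below checks a single functional dependency (left-hand-side columns determine one right-hand-side column). It is a simplified stand-in, not DC-Clean; the column names `zip` and `city` are hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return indices of rows that take part in a violation of the FD lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[tuple(row[a] for a in lhs)].add(row[rhs])
    # A key violates the FD when it maps to more than one right-hand-side value.
    bad_keys = {key for key, values in seen.items() if len(values) > 1}
    return [
        i for i, row in enumerate(rows)
        if tuple(row[a] for a in lhs) in bad_keys
    ]


rows = [
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Boston"},
    {"zip": "10001", "city": "New York"},
]
print(fd_violations(rows, ["zip"], "city"))
```

Both rows sharing the ZIP code are reported, because the rule alone cannot tell which of the conflicting cells is wrong; that matches the point that at least one cell in a violation must change.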

Strategy 3: Pattern-based Detection

• Tool 3: OPENREFINE: Open-source data wrangling tool
• Tool 4: TRIFACTA: Community version of a commercial data wrangling tool
  • OPENREFINE and TRIFACTA focus on syntactic patterns: they provide exploration techniques to discover data inconsistencies
• Tool 5: KATARA: Semantic pattern discovery and detection tool
  • Focuses on semantic patterns matched against a knowledge base
• ETL (Extract, Transform, Load) tools: pull data out of one database and place it in another
  • Tool 6: PENTAHO
  • Tool 7: KNIME

Tool 3: OPENREFINE

• OPENREFINE is an open-source wrangling tool that can digest data in multiple formats and facilitates data exploration
• Faceting operation: Lets users look at different kinds of aggregated data; resembles a grouping operation
  • The user specifies one column for faceting and OPENREFINE generates a widget that shows all distinct values and their number of occurrences
• Filtering operation:
  • The user can specify an expression on multiple columns and OPENREFINE generates the widget based on the values of the expression
  • The user can then select one or more values in the widget and OPENREFINE filters out the rows which do not contain the selected values
• Data cleaning uses an editing operation
  • Edits one cell at a time
  • If you edit a text facet, all cells consistent with that facet will update
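
The faceting and filtering operations described above behave like a group-and-count followed by a selection. The sketch below mimics that idea in plain Python; it does not use OPENREFINE's own expression language or API, and the `vendor` column is a made-up example.

```python
from collections import Counter


def text_facet(rows, column):
    """Distinct values of a column with their occurrence counts, most frequent first."""
    return Counter(row[column] for row in rows).most_common()


def filter_by_facet(rows, column, selected):
    """Keep only the rows whose value in `column` is one of the selected facet values."""
    return [row for row in rows if row[column] in selected]


rows = [{"vendor": "MIT"}, {"vendor": "M.I.T."}, {"vendor": "MIT"}]
print(text_facet(rows, "vendor"))
print(filter_by_facet(rows, "vendor", {"M.I.T."}))
```

Rare facet values such as `M.I.T.` are the ones a user would typically inspect and then edit in bulk.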

Tool 4: TRIFACTA

• TRIFACTA is the commercial descendant of Data Wrangler: it predicts and applies various syntactic data transformations for data preparation and cleaning.
  • Can apply business and standardization rules through available transformation scripts
• Applies a frequency analysis to each column to identify the most and least frequent values
  • Shows attribute values that deviate strongly from the value distribution in the specific attribute
  • Maps each attribute to its most prominent data type and identifies values that do not match
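
The last point (map a column to its most prominent data type and flag mismatches) can be sketched as follows. The three-way type split and the function names are assumptions for illustration, not TRIFACTA's actual type system.

```python
from collections import Counter


def infer_type(value):
    """Classify a raw string as 'int', 'float', or 'string' (a deliberately coarse split)."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    return "string"


def type_mismatches(values):
    """Return (dominant type, indices of values that do not match it)."""
    if not values:
        return None, []
    types = [infer_type(v) for v in values]
    dominant = Counter(types).most_common(1)[0][0]
    return dominant, [i for i, t in enumerate(types) if t != dominant]


print(type_mismatches(["12", "15", "n/a", "9"]))
```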

Tool 5: KATARA

• KATARA relies on external knowledge bases, such as Yago, to detect and correct errors which violate a semantic pattern
• Identifies the type of a column and the relationship between two columns in the data-set using a knowledge base
  • The type of column A in a table might correspond to Country in the knowledge base Yago, and the relationship between columns A and B might correspond to the predicate HasCapital in Yago
• Based on the discovered types and relationships, KATARA validates values using the knowledge base and human experts
  • Example: A value of "California" in column A will be marked as an error because it is not a country in Yago
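
A toy version of this semantic check is sketched below. The tiny in-memory "knowledge base" (`COUNTRIES`, `HAS_CAPITAL`) is entirely hypothetical and stands in for Yago; KATARA's real pattern discovery and its use of human experts are not modeled.

```python
# Hypothetical miniature knowledge base standing in for Yago.
COUNTRIES = {"Italy", "France", "Qatar"}
HAS_CAPITAL = {"Italy": "Rome", "France": "Paris", "Qatar": "Doha"}


def semantic_violations(rows):
    """Flag (row index, reason) for (A, B) pairs that break the Country/HasCapital pattern."""
    flagged = []
    for i, (a, b) in enumerate(rows):
        if a not in COUNTRIES:
            flagged.append((i, "type"))          # column A is not a known country
        elif HAS_CAPITAL[a] != b:
            flagged.append((i, "relationship"))  # (A, B) is not a HasCapital fact
    return flagged


rows = [("Italy", "Rome"), ("California", "Sacramento"), ("France", "Lyon")]
print(semantic_violations(rows))
```

A real knowledge base is incomplete, so a flagged value may simply be missing from it; that is one reason human validation is part of the process.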

Tool 6: PENTAHO

• PENTAHO provides a graphical interface where data wrangling can be implemented as a directed graph of ETL (Extract, Transform, Load) operations
• Any data manipulation or rule validation can be added as a node in the ETL pipeline
• Executes multiple ETL workflows to clean/curate data, but the rules/procedures must be specified by the user
• Provides routines for string transformation and single-column constraint validation

Tool 7: KNIME

• KNIME focuses on workflow authoring and on encapsulating data-processing tasks, such as curation and machine-learning-based functionality, in composable nodes
• Like PENTAHO, KNIME executes multiple ETL workflows to clean/curate data, but the rules/procedures must be specified by the user
  • Users must know exactly what kinds of rules and patterns need to be verified
• Unlike OPENREFINE and TRIFACTA, PENTAHO and KNIME do not provide ways to automatically display outliers and detect type mismatches

Strategy 4: Duplicate Detection

• If two records refer to the same real-world entity but have differing attribute values, there is a strong chance that one of the two values is an error
• Tool 8: TAMR (commercial descendant of the Data Tamer system)
  • TAMR is a tool with industrial-strength data integration algorithms for record linkage and schema mapping
  • Premised on machine learning models that learn duplicate features through:
    • Expert sourcing
    • Similarity metrics
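
The sketch below shows the similarity-metric part of the idea only: pair up records whose names are nearly identical, then report the attributes on which the presumed duplicates disagree. It swaps TAMR's learned models for a fixed string-similarity threshold from Python's `difflib`; the field names and the 0.9 threshold are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def conflicting_duplicates(records, key, threshold=0.9):
    """Return (i, j, conflicting fields) for record pairs that look like duplicates."""
    found = []
    for (i, r), (j, s) in combinations(enumerate(records), 2):
        if similarity(r[key], s[key]) >= threshold:
            conflicts = sorted(f for f in r if f != key and r[f] != s.get(f))
            found.append((i, j, conflicts))
    return found


records = [
    {"name": "Acme Corp", "phone": "617-555-0100"},
    {"name": "ACME Corp.", "phone": "617-555-0199"},
    {"name": "Zenith Labs", "phone": "212-555-0101"},
]
print(conflicting_duplicates(records, "name"))
```

Comparing every pair is quadratic in the number of records, so this is only workable for small samples; production systems block or index candidates first.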

Combination of Multiple Tools

• Problem: How does a user properly combine multiple independent data-cleaning tools?
  • Option 1: Run all tools and apply a union or min-k strategy
  • Option 2: Have users manually check a sample of detected errors, which can be used to guide the prioritization of data-cleaning operations

Option 1: Union All and Min-k

• Union all
  • Takes the union of the errors emitted by all tools
• Min-k
  • Keeps the errors detected by at least k tools and excludes those detected by fewer than k tools
  • No need to keep cleaning the data-set with new techniques if the maximum performance for error detection has been reached
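
Both strategies reduce to counting how many tools flagged each cell. A minimal sketch, assuming each tool's output is a set of cell identifiers (the tool names and cells below are made up):

```python
from collections import Counter


def min_k(tool_outputs, k):
    """Cells flagged by at least k of the tools."""
    votes = Counter(cell for cells in tool_outputs.values() for cell in set(cells))
    return {cell for cell, n in votes.items() if n >= k}


def union_all(tool_outputs):
    """Cells flagged by any tool; identical to min_k with k = 1."""
    return min_k(tool_outputs, 1)


outputs = {
    "outliers": {(1, "weight"), (2, "weight")},
    "rules": {(2, "weight"), (3, "zip")},
    "patterns": {(2, "weight")},
}
print(union_all(outputs))
print(min_k(outputs, 2))
```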

Ordering Based on Precision

• Problems with Option 1 (exhaustive union)
  • Expensive because it requires massive amounts of human effort to validate a large number of candidate errors
  • BlackOak data-set: A user would have to verify the 982,000 cells identified as possibly erroneous to discover 382,928 actual errors.
  • Results from tools that perform poorly at error detection on a particular data-set should not be evaluated
• Alternative: A sampling-based method to select the order in which data-cleaning strategies are applied to the data-set

Ordering Based on Precision

• Cost model: Although the performance of a tool can be measured by both precision and recall in detecting errors, precision is the better proxy for a data-cleansing tool's error detection performance
  • Recall can only be computed if all of the errors in the data are known (full ground truth), which is nearly impossible when error detection strategies are run on new data-sets
  • Precision is easy to estimate
• Assume C is the cost of having a human check a detected error and V is the value of identifying a real error
  • The value must be higher than the cost!
  • P * V > (P + N) * C, where P is the number of correctly detected errors and N is the number of erroneously detected errors (false positives)
  • Equivalently, P / (P + N) > C / V
  • Set the threshold: σ = C / V
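
The inequality above translates directly into a go/no-go check for a tool. The sketch below uses the symbols from the slide (P, N, C, V); the function names and the example numbers are illustrative assumptions.

```python
def threshold(cost, value):
    """Precision threshold sigma = C / V."""
    return cost / value


def worth_validating(true_pos, false_pos, cost, value):
    """True when P / (P + N) > C / V, i.e. when P * V > (P + N) * C."""
    detected = true_pos + false_pos
    if detected == 0:
        return False  # nothing to validate, so no value to gain
    return true_pos / detected > threshold(cost, value)


# With C = 1 and V = 5 the threshold is 0.2.
print(worth_validating(30, 70, cost=1, value=5))  # precision 0.3
print(worth_validating(10, 90, cost=1, value=5))  # precision 0.1
```

For the BlackOak union quoted earlier (382,928 real errors among 982,000 flagged cells) the precision is roughly 0.39, so validating the full union only pays off when C / V is below that.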

Ordering Based on Precision

• Any tool with a precision below σ should not be run because the cost of checking is greater than the value of accurately identifying a data error
• The ratio is domain dependent (and unknown in most cases); it is natural to have large V values for highly valuable data
  • If V is very large, all data-cleansing tools will be considered because the threshold is correspondingly small, which boosts recall
  • If C is high and dominates the ratio, only very precise tools pass the threshold, which saves validation cost at the expense of recall
• The authors estimate the precision of each data-cleansing tool on a given data-set by checking a random sample of its detected errors.
• Why not run all the tools with a precision higher than the threshold and evaluate the union of all their detected error sets?
  • Tools are not independent, and their sets of detected errors may, and often do, overlap
  • Some tools may have an extremely high precision, but all of the errors they detect may be covered by tools with even higher precision

Ordering Based on Precision

• Maximum entropy-based order selection: Following the Maximum Entropy principle, the authors design an algorithm which uses the estimated precision of each data-cleansing tool to choose the order
  • Estimates the overlap between tool results
  • Picking the tool with the highest precision (the share of its detections which are true errors) reduces entropy the most, because high entropy corresponds to uncertainty

Ordering Based on Precision: Algorithm

1. Run all data-cleaning tools on the entire data-set and collect their detected errors
2. Estimate the precision of each tool by having a human expert verify a random sample of its detected errors
3. Pick the tool with the highest estimated precision among all data-cleaning tools not yet considered (the choice that reduces entropy the most) and verify those of its detected errors on the complete data-set which have not been verified before
4. Since errors validated in Step 3 may also have been detected by other tools, update the estimated precision of the other tools and go to Step 3 to pick the next tool, as long as a tool exists with an estimated precision > σ

Empirics: Regardless of each tool's individual performance, the proposed order reduces the cost of manual verification with only a marginal reduction of recall.
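
The four steps can be sketched as a greedy loop. This is a simplified reading, not the authors' code: `is_error` is a hypothetical callback standing in for the human expert, tool outputs are assumed to be sets of cell identifiers, and the re-estimation rule in Step 4 (recompute precision over the sampled cells that earlier tools did not already cover) is an assumption made for this example.

```python
import random


def ordered_validation(tool_outputs, is_error, sigma, sample_rate=0.05, seed=0):
    """Greedy precision-ordered validation. Returns (order, confirmed, validated)."""
    rng = random.Random(seed)
    # Step 2: a human labels a random sample of each tool's detections.
    samples = {}
    for tool, cells in tool_outputs.items():
        pool = sorted(cells)
        size = min(len(pool), max(1, round(sample_rate * len(pool))))
        samples[tool] = {cell: is_error(cell) for cell in rng.sample(pool, size)}

    def estimate(tool, validated):
        # Precision over sampled cells not already covered by earlier tools.
        labels = [ok for cell, ok in samples[tool].items() if cell not in validated]
        return sum(labels) / len(labels) if labels else 0.0

    order, validated, confirmed = [], set(), set()
    remaining = set(tool_outputs)
    while remaining:
        # Step 3: pick the most precise tool not yet considered.
        best = max(sorted(remaining), key=lambda t: estimate(t, validated))
        if estimate(best, validated) <= sigma:
            break  # no remaining tool clears the threshold
        remaining.remove(best)
        order.append(best)
        # Validate only cells nobody has validated yet; Step 4's update happens
        # implicitly because estimate() is recomputed on the next iteration.
        for cell in tool_outputs[best] - validated:
            validated.add(cell)
            if is_error(cell):
                confirmed.add(cell)
    return order, confirmed, validated


truth = {1, 2, 3, 4}
tools = {"A": {1, 2, 3, 9}, "B": {1, 2, 10, 11, 12}, "C": {4, 13}}
order, confirmed, validated = ordered_validation(
    tools, lambda cell: cell in truth, sigma=0.3, sample_rate=1.0
)
print(order, sorted(confirmed), len(validated))
```

In this toy run tool B starts with precision 0.4, but once A has been validated everything B adds is a false positive, so B is dropped; six cells are checked instead of the nine in the union and all four errors are still found.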

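Evaluation Metrics

• D: the dirty data-set
• G: the fully cleaned data-set (ground truth)
• E: the set of erroneous cells, E = diff(G, D)
• T(D): the set of cells marked as errors by tool T
• Precision: P = |T(D) ∩ E| / |T(D)|
• Recall: R = |T(D) ∩ E| / |E|
• Aggregated F-measure: F = 2(R * P) / (R + P)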
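
These definitions map onto a few set operations. The sketch below assumes both data-sets are represented as dictionaries from a cell identifier to its value, which is a representation chosen for this example only.

```python
def diff(clean, dirty):
    """E: the cells of the dirty data-set whose value differs from the ground truth."""
    return {cell for cell, value in dirty.items() if clean.get(cell) != value}


def precision_recall_f(detected, errors):
    """Precision, recall and F-measure of a detected cell set T(D) against E."""
    true_pos = len(detected & errors)
    p = true_pos / len(detected) if detected else 0.0
    r = true_pos / len(errors) if errors else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


dirty = {(0, "city"): "Bostn", (0, "zip"): "02139", (1, "city"): "Doha", (1, "zip"): ""}
clean = {(0, "city"): "Boston", (0, "zip"): "02139", (1, "city"): "Doha", (1, "zip"): "00000"}
errors = diff(clean, dirty)
print(precision_recall_f({(0, "city"), (0, "zip")}, errors))
```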

Usage of Tools

• DBOOST: applied three algorithms (Gaussian, histogram, GMM), with the parameters that maximize F
• DC-Clean: existing rules + manually constructed FD rules based on obvious n-to-1 relationships
• OPENREFINE: facet mechanism + formatting and single-column rules
• TRIFACTA: outlier detection and type verification + formatting and single-column rules
• KATARA: manually constructed and existing knowledge bases
• PENTAHO & KNIME: model each transformation and validation routine as a workflow node in the ETL process
• TAMR: iterate training until the precision and recall become stable

Data quality rules defined on each data-set

User Involvement

• Set rules
• Perform data exploration using OPENREFINE and TRIFACTA
• Validate the detected errors
• Go through the remaining errors and try to categorize them

Individual Effectiveness

• DBOOST: useless for Animal, good for BlackOak
• DC-Clean: good for Animal and Merck, bad for MIT VPF
• OPENREFINE: bad for Animal, top 2 for the others
• TRIFACTA: bad recall for Animal, top 2 for the others
• KATARA: good for BlackOak, bad for MIT VPF
• PENTAHO & KNIME: good in general
• TAMR: found all duplicates for MIT VPF and most of the duplicates for BlackOak

Tool Combination Effectiveness

• Union All: High recall but low precision (lots of FP)

Min-K

• Requires at least k tools to agree on an error
• k = 1 is equivalent to union all
• As k increases, precision increases and recall decreases
• Main problem: how to pick k

Ordering Based on Benefit and User Validation

• Randomly sample 5% of the detected errors of each tool and compare them with the ground truth to estimate precision.
• Run the tools in precision order (dynamically updating the precision estimates and dropping tools that do not pass the threshold)
• Baseline: simple union
• Threshold: σ from 0.1 to 0.5 (for precision)
• As the threshold increases, precision increases: false positives decrease significantly while true positives decrease only a little, so recall drops only slightly

Ordering Strategy results

Recall Upper-bound

• extra rules found by manually going through remaining errors

Domain Specific Tools

• For MIT VPF and BlackOak: ADDRESSCLEANER
• Applied to a sample of 1,000 records
• Found 2 and 13 new errors. Recall: 0.93-0.95; 0.999-0.999

Enrichment

• Manually add more attributes to the original data-set (only those that did not introduce additional duplicate rows)
• DC-Clean & TAMR

Conclusion

• There is no single dominant tool for the various data-sets and diversified types of errors. Single tools achieved on average 47% precision and 36% recall, showing that a combination of tools is needed to cover all the errors.
• Picking the right order in applying the tools can improve the precision and help reduce the cost of validation by humans.
• Domain-specific tools can achieve high precision and recall compared to general-purpose tools, achieving on average 71% precision and 64% recall, but are limited to certain domains
• Rule-based systems and duplicate detection benefited from data enrichment. In our experiments, we achieved an improvement of up to 10% more precision and 7% more recall