
Efficient Algorithms for Mining Outliers from Large Data Sets

Sridhar Ramaswamy
Epiphany Inc., Palo Alto, CA 94403
[email protected]

Rajeev Rastogi
Bell Laboratories, Murray Hill, NJ
[email protected]

Kyuseok Shim
KAIST and AITrc
Taejon, KOREA
[email protected]

Abstract

In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its k-th nearest neighbor. We rank each point on the basis of its distance to its k-th nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.

1 Introduction

Knowledge discovery in databases, commonly referred to as data mining, is generating enormous interest in both the research and software arenas. However, much of this recent work has focused on finding "large patterns." By the phrase "large patterns", we mean characteristics of the input data that are exhibited by a (typically user-defined) significant portion of the data. Examples of these large patterns include association rules [AMS+95], classification [RS98] and clustering [ZRL96, NH94, EKX95, GRS98].

In this paper, we focus on the converse problem of finding "small patterns" or outliers. An outlier in a set of data is an observation or a point that is considerably dissimilar or inconsistent with the remainder of the data. From the

(The work was done while the first author was with Bell Laboratories. KAIST: Korea Advanced Institute of Science and Technology. AITrc: Advanced Information Technology Research Center at KAIST.)

above description of outliers, it may seem that outliers are a nuisance, impeding the inference process, and must be quickly identified and eliminated so that they do not interfere with the data analysis. However, this viewpoint is often too narrow, since outliers contain useful information. Mining for outliers has a number of useful applications in telecom and credit card fraud, loan approval, pharmaceutical research, weather prediction, financial applications, marketing and customer segmentation.

For instance, consider the problem of detecting credit card fraud. A major problem that credit card companies face is the illegal use of lost or stolen credit cards. Detecting and preventing such use is critical since credit card companies assume liability for unauthorized expenses on lost or stolen cards. Since the usage pattern for a stolen card is unlikely to be similar to its usage prior to being stolen, the new usage points are probably outliers (in an intuitive sense) with respect to the old usage pattern. Detecting these outliers is clearly an important task.

The problem of detecting outliers has been extensively studied in the statistics community (see [BL94] for a good survey of statistical techniques). Typically, the user has to model the data points using a statistical distribution, and points are determined to be outliers depending on how they appear in relation to the postulated model. The main problem with these approaches is that in a number of situations, the user might simply not have enough knowledge about the underlying data distribution. In order to overcome this problem, Knorr and Ng [KN98] propose the following distance-based definition for outliers that is both simple and intuitive: A point p in a data set is an outlier with respect to parameters k and d if no more than k points in the data set are at a distance of d or less from p. The distance function can be any metric distance function.
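For concreteness, here is a minimal sketch of this distance-based outlier test in Python (our own illustration, not code from [KN98]; the function name and the brute-force scan over an in-memory list of coordinate tuples are assumptions):

    import math

    def is_distance_based_outlier(data, p, k, d):
        # p is an outlier w.r.t. parameters k and d if no more than k
        # other points in the data set lie within distance d of p.
        within_d = sum(1 for q in data if q != p and math.dist(p, q) <= d)
        return within_d <= k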

The main benefit of the approach in [KN98] is that it does not require any a priori knowledge of data distributions that the statistical methods do. Additionally, the definition of outliers considered is general enough to model statistical

(The precise definition used in [KN98] is slightly different from, but equivalent to, this definition.)

(The algorithms proposed assume that the distance between two points is the Euclidean distance between the points.)


Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. MOD 2000, Dallas, TX USA © ACM 2000 1-58113-218-2/00/05 . . . $5.00


outlier tests for normal, Poisson and other distributions. The authors go on to propose a number of efficient algorithms for finding distance-based outliers. One algorithm is a block nested-loop algorithm that has running time quadratic in the input size. Another algorithm is based on dividing the space into a uniform grid of cells and then using these cells to compute outliers. This algorithm is linear in the size of the database but exponential in the number of dimensions. (The algorithms are discussed in detail in Section 2.)

The definition of outliers from [KN98] has the advantages of being both intuitive and simple, as well as being computationally feasible for large sets of data points. However, it also has certain shortcomings:

1. It requires the user to specify a distance d, which could be difficult to determine (the authors suggest trial and error, which could require several iterations).

2. It does not provide a ranking for the outliers: for instance, a point with very few neighboring points within a distance d can be regarded in some sense as being a stronger outlier than a point with more neighbors within distance d.

3. The cell-based algorithm whose complexity is linear in the size of the database does not scale for higher numbers of dimensions (e.g., 5) since the number of cells needed grows exponentially with dimension.

In this paper, we focus on presenting a new definition for outliers and developing algorithms for mining outliers that address the above-mentioned drawbacks of the approach from [KN98]. Specifically, our definition of an outlier does not require users to specify the distance parameter d. Instead, it is based on the distance of the k-th nearest neighbor of a point. For a k and point p, let D^k(p) denote the distance of the k-th nearest neighbor of p. Intuitively, D^k(p) is a measure of how much of an outlier point p is. For example, points with larger values for D^k(p) have more sparse neighborhoods and are thus typically stronger outliers than points belonging to dense clusters, which will tend to have lower values of D^k(p). Since, in general, the user is interested in the top n outliers, we define outliers as follows: Given a k and n, a point p is an outlier if no more than n-1 other points in the data set have a higher value for D^k than p. In other words, the top n points with the maximum D^k values are considered outliers. We refer to these outliers as the D^k_n (pronounced "dee-kay-en") outliers of a data set.

The above definition has intuitive appeal since, in essence, it ranks each point based on its distance from its k-th nearest neighbor. With our new definition, the user is no longer required to specify the distance d to define the neighborhood of a point. Instead, he/she has to specify the number of outliers n that he/she is interested in; our definition basically uses the distance of the k-th neighbor of the n-th outlier to define the neighborhood distance d. Usually, n can be expected to be very small and is relatively independent of the underlying data set, thus making it easier for the user to specify compared to d.

The contributions of this paper are as follows:

+ We propose a novel definition for distance-based outliers that has great intuitive appeal. This definition is based on the distance of a point from its k-th nearest neighbor.

+ The main contribution of this paper is a partition-based outlier detection algorithm that first partitions the input points using a clustering algorithm, and computes lower and upper bounds on D^k for points in each partition. It then uses this information to identify the partitions that cannot possibly contain the top n outliers and prunes them. Outliers are then computed from the remaining points (belonging to unpruned partitions) in a final phase. Since n is typically small, our algorithm prunes a significant number of points, and thus results in substantial savings in the amount of computation.

+ We present the results of a detailed experimental study of these algorithms on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality. It also performs more than an order of magnitude better than the nested-loop and index-based algorithms.

The rest of this paper is organized as follows. Section 2 discusses related research in the area of finding outliers. Section 3 presents the problem definition and the notation that is used in the rest of the paper. Section 4 presents the nested-loop and index-based algorithms for outlier detection. Section 5 discusses our partition-based algorithm for outlier detection. Section 6 contains the results from our experimental analysis of the algorithms. We analyzed the performance of the algorithms on real-life and synthetic databases. Section 7 concludes the paper. The work reported in this paper has been done in the context of the Serendip data mining project at Bell Laboratories (www.bell-labs.com/projects/serendip).

2 Related Work

Clustering algorithms like CLARANS [NH94], DBSCAN [EKX95], BIRCH [ZRL96] and CURE [GRS98] consider outliers, but only to the point of ensuring that they do not interfere with the clustering process. Further, the definition of outliers used is in a sense subjective and related to the clusters that are detected by these algorithms. This is in contrast to our definition of distance-based outliers, which is more objective and independent of how clusters in the input data set are identified. In [AAR96], the authors address the problem of detecting deviations: after seeing a series of


Symbol      Description
k           Number of neighbors of a point that we are interested in
D^k(p)      Distance of point p to its k-th nearest neighbor
n           Total number of outliers we are interested in
N           Total number of input points
δ           Dimensionality of the input
M           Amount of memory available
dist(p,q)   Distance between a pair of points
MINDIST     Minimum distance between a point/MBR and an MBR
MAXDIST     Maximum distance between a point/MBR and an MBR

Table 1: Notation Used in the Paper

similar data, an element disturbing the series is considered an exception. Table analysis methods from the statistics literature are employed in [SAM98] to attack the problem of finding exceptions in OLAP data cubes. A detailed value of the data cube is called an exception if it is found to differ significantly from the anticipated value calculated using a model that takes into account all aggregates (group-bys) in which the value participates.

As mentioned in the introduction, the concept of distance-based outliers was developed and studied by Knorr and Ng in [KN98]. In this paper, for a k and d, the authors define a point to be an outlier if at most k points are within distance d of the point. They present two algorithms for computing outliers. One is a simple nested-loop algorithm with worst-case complexity O(δN²), where δ is the number of dimensions and N is the number of points in the data set. In order to overcome the quadratic time complexity of the nested-loop algorithm, the authors propose a cell-based approach for computing outliers in which the δ-dimensional space is partitioned into cells with sides of length d/(2√δ). The time complexity of this cell-based algorithm is O(c^δ + N), where c is a number that is inversely proportional to d. This complexity is linear in N but exponential in the number of dimensions. As a result, due to the exponential growth in the number of cells as the number of dimensions is increased, the nested-loop algorithm outperforms the cell-based algorithm for dimensions 4 and higher.

While existing work on outliers focuses only on the identification aspect, the work in [KN99] also attempts to provide intensional knowledge, which is basically an explanation of why an identified outlier is exceptional. Recently, in [BKNS00], the notion of local outliers is introduced, which, like D^k_n outliers, depend on their local neighborhoods. However, unlike D^k_n outliers, local outliers are defined with respect to the densities of the neighborhoods.

3 Problem Definition and Notation

In this section, we first present a precise statement of the problem of mining outliers from point data sets. We then present some definitions that are used in describing our algorithms. Table 1 describes the notation that we use in the remainder of the paper.

3.1 Problem Statement

Recall from the introduction that we use D^k(p) to denote the distance of point p from its k-th nearest neighbor. We rank points on the basis of their D^k(p) distance, leading to the following definition for D^k_n outliers:

Definition 3.1: Given an input data set with N points, parameters n and k, a point p is a D^k_n outlier if there are no more than n-1 other points p' such that D^k(p') > D^k(p).

In other words, if we rank points according to their D^k(p) distance, the top n points in this ranking are considered to be outliers. We can use any of the L_p metrics like the L1 ("manhattan") or L2 ("euclidean") metrics for measuring the distance between a pair of points. Alternately, for certain application domains (e.g., text documents), nonmetric distance functions can also be used, making our definition of outliers very general.

With the above definition for outliers, it is possible to rank outliers based on their D^k(p) distances: outliers with larger D^k(p) distances have fewer points close to them and are thus intuitively stronger outliers. Finally, we note that for a given k and d, if the distance-based definition from [KN98] results in n' outliers, then each of them is a D^k_{n'} outlier according to our definition.
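To pin the definition down, the following brute-force sketch (our own illustration, assuming a small in-memory list of coordinate tuples; the algorithms below avoid most of these O(N²) distance computations) computes D^k(p) for every point and returns the top n points by that value:

    import math

    def dk(data, p, k):
        # D^k(p): the distance of p to its k-th nearest neighbor.
        return sorted(math.dist(p, q) for q in data if q != p)[k - 1]

    def dkn_outliers(data, k, n):
        # Rank all points by D^k and declare the top n to be outliers.
        return sorted(data, key=lambda p: dk(data, p, k), reverse=True)[:n]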

3.2 Distances between Points and MBRs

One of the key technical tools we use in this paper is the approximation of a set of points using their minimum bounding rectangle (MBR). Then, by computing lower and upper bounds on D^k(p) for points in each MBR, we are able to identify and prune entire MBRs that cannot possibly contain D^k_n outliers. The computation of bounds for MBRs requires us to define the minimum and maximum distance between two MBRs. Outlier detection is also aided by the computation of the minimum and maximum possible distance between a point and an MBR, which we define below.

In this paper, we use the square of the Euclidean distance (instead of the Euclidean distance itself) as the distance metric, since it involves fewer and less expensive computations. We denote the distance between two points p and q by dist(p, q). Let us denote a point p in δ-dimensional space by [p_1, p_2, ..., p_δ] and a δ-dimensional rectangle R by the two endpoints of its major diagonal: r = [r_1, r_2, ..., r_δ] and r' = [r'_1, r'_2, ..., r'_δ] such that r_i ≤ r'_i for 1 ≤ i ≤ δ. Let us denote the minimum distance between point p and rectangle R by MINDIST(p, R). Every point in R is at a distance of at least MINDIST(p, R) from p. The following definition of MINDIST is from [RKV95]:

Definition 3.2: MINDIST(p, R) = Σ_{i=1}^{δ} x_i², where
    x_i = r_i - p_i if p_i < r_i;  p_i - r'_i if r'_i < p_i;  0 otherwise.

(Note that more than n points may satisfy our definition of D^k_n outliers; in this case, any n of them satisfying the definition are considered D^k_n outliers.)

We denote the maximum distance between point p and rectangle R by MAXDIST(p, R). That is, no point in R is at a distance that exceeds MAXDIST(p, R) from point p. MAXDIST(p, R) is calculated as follows:

Definition 3.3: MAXDIST(p, R) = Σ_{i=1}^{δ} x_i², where
    x_i = r'_i - p_i if p_i < (r_i + r'_i)/2;  p_i - r_i otherwise.

We next define the minimum and maximum distance between two MBRs. Let R and S be two MBRs defined by the endpoints of their major diagonals (r, r' and s, s' respectively) as before. We denote the minimum distance between R and S by MINDIST(R, S). Every point in R is at a distance of at least MINDIST(R, S) from any point in S (and vice-versa). Similarly, the maximum distance between R and S, denoted by MAXDIST(R, S), is defined. The distances can be calculated using the following two formulae:

Definition 3.4: MINDIST(R, S) = Σ_{i=1}^{δ} x_i², where
    x_i = r_i - s'_i if s'_i < r_i;  s_i - r'_i if r'_i < s_i;  0 otherwise.

Definition 3.5: MAXDIST(R, S) = Σ_{i=1}^{δ} x_i², where x_i = max{|s'_i - r_i|, |r'_i - s_i|}.
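These four definitions translate directly into code. Below is a small Python rendering (ours, not the paper's; it uses the squared-distance convention above and represents an MBR by the two endpoints r, r2 of its major diagonal):

    def mindist_point(p, r, r2):
        # Def. 3.2: squared minimum distance from point p to MBR [r, r2].
        total = 0.0
        for pi, ri, r2i in zip(p, r, r2):
            x = ri - pi if pi < ri else (pi - r2i if r2i < pi else 0.0)
            total += x * x
        return total

    def maxdist_point(p, r, r2):
        # Def. 3.3: squared maximum distance from point p to MBR [r, r2];
        # per dimension, the farther of the two MBR edges is chosen.
        return sum(max(r2i - pi, pi - ri) ** 2 for pi, ri, r2i in zip(p, r, r2))

    def mindist_mbr(r, r2, s, s2):
        # Def. 3.4: squared minimum distance between MBRs R and S
        # (0 in every dimension where their extents overlap).
        total = 0.0
        for ri, r2i, si, s2i in zip(r, r2, s, s2):
            x = ri - s2i if s2i < ri else (si - r2i if r2i < si else 0.0)
            total += x * x
        return total

    def maxdist_mbr(r, r2, s, s2):
        # Def. 3.5: squared maximum distance between MBRs R and S.
        return sum(max(abs(s2i - ri), abs(r2i - si)) ** 2
                   for ri, r2i, si, s2i in zip(r, r2, s, s2))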

4 Nested-Loop and Index-Based Algorithms

In this section, we describe two relatively straightforward solutions to the problem of computing D^k_n outliers.

Block Nested-Loop Join: The nested-loop algorithm for computing outliers simply computes, for each input point p, D^k(p), the distance of its k-th nearest neighbor. It then selects the top n points with the maximum D^k values. In order to compute D^k for points, the algorithm scans the database for each point p. For a point p, a list of the k nearest points for p is maintained, and for each point q from the database which is considered, a check is made to see if dist(p, q) is smaller than the distance of the k-th nearest neighbor found so far. If the check succeeds, q is included in the list of the k nearest neighbors for p (if the list contains more than k neighbors, then the point that is furthest away from p is deleted from the list). The nested-loop algorithm can be made I/O efficient by computing D^k for a block of points together.

Index-Based Join: Even with the I/O optimization, the nested-loop approach still requires O(N²) distance computations. This is expensive computationally, especially if the dimensionality of the points is high. The number of distance computations can be substantially reduced by using a spatial index like an R*-tree [BKSS90].

If we have all the points stored in a spatial index like the R*-tree, the following pruning optimization, which was pointed out in [RKV95], can be applied to reduce the number of distance computations: Suppose that we have computed D^k(p) for p by looking at a subset of the input points. The value that we have is clearly an upper bound for the actual D^k(p) for p. If the minimum distance between p and the MBR of a node in the R*-tree exceeds the D^k(p) value that we have currently, none of the points in the sub-tree rooted under the node will be among the k nearest neighbors of p. This optimization lets us prune entire sub-trees containing points irrelevant to the k-nearest neighbor search for p.

In addition, since we are interested in computing only the top n outliers, we can apply the following pruning optimization for discontinuing the computation of D^k(p) for a point p. Assume that during each step of the index-based algorithm, we store the top n outliers computed so far. Let minDkDist be the minimum D^k among these top outliers. If during the computation of D^k(p) for a point p, we find that the value for D^k(p) computed so far has fallen below minDkDist, we are guaranteed that point p cannot be an outlier. Therefore, it can be safely discarded. This is because D^k(p) monotonically decreases as we examine more points. Therefore, p is guaranteed to not be one of the top n outliers. Note that this optimization can also be applied to the nested-loop algorithm.
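As a minimal illustration of this early-termination idea (our own sketch in Python, not the paper's code; it applies the optimization to a plain scan rather than an R*-tree), the computation of D^k(p) can be abandoned as soon as its running upper bound drops to minDkDist or below:

    import heapq, math

    def kth_neighbor_dist(data, p, k, min_dk_dist):
        # Max-heap (as negated distances) of the k nearest neighbors of p
        # seen so far; -heap[0] is the current upper bound on D^k(p).
        heap = []
        for q in data:
            if q == p:
                continue
            heapq.heappush(heap, -math.dist(p, q))
            if len(heap) > k:
                heapq.heappop(heap)  # drop the farthest of the k+1 points
            if len(heap) == k and -heap[0] <= min_dk_dist:
                return None  # p cannot be among the top n outliers
        return -heap[0] if len(heap) == k else float("inf")

The caller maintains a heap of the top n D^k values seen so far and passes the smallest of them as min_dk_dist, exactly as outHeap and minDkDist are used in Figure 1.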

Procedure computeOutliersIndex for computing D^k_n outliers is shown in Figure 1. It uses Procedure getKthNeighborDist in Figure 2 as a subroutine. In computeOutliersIndex, points are first inserted into an R*-tree index (any other spatial index structure can be used instead of the R*-tree) in steps 1 and 2. The R*-tree is used to compute the k-th nearest neighbor for each point. In addition, the procedure keeps track of the n points with the maximum value for D^k at any point during its execution in a heap outHeap. The points are stored in the heap in increasing order of D^k, such that the point with the smallest value for D^k is at the top. This D^k value is also stored in the variable minDkDist and passed to the getKthNeighborDist routine. Initially, outHeap is empty and minDkDist is 0.

The for loop spanning steps 5-13 calls getKthNeighborDist for each point in the input, inserting the point into outHeap if the point's D^k value is among the top n values seen

(Note that the work in [RKV95] uses a tighter bound called MINMAXDIST in order to prune nodes. This is because they want to find the maximum possible distance for the nearest neighbor point of p, not the k nearest neighbors as we are doing. When looking for the nearest neighbor of a point, we can have a tighter bound for the maximum distance to this neighbor.)


Procedure computeOutliersIndex(k, n)
begin
1.  for each point p in input data set do
2.    insertIntoIndex(Tree, p)
3.  outHeap := ∅
4.  minDkDist := 0
5.  for each point p in input data set do {
6.    getKthNeighborDist(Tree.Root, p, k, minDkDist)
7.    if (p.DkDist > minDkDist) {
8.      outHeap.insert(p)
9.      if (outHeap.numPoints() > n) outHeap.deleteTop()
10.     if (outHeap.numPoints() = n)
11.       minDkDist := outHeap.top().DkDist
12.   }
13. }
14. return outHeap
end

Figure 1: Index-Based Algorithm for Computing Outliers

so far (p.DkDist stores the D^k value for point p). If the heap's size exceeds n, the point with the lowest D^k value is removed from the heap and minDkDist updated.

Procedure getKthNeighborDist computes D^k(p) for point p by examining nodes in the R*-tree. It does this using a linked list nodeList. Initially, nodeList contains the root of the R*-tree. Elements in nodeList are sorted in ascending order of their MINDIST from p. During each iteration of the while loop spanning lines 4-23, the first node from nodeList is examined.

If the node is a leaf node, points in the leaf node are processed. In order to aid this processing, the k nearest neighbors of p among the points examined so far are stored in the heap nearHeap. nearHeap stores points in the decreasing order of their distance from p. p.DkDist stores D^k for p from the points examined. (It is ∞ until k points are examined.) If at any time, a point q is found whose distance to p is less than p.DkDist, q is inserted into nearHeap (steps 8-9). If nearHeap contains more than k points, the point at the top of nearHeap is discarded, and p.DkDist updated (steps 10-12). If at any time, the value for p.DkDist falls below minDkDist (recall that p.DkDist monotonically decreases as we examine more points), point p cannot be an outlier. Therefore, procedure getKthNeighborDist immediately terminates further computation of D^k for p and returns (step 13). This way, getKthNeighborDist avoids unnecessary computation for a point the moment it is determined that it is not an outlier candidate.

On the other hand, if the node at the head of nodeList is an interior node, the node is expanded by appending its children to nodeList. Then nodeList is sorted according to MINDIST (steps 17-18). In the final steps 20-22, nodes whose minimum distance from p exceeds p.DkDist are pruned. Points contained in these nodes obviously cannot qualify to be amongst p's k nearest neighbors and can be

(Distances for nodes are actually computed using their MBRs.)

Procedure getKthNeighborDist(Root, p, k, minDkDist)
begin
1.  nodeList := { Root }
2.  p.DkDist := ∞
3.  nearHeap := ∅
4.  while nodeList is not empty do {
5.    delete the first element, Node, from nodeList
6.    if (Node is a leaf) {
7.      for each point q in Node do
8.        if (dist(p, q) < p.DkDist) {
9.          nearHeap.insert(q)
10.         if (nearHeap.numPoints() > k) nearHeap.deleteTop()
11.         if (nearHeap.numPoints() = k)
12.           p.DkDist := dist(p, nearHeap.top())
13.         if (p.DkDist ≤ minDkDist) return
14.       }
15.   }
16.   else {
17.     append Node's children to nodeList
18.     sort nodeList by MINDIST
19.   }
20.   for each Node in nodeList do
21.     if (p.DkDist ≤ MINDIST(p, Node))
22.       delete Node from nodeList
23. }
end

Figure 2: Computation of Distance for k-th Nearest Neighbor

safelyignored.

5 Partition-Based Algorithm

The fundamental shortcoming with the algorithms presented in the previous section is that they are computationally expensive. This is because for each point p in the database we initiate the computation of D^k(p), its distance from its k-th nearest neighbor. Since we are only interested in the top n outliers, and typically n is very small, the distance computations for most of the remaining points are of little use and can be altogether avoided.

The partition-based algorithm proposed in this section prunes out points whose distances from their k-th nearest neighbors are so small that they cannot possibly make it to the top n outliers. Furthermore, by partitioning the data set, it is able to make this determination for a point p without actually computing the precise value of D^k(p). Our experimental results in Section 6 indicate that this pruning strategy can result in substantial performance speedups due to savings in both computation and I/O.

5.1 Overview

The key idea underlying the partition-based algorithm is to first partition the data space, and then prune partitions as soon as it can be determined that they cannot contain outliers. Since n will typically be very small, this additional preprocessing step performed at the granularity of partitions rather than points eliminates a significant number of points


as outlier candidates. Consequently, k-th nearest neighbor computations need to be performed for very few points, thus speeding up the computation of outliers. Furthermore, since the number of partitions in the preprocessing step is usually much smaller compared to the number of points, and the preprocessing is performed at the granularity of partitions rather than points, the overhead of preprocessing is low.

We briefly describe the steps performed by the partition-based algorithm below, and defer the presentation of details to subsequent sections.

1. Generate partitions: In the first step, we use a clustering algorithm to cluster the data and treat each cluster as a separate partition.

2. Compute bounds on D^k for points in each partition: For each partition P, we compute lower and upper bounds (stored in P.lower and P.upper, respectively) on D^k for points in the partition. Thus, for every point p ∈ P, D^k(p) ≥ P.lower and D^k(p) ≤ P.upper.

3. Identify candidate partitions containing outliers: In this step, we identify the candidate partitions, that is, the partitions containing points which are candidates for outliers. Suppose we could compute minDkDist, the lower bound on D^k for the n outliers. Then, if P.upper for a partition P is less than minDkDist, none of the points in P can possibly be outliers. Thus, only partitions P for which P.upper ≥ minDkDist are candidate partitions.

   minDkDist can be computed from P.lower for the partitions as follows. Consider the partitions in decreasing order of P.lower. Let P_1, ..., P_m be the partitions with the maximum values for P.lower such that the number of points in the partitions is at least n. Then a lower bound on D^k for an outlier is min{P_i.lower : 1 ≤ i ≤ m} (a sketch in Python follows this list).

4. Compute outliers from points in candidate partitions: In the final step, the outliers are computed from among the points in the candidate partitions. For each candidate partition P, let P.neighbors denote the neighboring partitions of P, which are all the partitions within distance P.upper from P. Points belonging to neighboring partitions of P are the only points that need to be examined when computing D^k for each point in P. Since the number of points in the candidate partitions and their neighboring partitions could become quite large, we process the points in the candidate partitions in batches, each batch involving a subset of the candidate partitions.
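As a sketch of step 3 (our own illustration, assuming each partition is an object with lower and upper bounds and a num_points count already computed in step 2):

    def candidate_partitions(partitions, n):
        # Walk partitions in decreasing order of P.lower until they jointly
        # hold at least n points; the smallest P.lower seen is minDkDist,
        # a lower bound on D^k for any of the top n outliers.
        count, min_dk_dist = 0, 0.0
        for P in sorted(partitions, key=lambda P: P.lower, reverse=True):
            min_dk_dist = P.lower
            count += P.num_points
            if count >= n:
                break
        # Only partitions that might contain a point with D^k >= minDkDist
        # survive; all others are pruned.
        return [P for P in partitions if P.upper >= min_dk_dist], min_dk_dist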

5.2 Generating Partitions

Partitioning the data space into cells and then treating each cell as a partition is impractical for higher dimensional spaces. This approach was found to be ineffective for more than 4 dimensions in [KN98] due to the exponential growth in the number of cells as the number of dimensions increases.

For effective pruning, we would like to partition the data such that points which are close together are assigned to a single partition. Thus, employing a clustering algorithm for partitioning the data points is a good choice. A number of clustering algorithms have been proposed in the literature, most of which have at least quadratic time complexity [JD88]. Since N could be quite large, we are more interested in clustering algorithms that can handle large data sets. Among algorithms with lower complexities is the pre-clustering phase of BIRCH [ZRL96], a state-of-the-art clustering algorithm that can handle large data sets. The pre-clustering phase has time complexity that is linear in the input size and performs a single scan of the database. It stores a compact summarization for each cluster in a CF-tree, which is a balanced tree structure similar to an R-tree [Sam89]. For each successive point, it traverses the CF-tree to find the closest cluster, and if the point is within a threshold distance T of the cluster, it is absorbed into it; else, it starts a new cluster. In case the size of the CF-tree exceeds the main memory size M, the threshold T is increased and clusters in the CF-tree that are within (the new increased) T distance of each other are merged.

The main memory size M and the points in the data set are given as inputs to BIRCH's pre-clustering algorithm. BIRCH generates a set of clusters with generally uniform sizes that fit in M. We treat each cluster as a separate partition: the points in the partition are simply the points that were assigned to its cluster during the pre-clustering phase. Thus, by controlling the memory size M input to BIRCH, we can control the number of partitions generated. We represent each partition by the MBR for its points. Note that the MBRs for partitions may overlap.

We must emphasize that we use clustering here simply as a heuristic for efficiently generating desirable partitions, and not for computing outliers. Most clustering algorithms, including BIRCH, perform outlier detection; however, unlike our notion of outliers, their definition of outliers is not mathematically precise and is more a consequence of operational considerations that need to be addressed during the clustering process.
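As an illustrative stand-in for this partition-generation step (a sketch, not the authors' setup: it uses scikit-learn's Birch in place of their modified BIRCH pre-clustering phase, and the threshold value is an arbitrary assumption):

    import numpy as np
    from sklearn.cluster import Birch

    def generate_partitions(points, threshold=2.0):
        # Cluster the points; each cluster becomes one partition,
        # represented by the MBR (per-dimension min/max) of its points.
        labels = Birch(threshold=threshold, n_clusters=None).fit_predict(points)
        partitions = []
        for label in np.unique(labels):
            members = points[labels == label]
            partitions.append((members.min(axis=0), members.max(axis=0), members))
        return partitions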

5.3 Computing Bounds for Partitions

For the purpose of identifying the candidate partitions, we need to first compute the bounds P.lower and P.upper, which have the following property: for all points p ∈ P, P.lower ≤ D^k(p) ≤ P.upper. The bounds P.lower/P.upper for a partition P can be determined by finding the m partitions closest to P with respect to MINDIST/MAXDIST such that the number of points in P_1, ..., P_m is at least k. Since the partitions fit in main memory, a main memory index can be used to find the m partitions closest to P (for each partition, its MBR is stored in the index).

Procedure computeLowerUpper for computing P.lower and P.upper for partition P is shown in Figure 3. Among its input parameters are the root of the index containing all the


Procedure computeLowerUpper(Root, P, k, minDkDist)
begin
1.  nodeList := { Root }
2.  P.lower := P.upper := ∞
3.  lowerHeap := upperHeap := ∅
4.  while nodeList is not empty do {
5.    delete the first element, Node, from nodeList
6.    if (Node is a leaf) {
7.      for each partition Q in Node {
8.        if (MINDIST(P, Q) < P.lower) {
9.          lowerHeap.insert(Q)
10.         while lowerHeap.numPoints() -
11.               lowerHeap.top().numPoints() ≥ k do
12.           lowerHeap.deleteTop()
13.         if (lowerHeap.numPoints() ≥ k)
14.           P.lower := MINDIST(P, lowerHeap.top())
15.       }
16.       if (MAXDIST(P, Q) < P.upper) {
17.         upperHeap.insert(Q)
18.         while upperHeap.numPoints() -
19.               upperHeap.top().numPoints() ≥ k do
20.           upperHeap.deleteTop()
21.         if (upperHeap.numPoints() ≥ k)
22.           P.upper := MAXDIST(P, upperHeap.top())
23.         if (P.upper ≤ minDkDist) return
24.       }
25.     }
26.   }
27.   else {
28.     append Node's children to nodeList
29.     sort nodeList by MINDIST
30.   }
31.   for each Node in nodeList do
32.     if (P.upper ≤ MAXDIST(P, Node) and
33.         P.lower ≤ MINDIST(P, Node))
34.       delete Node from nodeList
35. }
end

Figure 3: Computation of Lower and Upper Bounds for Partitions

partitions and minDkDist, which is a lower bound on D^k for an outlier. The procedure is invoked by the procedure which computes the candidate partitions, computeCandidatePartitions, shown in Figure 4, that we will describe in the next subsection. Procedure computeCandidatePartitions keeps track of minDkDist and passes this to computeLowerUpper so that computation of the bounds for a partition P can be optimized. The idea is that if P.upper for partition P becomes less than minDkDist, then it cannot contain outliers. Computation of bounds for it can cease immediately.

computeLowerUpper is similar to procedure getKthNeighborDist described in the previous section (see Figure 2). It stores partitions in two heaps, lowerHeap and upperHeap, in the decreasing order of MINDIST and MAXDIST from P, respectively; thus, partitions with the largest values of MINDIST and MAXDIST appear at the top of the heaps.

5.4 Computing Candidate Partitions

This is the crucial step in our partition-based algorithm in which we identify the candidate partitions that can potentially contain outliers, and prune the remaining partitions. The idea is to use the bounds computed in the previous section to first estimate minDkDist, which is a lower bound on D^k for an outlier. Then a partition P is a candidate only if P.upper ≥ minDkDist. The lower bound minDkDist can be computed using the P.lower values for the partitions as follows. Let P_1, ..., P_m be the partitions with the maximum values for P.lower and containing at least n points. Then minDkDist = min{P_i.lower : 1 ≤ i ≤ m} is a lower bound on D^k for an outlier.

The procedure for computing the candidate partitions from among the set of partitions PSet is illustrated in Figure 4. The partitions are stored in a main memory index and computeLowerUpper is invoked to compute the lower and upper bounds for each partition. However, instead of computing minDkDist after the bounds for all the partitions have been computed, computeCandidatePartitions stores, in the heap partHeap, the partitions with the largest P.lower values and containing at least n points among them. The partitions are stored in increasing order of P.lower in partHeap, and minDkDist is thus equal to P.lower for the partition P at the top of partHeap. The benefit of maintaining minDkDist is that it can be passed as a parameter to computeLowerUpper (in step 6) and the computation of bounds for a partition P can be halted early if P.upper for it falls below minDkDist. If, for a partition P, P.lower is greater than the current value of minDkDist, then it is inserted into partHeap and the value of minDkDist is appropriately adjusted (steps 8-13).

Procedure computeCandidatePartitions(PSet, k, n)
begin
1.  for each partition P in PSet do
2.    insertIntoIndex(Tree, P)
3.  partHeap := ∅
4.  minDkDist := 0
5.  for each partition P in PSet do {
6.    computeLowerUpper(Tree.Root, P, k, minDkDist)
7.    if (P.lower > minDkDist) {
8.      partHeap.insert(P)
9.      while partHeap.numPoints() -
10.           partHeap.top().numPoints() ≥ n do
11.       partHeap.deleteTop()
12.     if (partHeap.numPoints() ≥ n)
13.       minDkDist := partHeap.top().lower
14.   }
15. }
16. candSet := ∅
17. for each partition P in PSet do
18.   if (P.upper ≥ minDkDist) {
19.     candSet := candSet ∪ {P}
20.     P.neighbors :=
21.       { Q : Q ∈ PSet and MINDIST(P, Q) ≤ P.upper }
22.   }
23. return candSet
end

Figure 4: Computation of Candidate Partitions

In the for loop over steps 17-22, the set of candidate partitions candSet is computed, and for each candidate partition P, partitions Q that can potentially contain the k-th


nearest neighbor for a point in P are added to P.neighbors (note that P.neighbors contains P).

5.5 Computing Outliers from Candidate Partitions

In the final step, we compute the top n outliers from the candidate partitions in candSet. If points in all the candidate partitions and their neighbors fit in memory, then we can simply load all the points into a main memory spatial index. The index-based algorithm (see Figure 1) can then be used to compute the n outliers by probing the index to compute D^k values only for points belonging to the candidate partitions. Since both the size of the index as well as the number of candidate points will in general be small compared to the total number of points in the data set, this can be expected to be much faster than executing the index-based algorithm on the entire data set of points.

In the case that all the candidate partitions and their neighbors exceed the size of main memory, we need to process the candidate partitions in batches. In each batch, a subset of the remaining candidate partitions that along with their neighbors fit in memory is chosen for processing. Due to space constraints, we refer the reader to [RRS98] for details of the batch processing algorithm.

6 Experimental Results

We empirically compared the performance of our partition-based algorithm with the block nested-loop and index-based algorithms. In our experiments, we found that the partition-based algorithm scales well with both the data set size as well as data set dimensionality. In addition, in a number of cases, it is more than an order of magnitude faster than the block nested-loop and index-based algorithms.

We begin by describing in Section 6.1 our experience with mining a real-life NBA (National Basketball Association) database using our notion of outliers. The results indicate the efficacy of our approach in finding "interesting" and sometimes unexpected facts buried in the data. We then evaluate the performance of the algorithms on a class of synthetic data sets in Section 6.2. The experiments were performed on a Sun Ultra-2/200 workstation with 512 MB of main memory, running Solaris 2.5. The data sets were stored on a local disk.

6.1 Analyzing NBA Statistics

We analyzed the statistics for the 1998 NBA season with our outlier programs to see if they could discover interesting nuggets in those statistics. We had information about all 471 NBA players who played in the NBA during the 1997-1998 season. In order to restrict our attention to significant players, we removed all players who scored less than 100 points over the course of the entire season. This left us with 335 players. We then wanted to ensure that all the columns were given equal weight. We accomplished this by transforming the value x in a column to (x - x̄)/σ_x, where x̄ is the average value of the column and σ_x its standard deviation. This transformation normalizes the column to have an average of 0 and a standard deviation of 1.
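In NumPy terms, this normalization is a one-liner (a sketch, assuming the statistics form an array with one row per player and one column per statistic):

    import numpy as np

    def normalize(stats):
        # Transform each column x to (x - mean) / std, giving every column
        # an average of 0 and a standard deviation of 1.
        return (stats - stats.mean(axis=0)) / stats.std(axis=0)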

We then ran our outlier program on the transformed data. We used a value of 10 for k and looked for the top 5 outliers. The results from some of the runs are shown in Figure 5. (findOuts.pl is a Perl front end to the outlier program that understands the names of the columns in the NBA database. It simply processes its arguments and calls the outlier program.) In addition to giving the actual value for a column, the output also prints the normalized value used in the outlier calculation. The outliers are ranked based on their D^k values, which are listed under the DIST column.

The first experiment in Figure 5 focuses on the three most commonly used average statistics in the NBA: average points per game, average assists per game and average rebounds per game. What stands out is the extent to which players having a large value in one dimension tend to dominate in the outlier list. For instance, Dennis Rodman, not known to excel in either assisting or scoring, is nevertheless the top outlier because of his huge (nearly 4.4 sigmas) deviation from the average on rebounds. Furthermore, his DIST value is much higher than that for any of the other outliers, thus making him an extremely strong outlier. Two other players in this outlier list also tend to dominate in one or two columns. An interesting case is that of Shaquille O'Neal, who made it to the outlier list due to his excellent record in both scoring and rebounds, though he is quite average on assists. (Recall that the average of every normalized column is 0.) The first "well-rounded" player to appear in this list is Karl Malone, at position 5. (Michael Jordan is at position 7.) In fact, in the list of the top 25 outliers, there are only two players, Karl Malone and Grant Hill (at positions 5 and 6), that have normalized values of more than 1 in all three columns.

When we look at more defensive statistics, the outliers are once again dominated by players having large normalized values for a single column. When we consider average steals and blocks, the outliers are dominated by shot blockers like Marcus Camby. Hakeem Olajuwon, at position 5, shows up as the first "balanced" player due to his above average record with respect to both steals and blocks.

In conclusion, we were somewhat surprised by the outcome of our experiments on the NBA data. First, we found that very few "balanced" players (that is, players who are above average in every aspect of the game) are labeled as outliers. Instead, the outlier lists are dominated by players who excel by a wide margin in particular aspects of the game (e.g., Dennis Rodman on rebounds).

Another interesting observation we made was that the outliers found tended to be more interesting when we considered fewer attributes (e.g., 2 or 3). This is not entirely surprising since it is a well-known fact that as the number of dimensions increases, points spread out more uniformly in the data space and distances between them are a poor measure of their similarity/dissimilarity.


->findOuts.pl -n 5 -k 10 reb assists pts
NAME             DIST  avgReb (norm)   avgAssts (norm)  avgPts (norm)
Dennis Rodman    7.26  15.000 (4.376)   2.900 (0.670)    4.700 (-0.459)
Rod Strickland   3.95   5.300 (0.750)  10.500 (4.922)   17.800 (1.740)
Shaquille Oneal  3.61  11.400 (3.030)   2.400 (0.391)   28.300 (3.503)
Jayson Williams  3.33  13.600 (3.852)   1.000 (-0.393)  12.900 (0.918)
Karl Malone      2.96  10.300 (2.619)   3.900 (1.230)   27.000 (3.285)

->findOuts.pl -n 5 -k 10 steal blocks
NAME             DIST  avgSteals (norm)  avgBlocks (norm)
Marcus Camby     8.44  1.100 (0.838)     3.700 (6.139)
Dikembe Mutombo  5.35  0.400 (-0.550)    3.400 (5.580)
Shawn Bradley    4.36  0.800 (0.243)     3.300 (5.394)
Theo Ratliff     3.51  0.600 (-0.153)    3.200 (5.208)
Hakeem Olajuwon  3.47  1.800 (2.225)     2.000 (2.972)

Figure 5: Finding Outliers from a 1998 NBA Statistics Database

Finally, while we were conducting our experiments on the NBA database, we realized that specifying actual distances, as is required in [KN98], is fairly difficult in practice. Instead, our notion of outliers, which only requires us to specify the k value used in calculating the k-th neighbor distance, is much simpler to work with. (The results are fairly insensitive to minor changes in k, making the job of specifying it easy.) Note also the ranking for players that we provide in Figure 5 based on distance; this enables us to determine how strong an outlier really is.

6.2 Performance Results on Synthetic Data

We begin this section by briefly describing our implementation of the three algorithms that we used. We then move on to describing the synthetic data sets that we used.

6.2.1 Algorithms Implemented

Block Nested-Loop Algorithm: This algorithm was described in Section 4. In order to optimize the performance of this algorithm, we implemented our own buffer manager and performed reads in large blocks. We allocated as much buffer space as possible to the outer loop.

Index-Based Algorithm: To speed up execution, an R*-tree was used to find the k nearest neighbors for a point, as described in Section 4. The R*-tree code was developed at the University of Maryland. (Our thanks to Christos Faloutsos for providing us with this code.) The R*-tree we used was a main memory-based version. The page size for the R*-tree was set to 1024 bytes. In our experiments, the R*-tree always fit in memory. Furthermore, for the index-based algorithm, we did not include the time to build the tree (that is, insert data points into the tree) in our measurements of execution time. Thus, our measurement for the running time of the index-based algorithm only includes the CPU time for main memory search. Note that this gives the index-based algorithm an advantage over the other algorithms.

Partition-Based Algorithm: We implemented our partition-based algorithm as described in Section 5. Thus, we used BIRCH's pre-clustering algorithm for generating partitions, the main memory R*-tree to determine the candidate partitions, and the block nested-loop algorithm for computing outliers from the candidate partitions in the final step. We found that for the final step, the performance of the block nested-loop algorithm was competitive with the index-based algorithm, since the previous pruning steps did a very good job of identifying the candidate partitions and their neighbors.

We configured BIRCH to provide a bounding rectangle for each cluster it generated. We used this as the MBR for the corresponding partition. We stored the MBR and number of points in each partition in an R*-tree. We used the resulting index to identify candidate and neighboring partitions. Since we needed to identify the partition to which BIRCH assigned a point, we modified BIRCH to generate this information.

Recall from Section 5.2 that an important parameter to BIRCH is the amount of memory it is allowed to use. In the experiments, we specify this parameter in terms of the number of clusters or partitions that BIRCH is allowed to create.

6.2.2 Synthetic Data Sets

For our experiments, we used the grid synthetic data set that was employed in [ZRL96] to study the sensitivity of BIRCH. The data set contains 100 hyper-spherical clusters arranged as a 10 × 10 grid. The center of each cluster is located at (10i, 10j) for 1 ≤ i ≤ 10 and 1 ≤ j ≤ 10. Furthermore, each cluster has a radius of 4. Data points for a cluster are uniformly distributed in the hyper-sphere that defines the cluster. We also uniformly scattered 1000 outlier points in the space spanning 0 to 110 along each dimension. Table 2 shows the parameters for the data set, along with their default values and the range of values for which we conducted experiments.


Parameter                                Default Value   Range of Values
Number of Points (N)                     101000          11000 to 1 million
Number of Clusters                       100
Number of Points per Cluster             1000            100 to 10000
Number of Outliers in Data Set           1000
Number of Outliers to be Computed (n)    100             100 to 500
Number of Neighbors (k)                  100             100 to 500
Number of Dimensions (δ)                 2               2 to 10
Maximum Number of Partitions             6000            5000 to 15000
Distance Metric                          euclidean

Table 2: Synthetic Data Parameters

Figure 6: Performance Results for N. [Plots omitted: execution time in seconds (log scale) (a) for the block nested-loop, index-based, and partition-based (6k) algorithms as N ranges from 11000 to 101000, and (b) total execution time versus clustering time only for the partition-based algorithm as N ranges from 101000 to 1 million.]

6.2.3 Performance Results

Number of Points: To study how the three algorithms scale with data set size, we varied the number of points per cluster from 100 to 10,000. This varies the size of the data set from 11000 to approximately 1 million. Both n and k were set to their default values of 100. The limit on the number of partitions for the partition-based algorithm was set to 6000. The execution times for the three algorithms as N is varied from 11000 to 101000 are shown using a log scale in Figure 6(a).

As the figure illustrates, the block nested-loop algorithm is the worst performer. Since the number of computations it performs is proportional to the square of the number of points, it exhibits a quadratic dependency on the input size. The index-based algorithm is a lot better than block nested-loop, but it is still 2 to 6 times slower than the partition-based algorithm. For 101000 points, the block nested-loop algorithm takes about 5 hours to compute 100 outliers, the index-based algorithm less than 5 minutes, while the partition-based algorithm takes about half a minute. In order to explain why the partition-based algorithm performs so well, we present in Table 3 the number of candidate and neighbor partitions as well as points processed in the final step. From the table, it follows that for N = 101000, out of the approximately 6000 initial partitions, only about 160 candidate partitions and 1500 neighbor partitions are processed in the final phase. Thus, about 75% of partitions

are entirely pruned from the data set, and only about 0.25% of the points in the data set are candidates for outliers (230 out of 101000 points). This results in tremendous savings in both I/O and computation, and enables the partition-based scheme to outperform the other two algorithms by almost an order of magnitude.

In Figure 6(b), we plot the execution time of only the partition-based algorithm as the number of points is increased from 100,000 to 1 million, to see how it scales for much larger data sets. We also plot the time spent by BIRCH for generating partitions; from the graph, it follows that this increases about linearly with input size. However, the overhead of the final step increases substantially as the data set size is increased. The reason for this is that since we generate the same number, 6000, of partitions even for a million points, the average number of points per partition exceeds k, which is 100. As a result, computed lower bounds for partitions are close to 0 and minDkDist, the lower bound on the D^k value for an outlier, is low too. Thus, our pruning is less effective if the data set size is increased without a corresponding increase in the number of partitions. Specifically, in order to ensure a high degree of pruning, a good rule of thumb is to choose the number of partitions such that the average number of points per partition is fairly small (but not too small) compared to k. For example, N/(k/5) is a good value (for instance, with k = 100, this is one partition per 20 points). This makes the clusters generated by BIRCH have an average size of k/5.


N        Avg. # of Points   # of Candidate   # of Neighbor   # of Candidate   # of Neighbor
         per Partition      Partitions       Partitions      Points           Points
11000     2.46              115              1334            130               3266
26000     4.70              131              1123            144               5256
51000     8.16              141              1088            159               8850
76000    11.73              143               963            160              11273
101000   16.39              160              1505            230              24605

Table 3: Statistics for N

k-th Nearest Neighbor: Figure 7(a) shows the result of increasing the value of k from 100 to 500. We considered the index-based algorithm and the partition-based algorithm with 3 different settings for the number of partitions: 5000, 6000 and 15000. We did not explore the behavior of the block nested-loop algorithm because it is very slow compared to the other two algorithms. The value of n was set to 100 and the number of points in the data set was 101000. The execution times are shown using a log scale.

As the graph confirms, the performance of the partition-based algorithms does not degrade as k is increased. This is because we found that as k is increased, the number of candidate partitions decreases slightly, since a larger k implies a higher value for minDkDist, which results in more pruning. However, a larger k also implies more neighboring partitions for each candidate partition. These two opposing effects cancel each other to leave the performance of the partition-based algorithm relatively unchanged.

On the other hand, due to the overhead associated with finding the k nearest neighbors, the performance of the index-based algorithm suffers significantly as the value of k increases. Since the partition-based algorithms prune more than 75% of points and only 0.25% of the data set are candidates for outliers, they are generally 10 to 70 times faster than the index-based algorithm.

Also, note that as the number of partitions is increased, the performance of the partition-based algorithm becomes worse. The reason for this is that when each partition contains too few points, the cost of computing lower and upper bounds for each partition is no longer low. For instance, in case each partition contains a single point, then computing lower and upper bounds for a partition is equivalent to computing D^k for every point in the data set, and so the partition-based algorithm degenerates to the index-based algorithm. Therefore, as we increase the number of partitions, the execution time of the partition-based algorithm converges to that of the index-based algorithm.

Number of Outliers: When the number of outliers n is varied from 100 to 500 with default settings for other parameters, we found that the execution time of all algorithms increases gradually. Due to space constraints, we do not present the graphs for these experiments in this paper. These can be found in [RRS98].

Number of Dimensions: Figure 7(b) plots the execution times of the three algorithms as the number of dimensions is increased from 2 to 10 (the remaining parameters are set to their default values). While we set the cluster radius to 4 for the 2-dimensional data set, we reduced the radii of clusters for higher dimensions. We did this because the volume of the hyper-spheres of clusters tends to grow exponentially with dimension, and thus the points in higher dimensional space become very sparse. Therefore, we had to reduce the cluster radius to ensure that points in each cluster are relatively close compared to points in other clusters. We used radius values of 2, 1.4, 1.2 and 1.2, respectively, for dimensions from 4 to 10.

For 10 dimensions, the partition-based algorithm is about 30 times faster than the index-based algorithm and about 180 times faster than the block nested-loop algorithm. Note that this was without including the building time for the R*-tree in the index-based algorithm. The execution time of the partition-based algorithm increases sub-linearly as the dimensionality of the data set is increased. In contrast, running times for the index-based algorithm increase very rapidly due to the increased overhead of performing search in higher dimensions using the R*-tree. Thus, the partition-based algorithm scales better than the other algorithms for higher dimensions.

7 Conclusions

In this paper, we proposed a novel formulation for distance-based outliers that is based on the distance of a point from its k-th nearest neighbor. We rank each point on the basis of its distance to its k-th nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we developed a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it can be determined that they cannot contain outliers. Since people are usually interested in only a small number of outliers, our algorithm is able to determine very quickly that a significant number of the input points cannot be outliers. This results in substantial savings in computation.

We presented the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality. Furthermore, it outperforms the nested-loop and index-based algorithms by more than an order of magnitude for a wide range of parameter settings.


Figure 7: Performance Results for k and δ. [Plots omitted: execution time in seconds (log scale) (a) for the index-based and partition-based (15k, 6k, and 5k) algorithms as k ranges from 100 to 500, and (b) for the block nested-loop, index-based, and partition-based (6k) algorithms as the number of dimensions ranges from 2 to 10.]

Acknowledgments: Without the support of Seema Bansal and Yesook Shim, it would have been impossible to complete this work.

References

[AAR96] A. Arning, Rakesh Agrawal, and P. Raghavan. A linear method for deviation detection in large databases. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Portland, Oregon, August 1996.

[AMS+95] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast Discovery of Association Rules, chapter 14. 1995.

[BKNS00] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jorg Sander. LOF: identifying density-based local outliers. In Proc. of the ACM SIGMOD Conference on Management of Data, May 2000.

[BKSS90] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In Proc. of ACM SIGMOD, pages 322-331, Atlantic City, NJ, May 1990.

[BL94] V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley and Sons, New York, 1994.

[EKX95] Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. A database interface for clustering in large spatial databases. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, August 1995.

[GRS98] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: An efficient clustering algorithm for large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, June 1998.

[JD88] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, New Jersey, 1988.

[KN98] Edwin Knorr and Raymond Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. of the VLDB Conference, pages 392-403, New York, USA, September 1998.

[KN99] Edwin Knorr and Raymond Ng. Finding intensional knowledge of distance-based outliers. In Proc. of the VLDB Conference, pages 211-222, Edinburgh, UK, September 1999.

[NH94] Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. In Proc. of the VLDB Conference, Santiago, Chile, September 1994.

[RKV95] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In Proc. of ACM SIGMOD, pages 71-79, San Jose, CA, 1995.

[RRS98] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. Technical report, Bell Laboratories, Murray Hill, 1998.

[RS98] Rajeev Rastogi and Kyuseok Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. of the Int'l Conf. on Very Large Data Bases, New York, 1998.

[Sam89] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1989.

[SAM98] S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. In Proc. of the Sixth Int'l Conference on Extending Database Technology (EDBT), Valencia, Spain, March 1998.

[ZRL96] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 103-114, Montreal, Canada, June 1996.
