



Page 1: Anomaly detection using pTrees (AKA outlier determination?) Some pTree outlier detection papers

[Scatter plot: points r1, v2, v3 and outliers o and 0, plotted against dim1 (horizontal) and dim2 (vertical).]

Anomaly detection using pTrees (AKA outlier determination?) Some pTree outlier detection papers:
- "A P-tree-based Outlier Detection Algorithm", ISCA CAINE 2004, Orlando, FL, Nov. 2004 (with B. Wang, D. Ren)
- "A Cluster-based Outlier Detection Method with Efficient Pruning", ISCA CAINE, Nov. 2004 (with B. Wang, D. Ren)
- "A Density-based Outlier Detection Algorithm using Pruning Techniques", ISCA CAINE 2004, Nov. 2004 (with B. Wang, K. Scott, D. Ren)
- "Outlier Detection with Local Pruning", ACM Conf. on Information and Knowledge Management, ACM CIKM 2004, Nov. 2004, Washington, D.C. (with D. Ren)
- "RDF: A Density-based Outlier Detection Method using Vertical Data Representation", IEEE ICDM 2004, Nov. 2004, Brighton, U.K. (with D. Ren, B. Wang)
- "A Vertical Outlier Detection Method with Clusters as a By-Product", Intl. Conf. on Tools in Artificial Intelligence, IEEE ICTAI 2004, Boca Raton, FL (with D. Ren)

Outlier_Filter_1: if dist(dataset_mean, dataset_vom) > threshold1 and standard deviation < threshold2 (suggests uni-modal?), then there are likely outliers (the outliers are the points furthest from the mean). Mask with P(dist(mean,x) > threshold3). March threshold3 down, or consider any point more than 2 standard deviations from the mean (in the direction of the vector mean-->vom only) to be an outlier?

Use pTrees to mask each mode? Then each mode is a cluster. How do we determine modes using pTree formulas (assuming no looping scans)? Find a series of vectors of percentiles, vop12.5%, vop25%, ..., vop87.5% (note vop25% = 1st quartile, vop50% = median, ...). The series may reveal modes. Note: this does not seem to work; I include it here just so no one else repeats this analysis. You may skip to the next slide otherwise. How do we decide when there is sufficient multimodality to conclude that d(mean,vom) > thr1 does not necessarily imply outliers? Or, use the series to determine where the modes are (if the delta on both sides of a vop, or of an interval of vops, is large, there's a mode there?). Once mode centers (and mode tail radii?) are determined (using those vops), one can use that info to isolate outliers within each mode region?
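The vop series above can be sketched with plain lists standing in for pTrees (a minimal illustration of the idea, not the pTree-formula version; the nearest-rank percentile convention is my assumption):

```python
# Hypothetical sketch of the vector-of-percentiles (vop) series: for each
# coordinate, take the pct-th percentile value; vop50 is then the vector
# of medians (vom).

def vop(points, pct):
    """Vector of percentiles: the pct-th percentile taken coordinate-wise."""
    n_dims = len(points[0])
    result = []
    for d in range(n_dims):
        col = sorted(p[d] for p in points)
        # nearest-rank percentile (one simple convention, assumed here)
        idx = min(len(col) - 1, int(pct / 100.0 * len(col)))
        result.append(col[idx])
    return result

points = [(1, 2), (2, 3), (3, 3), (4, 5), (50, 60)]  # (50,60) is remote
series = {p: vop(points, p) for p in (12.5, 25, 37.5, 50, 62.5, 75, 87.5)}
```

A jump between consecutive vops in the series (here between vop75 and vop87.5) is what the text hopes would flag a remote mode or outlier.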

Suppose two modes, the r-mode and the v-mode. Outliers at the o's? Find the vops.

[Figure: per-coordinate percentile vectors vop12.5, vop25, vop37.5, vop50 (= vom), vop62.5, vop75, vop87.5 marked along dim1 and dim2, together with the overall mean.]

The overall mean and vom are close to coincident, so from that analysis, existence of outliers is not established.

Have 2 modes: the vop25-centered mode, mode25, and the vop75-centered mode, mode75.
1. Remove the mode25 points (those that contribute all coordinates to the mode25 vops).
2. From what's left, compare mean and vom. Quite different; outliers = points far from vom. OutlierSet = {o}.
3. Remove the mode75 points, find mean and vom.
(Doesn't work, because it depends on which direction we count percentages: top-down/bottom-up.)


Do it a coordinate at a time (like best_gap FAUST). Instead of vop, use percentiles in each coordinate. If the percentile points in any coordinate show an outlier, then there is one there...

Page 2:

[Scatter plot: r-mode points r1, r2, r3 with outlier O; v-mode points v1, v2, v3, v4 with outlier 0; axes dim1, dim2.]

Again, assuming 2 modes, the r-mode and the v-mode, how do we find those clusters and the outliers (which are singleton clusters)?

Look for the dimension in which the clustering is best. In this case it's dim1 (3 clusters: {r1,r2,r3,O}, {v1,v2,v3,v4} and {0}). How do we determine that? Take each dimension in turn and, working left to right, when d(mean, median) > ¼ width, declare a cluster. Next take those clusters one at a time to the next dimension for further sub-clustering via the same algorithm. Starting with dim1 then:
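The left-to-right scan can be sketched as follows (my own reading of the rule, not the authors' code; the true median is used so that doubletons never trigger, as noted later, and plain lists stand in for the projected pTree data):

```python
# Greedy 1-D scan: sweep left to right, and when the running mean and
# median of the points accumulated so far differ by more than 1/4 of the
# accumulated width, declare everything before the jump a cluster and
# restart from the far point.

def scan_clusters(values, frac=0.25):
    vals = sorted(values)
    clusters, current = [], []
    for v in vals:
        current.append(v)
        width = current[-1] - current[0]
        mean = sum(current) / len(current)
        n = len(current)
        median = (current[(n - 1) // 2] + current[n // 2]) / 2
        if width > 0 and abs(mean - median) > frac * width:
            clusters.append(current[:-1])  # cluster = points before the jump
            current = [v]                  # restart with the far point
    clusters.append(current)
    return clusters
```

With a remote pair, e.g. `[1, 2, 100, 101]`, the scan splits off `[1, 2]` when 100 arrives, then collects `[100, 101]` as a second cluster.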


At this point we declare {r1,r2,r3,O} a cluster and start over.


At this point we need to declare a cluster, but which one, {0,v1} or {v1,v2}? We will always take the one on the median side of the mean - in this case, {v1,v2}. And that makes {0} a cluster (actually an outlier, since it's singleton). Continuing with {v1,v2}:


Declare {v1,v2,v3,v4} a cluster. Note we have to loop. However, rather than each single projection, the delta can be taken over the next m projections if they're close. Potential problem: what if there are 2 outliers in a row? Does this algorithm consider them a doubleton cluster or two separate singleton outliers?

Next we would take one of the clusters and go to the best dimension to separate that cluster into subclusters, etc.

Is there an Oblique version of this? Yes. We could consider a grid of Oblique direction vectors, e.g., For a 3-column dataset, consider one pointing to the center of each PTM triangle and look at the projections onto those lines for the best clustering. Then for each cluster we get, continue by projection again but only with those cluster points (for the best subclustering of those), etc. See next slide for PTM.

All that is needed is any ordering (not necessarily PTM) of a grid of points on the surface of the sphere. Consider the sphere in R^n, Sn ≡ { x=(x1,...,xn) ∈ R^n | Σ xi² = 1 }, which in polar coordinates is { p=(θ1,...,θn−1) | 0 ≤ θi ≤ 179 }. We can use lexicographic ordering of the polar coordinates: 0...00, 0...01, ..., 0...179, ..., 179...179. That would be 180^(n−1) vectors. Too many? If so, use units other than degrees, e.g., units of 30 degrees (so 0...5 for each angle), giving 6^(n−1) vectors for dim=n. Attribute relevance analysis is important.
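That angle grid can be sketched as follows (assuming standard hyperspherical coordinates; the function name and the 30-degree step are illustrative):

```python
# Enumerate all angle tuples (t1,...,t_{n-1}) with each ti in
# {0, 30, ..., 150} degrees in lexicographic order, and convert each to a
# Cartesian unit vector, giving 6**(n-1) direction vectors for dim=n.

import math
from itertools import product

def direction_grid(n, step_deg=30):
    vectors = []
    steps = range(0, 180, step_deg)
    for angles in product(steps, repeat=n - 1):   # lexicographic order
        rad = [math.radians(a) for a in angles]
        v, s = [], 1.0
        for t in rad:                 # hyperspherical -> Cartesian
            v.append(s * math.cos(t))
            s *= math.sin(t)
        v.append(s)
        vectors.append(v)
    return vectors
```

For a 3-column dataset this yields 36 unit vectors, each a candidate projection direction.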

Can always skip doubletons, since always mean=median.

Algorithm-2:

Another variation is to calculate the dataset mean and vector of medians. Then, on the projections of the dataset onto the line connecting the two, do the algorithm above. Then repeat on each declared cluster (maybe using a projection line other than the one through the mean and vom this second time, since that line would likely be in approximately the same direction as the first). Do this until there are no new clusters?

This may need to be adjusted in several ways, including the choice of the subsequent projection lines and the stopping condition... See the slide after the next one for an example.
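Algorithm-2's projection step can be sketched like this (a plain-list stand-in for the pTree computation; the even-size median convention is my own):

```python
# Project every point onto the unit vector from the mean to the vector of
# medians (vom), yielding the 1-D values on which the mean/median scan runs.

def mean_vom_projection(points):
    n, dims = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dims)]
    vom = []
    for d in range(dims):
        col = sorted(p[d] for p in points)
        vom.append((col[(n - 1) // 2] + col[n // 2]) / 2)
    direction = [v - m for v, m in zip(vom, mean)]
    norm = sum(c * c for c in direction) ** 0.5 or 1.0
    d_unit = [c / norm for c in direction]
    return [sum(p[i] * d_unit[i] for i in range(dims)) for p in points]
```

On the page-4 example data this reproduces mean (6.3, 5.9) and vom (6, 5.5), and the point (11,10) projects to the extreme end of the line, where the scan isolates it.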

Page 3:

PTM (pTree Triangular Mesh) decomposition of a Sphere


[Figure: trixelization of the sphere; axes RA and dec.]

The trixelization in the following ordering produces a sphere-surface-filling curve with good continuity characteristics. The picture at right shows the center (blue ball) and the sphere out around it.


Equilateral triangle (90° sector) bounded by longitudinal and equatorial line segments.

Traverse the next level of triangulation, alternating again with left-turn, right-turn, left-turn, right-turn..

Next, traverse the southern hemisphere in the reverse direction (just the identical pattern pushed down instead of pulled up), arriving at the southern neighbor of the start point.

Page 4:

[Scatter plot: blue points (4,9),(2,8),(5,8),(4,6),(3,4); green points (10,5),(9,4),(8,3),(7,2); outlier (11,10); axes dim1, dim2.]

Again, assuming 2 modes, how do we find clusters (and outliers) in dimension 1?

Work left to right, compare mean and median. When d(mean,median)>¼ width, declare a cluster.

Algorithm-2: a. Calculate the dataset mean and vector of medians: mn=(6.3,5.9), vom=(6,5.5). Project onto the line connecting them. Do the algorithm above (which isolates (11,10) as an outlier). b. Then repeat on a perpendicular line through the mean on each declared cluster (just one non-outlier cluster: the set of all blue and all green points). This isolates the blue cluster and the green cluster. Note: if mn and vom are far apart, that should imply multi-modality (either an outlier or small remote cluster(s) in the direction of the vom). Once the situation is determined, each cluster can be treated separately.



Algorithm-2.1: a.1 Calculate the dataset mean and vector of medians: mn=(6.3,5.9), vom=(6,5.5). Project onto the line connecting them, do the algorithm above (isolates (11,10) as an outlier). b.1 In each cluster, find the 2 points furthest from that line? Use the line through those 2 points next. Would this require that the projection be done one point at a time? Or can we determine those 2 points in one pTree formula?


Algorithm-2.2: a.2 Calculate the dataset mean and vector of medians: mn=(6.3,5.9), vom=(6,5.5). Project onto the line connecting them, do the algorithm above (isolates (11,10) as an outlier). b.2 Use a grid of unit direction vectors, {dvi | i=1..m}. For each, calculate the mean and vom of the projections of each cluster (except the singletons) onto that line. Take the one for which the separation is maximum.

Page 5:

The PTreeSet Genius for Big Data. Big Data is where it's at today! Querying Big Data is where DBMSs are at today. Underneath, a good foundation is needed. Our foundation is Big Vertical Data. The abstract data type, the PTreeSet (Dr. Greg Wettstein's invention!), is a perfect residualization of BVD! (For both DB querying and data mining? Since, as data structures, PTreeSets are both horizontal and vertical.) PTreeSets include methods for horizontal query, vertical DM, multihop Query/DM, XML.

A table, T(A1...An), as a PTreeSet data structure = a bit matrix with (typically) each numeric attribute converted to fixed point (negative numbers??) and bitsliced (pt_pos schema), and each categorical attribute bitmapped; or coded then bitmapped, or numerically coded then bitsliced (or left as is, i.e., a char(25) NAME column stored outside the PTreeSet? dates? addresses?).

Let A1..Ak be numeric with bitwidths bw1..bwk (0-based) and Ak+1..An categorical with category-counts cck+1...ccn; the PTreeSet is the bit matrix:

[Bit matrix sketch: one row per table row (row number 1..N), one bit column per bit slice A1,bw1 ... A1,0, A2,bw2, ..., and one bit column per category bitmap Ak+1,c1 ... An,ccn; each cell is 0 or 1.]

Methods for this data structure can provide fast horizontal row access; e.g., an FPGA could (with zero delay) convert each bit row back to the original data row.

Methods already exist to provide vertical (level-0 or raw pTree) access.

Any level-1 PTreeSet can be added: given any row partition (e.g., an equiwidth=64 row intervalization) and a row predicate (e.g., 50% 1-bits).

Add "level-1 only" DM methods; e.g., an FPGA device converts unclassified row sets to equiwidth=64, 50% level-1 pTrees, and then the entire batch would be FAUST-classified in one horizontal program. Or level-1 pCKNN.
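The bit-slice/row-reconstruction idea can be sketched in software (a toy stand-in for the FPGA path; the 8-bit width is an assumption):

```python
# A numeric column becomes `width` vertical bit slices (high bit first);
# a "horizontal" row access just re-assembles the bits at one row position.

def bit_slices(column, width=8):
    """Column of ints -> list of `width` bit strings, high bit first."""
    return [''.join(str((v >> b) & 1) for v in column)
            for b in range(width - 1, -1, -1)]

def row_value(slices, row):
    """Reconstruct the original value at `row` from the vertical slices."""
    return int(''.join(s[row] for s in slices), 2)

col = [38, 50, 50, 48]          # e.g., the first few IRIS SL values
slices = bit_slices(col)
```

Each string in `slices` plays the role of one raw pTree; `row_value` is the one-pass horizontal reconstruction.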

[Level-1 bit matrix sketch: one row per interval (interval number 1..roof(N/64)), with the same bit columns A1,bw1 ... A1,0, A2,bw2, ..., Ak+1,c1 ... An,ccn.]

pDGP (pTree Darn Good Protection): protect by permuting the column order (the permutation = the key). A random pre-pad for each bit column would make it impossible to break the code by simply focusing on the first bit row.
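A minimal sketch of the permutation idea (my own toy version, not the authors' implementation; the random pre-pad is omitted):

```python
# The secret key is a permutation of the bit-column order; without the key,
# columns can't be matched back to their attribute/bit-position meaning.

def permute_columns(bit_rows, key):
    """Apply the column permutation `key` to every bit row."""
    return [[row[k] for k in key] for row in bit_rows]

def invert_key(key):
    """Inverse permutation, used to decrypt."""
    inv = [0] * len(key)
    for pos, k in enumerate(key):
        inv[k] = pos
    return inv

rows = [[1, 0, 1, 1], [0, 0, 1, 0]]
key = [2, 0, 3, 1]                       # the permutation = the key
scrambled = permute_columns(rows, key)
restored = permute_columns(scrambled, invert_key(key))
```

Applying the inverse key recovers the original column order exactly.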

[AHG(P,bpp) sketch: a bit matrix with rows P = 1..7B and columns bpp = 1 2 3 ... 3B.]

Relationships (rolodex cards), such as AdenineHumanGenome, are 2 PTreeSets: the AHG People PTreeSet (shown) and the AHG BasePairPosition PTreeSet (the rotation of the one shown).

Vertical Rule Mining, Vertical Multi-hop Rule Mining and Classification/Clustering methods (viewing AHG as either a People table (cols = BPPs) or as a BPP table (cols = People)). MRM and Classification done in combination?

Any table is a relationship between row and column entities (heterogeneous entities); e.g., an image = a [reflectance-labelled] relationship between a pixel entity and a wavelength-interval entity. Always PTreeSetting both ways facilitates new research and makes horizontal row methods (using FPGAs) instantaneous (1 pass across the row pTree).

More security?: all pTrees same (max) depth, and intron-like pads randomly interspersed...

Page 6:

[Scatter plot: class R points (r) around mean mR and class V points (v) around mean mV, in dim1/dim2.]

FAUST Oblique (our best classifier?)

D ≡ mV − mR, d = D/|D|.

Separate class R using the midpoint-of-means (mom) method: calculate a = (mR + (mV − mR)/2) o d = ((mR + mV)/2) o d (works also if D = mR − mV).

PR = P(X o d < a): 1 pass gives the classR pTree.

Training ≡ placing cut-hyper-plane(s) (CHP) (an (n−1)-dim hyperplane cutting the space in two). Classification is 1 horizontal program (AND/OR) across pTrees, giving a mask pTree for each entire predicted class (all unclassifieds at a time). Accuracy improvement? Consider the dispersion within classes when placing the CHP. E.g., use the

1. vector_of_medians, vom, to represent each class, not the mean mV, where vomV ≡ (median{v1|v∈V}, median{v2|v∈V}, ...);

2. mom_std, vom_std methods: project each class on the d-line; then calculate the std (one horizontal formula per class using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR and mV).

[Figure: classes R and V in dim1/dim2, projected onto the d-line; vomR, vomV and the std of distances (vod) from the origin along the d-line are marked.]

Note: training (finding a and d) is a one-time process. If we don't have training pTrees, we can use horizontal data for a, d (one time), then apply the formula to test data (as pTrees).
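The mom cut can be sketched numerically (lists stand in for pTrees here; the one-pass pTree comparison becomes a plain dot-product comparison):

```python
# Midpoint-of-means cut: d = (mV - mR)/|mV - mR|, a = ((mR + mV)/2) o d,
# and a point X is predicted class R when X o d < a.

def mom_classifier(class_r, class_v):
    dims = len(class_r[0])
    mR = [sum(p[i] for p in class_r) / len(class_r) for i in range(dims)]
    mV = [sum(p[i] for p in class_v) / len(class_v) for i in range(dims)]
    D = [v - r for r, v in zip(mR, mV)]
    norm = sum(c * c for c in D) ** 0.5
    d = [c / norm for c in D]
    a = sum((r + v) / 2 * di for r, v, di in zip(mR, mV, d))
    return lambda x: 'R' if sum(xi * di for xi, di in zip(x, d)) < a else 'V'

classify = mom_classifier([(1, 1), (2, 2)], [(8, 8), (9, 9)])
```

Training produces only (d, a); classification of any point is then a single comparison, which is what makes the all-unclassifieds-at-a-time pTree form possible.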

Page 7:

pc bc lc cc pe age ht wt

Multi-hop Data Mining (MDM): relationship1 (Buys = B(P,I)) ties table1 (P = People, 2 3 4 5)

F(P,P) = Friends:
0 1 0 1
1 0 1 0
0 1 0 0
1 0 0 1

B(P,I) = Buys (I = Items, 2 3 4 5):
0 0 1 0
0 0 0 0
0 1 0 0
0 0 0 1

Define the NearestNeighborVoterSet of {f} using strong R-rules with F in the consequent? A correlation is a relationship. A strong cluster based on several self-relationships (but different relationships, so it's not just strong implication both ways) is a set that strongly implies itself (or strongly implies itself after several hops, or when closing a loop).

Find all strong A→C, A⊆P, C⊆I:
Frequent iff ct(P_A) > minsup, and
Confident iff ct( &_{p∈A} P_p AND &_{i∈C} P_i ) / ct( &_{p∈A} P_p ) > minconf.
Says: "A friend of all of A will buy C if all of A buy C." (The AND is always AND.)
Closures: A frequent, then A+ frequent. A→C not confident, then A→C− not confident.

ct( |_{p∈A} P_p AND &_{i∈C} P_i ) / ct( |_{p∈A} P_p ) > minconf: "a friend of any in A will buy C if any in A buy C."

ct( |_{p∈A} P_p AND |_{i∈C} P_i ) / ct( |_{p∈A} P_p ) > minconf: change to "a friend of any in A will buy something in C if any in A buy C."

(People = P = an axis with descriptive feature columns) to table2 (Items), which is tied by relationship2 (Friends = F(P,P)) to table3 (also P)... Can we do interesting clustering and/or classification on one of the tables, using the relationships to define "close" or to define the other notions?
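The confidence formula above can be sketched with Python ints as bitmaps (a toy encoding of my own, one bit per person; the 8-bit width mask is an assumption):

```python
# ct(X) is the 1-bit count of a bitmap; confidence of A -> C is
# ct(AND of antecedent maps AND AND of consequent maps) / ct(AND of antecedent maps).

def ct(bitmap):
    """Count of 1-bits in a non-negative bitmap."""
    return bin(bitmap).count('1')

def and_confidence(antecedent_maps, consequent_maps, width=8):
    ante = ~0
    for m in antecedent_maps:
        ante &= m
    both = ante
    for m in consequent_maps:
        both &= m
    mask = (1 << width) - 1          # finite width so ~0 has no infinite 1s
    return ct(both & mask) / ct(ante & mask)

conf = and_confidence([0b11101101], [0b10101011])
```

Swapping the `&=` accumulations for `|=` gives the two OR-variants of the rule in the same way.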

Category color size wt store city state country

Dear Amal, We looked at the 2012 cup too and, yes, it would form a good testbed for social media data mining work. Ya Zhu in our Sat group is leading on "contests" and is looking at the 2012 KDD Cup as well as the Heritage Provider Network Health Prize (see kaggle.com). Hoping also for a nice test bed involving our Netflix datasets (which you and then Dr. Wettstein prepared as pTrees and all have worked on extensively - Matt and Tingda Lu...). Hoping to find (in the Netflix-contest-related literature) a real-life social network (a social relationship between two copies of the Netflix customers, such as, maybe, Facebook friends) that we can use in conjunction with the Netflix "rates" relationship between Netflix customers and Netflix movies. We would be able to do something with that setup (all as PTreeSets both ways). For those new to dataSURG: Dr. Amal Shehan Perera is a Senior Professor in Sri Lanka and was a lead researcher in our group for many years.

He is the architect of using GAs to win the KDD Cup in both 2002 and 2006. He gets most of the credit for those wins, as it was definitely the GA work in both cases that pushed us over the top (I believe anyway). He's the best!! You would be wise to stay in touch with him.

Sat, Mar 24, Amal Shehan Perera <[email protected]>: Just had a peek into the slides last week and saw a request for social media data. Just wanted to point out that the 2012 KDD Cup is on social media data. I haven't had a chance to explore the data yet. If I do I will update you. Rgds, -amal

Page 8:

Bioinformatics Data Mining: Most bioinformatics done so far is not really data mining but is more toward the database-querying side (e.g., a BLAST search). What would be real Bioinformatics Data Mining (BDM)? A radical approach: view the whole Human Genome as 4 binary relationships between People and base-pair positions (ordered by chromosome first, then gene region?). AHG(P,bpp)

[AHG(P,bpp) bit matrix sketch: rows P = 1..7B, columns bpp = 1 2 3 4 5 ... 3B.]

AHG is the relationship between People and adenine (A) (1/0 for yes/no).
THG is the relationship between People and thymine (T) (1/0 for yes/no).
GHG is the relationship between People and guanine (G) (1/0 for yes/no).
CHG is the relationship between People and cytosine (C) (1/0 for yes/no).

Order bpp? By chromosome and by gene or region (level2 is chromosome, level1 is gene within chromosome.) Do it to facilitate cross-organism bioinformatics data mining?

This is a comprehensive view of the human genome (plus other genomes). Create both a People PTreeSet and a vertical human-genome PTreeSet DB, with a human health-records feature table associated with the People entity. Then use that as a training set for both classification and multi-hop ARM. A challenge would be to use some comprehensive decomposition (ordering of bpps) so that cross-species genomic data mining would be facilitated. On the other hand, if we have separate PTreeSets for each chromosome (or even each region - gene, intron, exon...) then we may be able to data-mine horizontally across all of these vertical pTree databases.

pc bc lc cc pe age ht wt


The red person's features are used to define classes. AHGp pTrees for data mining. We can look for similarity (near neighbors) in a particular chromosome, a particular gene sequence, or overall, or anything else.

Page 9:

A Facebook Member, m, purchases Item, x, and tells all friends. Let's make everyone a friend of him/herself. Each friend responds back with the Items, y, she/he bought and liked.

Facebook-Buys:

Members 1..4; F≡Friends(M,M):
0 1 1 1
1 0 1 1
0 1 1 0
1 1 0 1

P≡Purchase(M,I), I = Items 2..5:
0 0 1 0
1 0 0 1
0 1 0 0
1 0 1 1

X⊆I. MX ≡ &_{x∈X} P_x = the People that purchased everything in X.

FX ≡ OR_{m∈MX} F_m = the Friends of an MX person.

So, with X={x}: is "Mx purchases x" strong? Kx = OR_{m∈Px} F_m is frequent if Kx is large. This is a tractable calculation: take one x at a time and do the OR. It is confident if ct(Kx & Px) / ct(Kx) > minconf.


K2 = {1,2,4}, P2 = {2,4}, ct(K2) = 3, ct(K2 & P2)/ct(K2) = 2/3.

To mine X, start with X={x}. If it's not confident, then no superset is. Closure: X={x,y} for x and y forming confident rules themselves...
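The K2 example can be sketched with sets (toy F and P chosen by me to reproduce the slide's K2 = {1,2,4}, ct(K2) = 3, confidence 2/3; everyone is his/her own friend as stated above):

```python
# F[m] = the friend-set of member m (including m itself);
# P[x] = the set of members who bought item x.

F = {1: {1, 2, 4}, 2: {2, 4}, 3: {3}, 4: {1, 4}}   # toy Friends(M,M)
P = {2: {2, 4}, 3: {3}, 4: {1}, 5: {2, 4}}          # toy buyers per item

def K(x):
    """Union of friend-sets of everyone who bought x (the OR over P[x])."""
    k = set()
    for m in P[x]:
        k |= F[m]
    return k

k2 = K(2)                                  # friends of buyers of item 2
confidence = len(k2 & P[2]) / len(k2)      # ct(K2 & P2) / ct(K2)
```

One union per item x is exactly the "one x at a time and do the OR" tractability claim.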

ct( OR_{m∈Px} F_m AND Px ) / ct( OR_{m∈Px} F_m ) > minconf.

Kx = OR_{g∈ OR_{b∈Px} F_b } O_g is frequent if Kx is large (tractable: one x at a time and OR).

Kiddos 1..4; F≡Friends(K,B):
0 1 1 1
1 0 1 1
0 1 1 0
1 1 0 1

Buddies; P≡Purchase(B,I), I = Items 2..5:
0 0 1 0
1 0 0 1
0 1 0 0
1 0 1 1

Groupies; Others(G,K):
0 0 1 0
1 0 0 1
0 1 0 0
1 0 1 1

K2 = {1,2,3,4}, P2 = {2,4}, ct(K2) = 4, ct(K2 & P2)/ct(K2) = 2/4.


A Facebook buddy, b, purchases x and tells friends. Each friend tells all friends. Strong purchase possibility? Intersect rather than union (AND rather than OR). Send the ad to friends of friends.

Kiddos 1..4; F≡Friends(K,B):
0 1 1 1
1 0 1 1
0 1 1 0
1 1 0 1

Buddies; P≡Purchase(B,I), I = Items 2..5:
0 0 1 0
1 0 0 1
0 1 0 0
1 0 1 1

Groupies; Compatriots(G,K):
0 0 1 0
1 0 0 1
0 1 0 0
1 0 1 1

K2 = {2,4}, P2 = {2,4}, ct(K2) = 2, ct(K2 & P2)/ct(K2) = 2/2.


Page 10:

R11: 1 0 0 0 1 0 1 1

Given an n-row table, a row predicate (e.g., a bit-slice predicate, or a category map) and a row ordering (e.g., ascending on key; or, for spatial data, col/row-raster, Z, or Hilbert order), the sequence of predicate truth bits is the raw or level-0 predicate tree (pTree) for that table, row predicate and row order.

Given a raw pTree, P, a partition of it, par, and a bit-set predicate, bsp (e.g., pure1, pure0, gte50%One), the level-1 (par, bsp) pTree is the string of truths of bsp on consecutive partitions of par. If the partition is an equiwidth=m intervalization, it's called the level-1 stride=m bsp pTree.
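The level-1 construction can be sketched directly from the definition (bit strings stand in for pTrees; predicate names follow the slide):

```python
# Apply a bit-set predicate to consecutive stride-sized chunks of a raw
# (level-0) pTree, producing the level-1 stride=m bsp pTree.

def level1_ptree(raw_bits, stride, predicate):
    return ''.join(
        '1' if predicate(raw_bits[i:i + stride]) else '0'
        for i in range(0, len(raw_bits), stride))

gte50 = lambda chunk: chunk.count('1') * 2 >= len(chunk)  # gte50%One
pure1 = lambda chunk: set(chunk) == {'1'}                 # pure1

lev1 = level1_ptree('111011001000011', 5, gte50)
```

On the slide's P0_SL,1 = 111011001000011 with stride 5, gte50 yields 100 and pure1 yields 000, matching the examples below.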

IRIS Table
Name       SL SW PL PW Color
setosa     38 38 14  2 red
setosa     50 38 15  2 blue
setosa     50 34 16  2 red
setosa     48 42 15  2 white
setosa     50 34 12  2 blue
versicolor 51 24 45 15 red
versicolor 56 30 45 14 red
versicolor 57 28 32 14 white
versicolor 54 26 45 13 blue
versicolor 57 30 42 12 white
virginica  73 29 58 17 white
virginica  64 26 51 22 red
virginica  72 28 49 16 blue
virginica  74 30 48 22 red
virginica  67 26 50 19 red

P0_SL,0 = 000001011110001 (predicate: remainder(SL/2)=1; order: the given table order)
P0_Color=red = 101001100001011 (pred: Color=red; order: given)
P0_SL,1 = 111011001000011 (pred: rem(div(SL/2)/2)=1; order: given)

gte50%stride=5 P1_SL,1 = 100
pure1 str=5 P1_SL,1 = 000
gte25%str=5 P1_SL,1 = 111
gte75%str=5 P1_SL,1 = 100

P0_PW<7 = 111110000000000 (pred: PW<7; order: given)
gte50%stride=5 P1_PW<7 = 100
The gte50% st=5 pTree predicts setosa.

gte50%str=5 P1_C=red = 001
pure1 str=5 P1_C=red = 000
gte25%str=5 P1_C=red = 111
gte75%str=5 P1_C=red = 001

P0_SL,0 = 0000010111100011 (pred: rem(SL/2)=1; ord: given)
gte50%stride=4 P1_SL,0 = 0111
gte50%stride=8 P1_SL,0 = 01
gte50%stride=16 P1_SL,0 = 0

A level-2 pTree = a level-1 pTree on a level-1 (1-column table):
P1_gte50%,s=4,SL,0 ≡ gte50%stride=4 P1_SL,0 = 0111 (level-1 gte50 stride=4 pTree on the raw level-0 pTree)
level-2 gte50% stride=2 on it: P2_gte50%,s=4,SL,0 = 11

Page 11:

FAUST Satlog evaluation

Means (R G ir1 ir2) by class:
1: 62.83  95.29 108.12  89.50
2: 48.84  39.91 113.89 118.31
3: 87.48 105.50 110.60  87.46
4: 77.41  90.94  95.61  75.35
5: 59.59  62.27  83.02  69.95
7: 69.01  77.42  81.59  64.13

Stds (R G ir1 ir2) by class:
1: 8 15 13  9
2: 8 13 13 19
3: 5  7  7  6
4: 6  8  8  7
5: 6 12 13 13
7: 5  8  9  7

Oblique level-0 using midpoint of means:
     1's  2's  3's  4's  5's  7's
TP:  322  199  344  145  174  353
FP:   28    3   80  171  107   74

Non-oblique level-0:
TP:   99  193  325  130  151  257
Class actual-> 461 224 397 211 237 470

Non-oblique level-1 gte50:
TP:  212  183  314  103  157  330
FP:   14    1   42  103   36  189

Oblique level-0 using means and stds of projections (w/o class elimination):
TP:  359  205  332  144  175  324
FP:   29   18   47  156  131   58

Oblique level-0, means, stds of projections (with class elimination in 2,3,4,5,6,7,1 order; note that none occurs):
TP:  359  205  332  144  175  324
FP:   29   18   47  156  131   58

a = pmr + (pmv − pmr) · 2·pstdr/(2·pstdr + pstdv) = (pmr·pstdv + pmv·2·pstdr)/(pstdv + 2·pstdr)

Oblique level-0 using means and stds of projections, doubling pstdr; no elimination:
TP:  410  212  277  179  199  324
FP:  114   40  113  259  235   58

Oblique level-0, means, stds of projections, doubling pstdr; classify, eliminate in 2,3,4,5,7,1 order:
TP:  309  212  277  154  163  248
FP:   22   40   65  211  196   27

With 2s1, the number of FPs is reduced and TPs are somewhat reduced. Better? Parameterize the 2 to maximize TPs and minimize FPs. Best parameter?

Oblique level-0, means, stds of projections, doubling pstdr; classify, eliminate in 3,4,7,5,1,2 order:
TP:  329  189  277  154  164  307
FP:   25    1  113  211  121   33

above = (std+std_up)/gap_up, below = (std+std_dn)/gap_dn; suggests order 4,2,5,7,1,3.

     red          green        ir1          ir2          avg
cls  abv  below   abv  below   abv  below   abv   below
1    4.33 2.10    5.29 2.16    1.68 8.09   13.11  0.94   4.71
2    1.30         1.12         6.07         0.94         2.36
3    1.09 2.16    8.09 6.07    1.07 13.11                5.27
4    1.31 1.09    1.18 5.29    1.67 1.68    3.70  1.07   2.12
5    1.30 4.33    1.12 1.32   15.37 1.67    3.43  3.70   4.03
7    2.10 1.31    1.32 1.18   15.37         3.43         4.12

cls avg, ascending: 4: 2.12, 2: 2.36, 5: 4.03, 7: 4.12, 1: 4.71, 3: 5.27

2s1/(2s1+s2), elimination order 4,2,5,7,1,3:
TP:  355  205  224  179  172  307
FP:   37   18   14  259  121   33

                                   1    2    3    4    5    7   tot
actual:                          461  224  397  211  237  470  2000
nonObl L0 pure1             TP:   99  193  325  130  151  257  1155
nonOblique level-1 50%      TP:  212  183  314  103  157  330  1037
                            FP:   14    1   42  103   36  189   385
Obl L0 MeansMidPoint        TP:  322  199  344  145  174  353  1537
                            FP:   28    3   80  171  107   74   463
Obl L0 s1/(s1+s2)           TP:  359  205  332  144  175  324  1539
                            FP:   29   18   47  156  131   58   439
Obl L0 2s1/(2s1+s2) no elim TP:  410  212  277  179  199  324  1601
                            FP:  114   40  113  259  235   58   819
Obl L0 2s1/(2s1+s2) 234571  TP:  309  212  277  154  163  248  1363
                            FP:   22   40   65  211  196   27   561
Obl L0 2s1/(2s1+s2) 347512  TP:  329  189  277  154  164  307  1420
                            FP:   25    1  113  211  121   33   504
Obl L0 2s1/(2s1+s2) 425713  TP:  355  189  277  154  164  307  1446
                            FP:   37   18   14  259  121   33   482
BandClass rule mining       TP:    2   33   56   58    6   18   173
(below)                     FP:    0    0   24   46    0  193   263

BandClass rules: G[0,46]→2, G[47,64]→5, G[65,81]→7, G[81,94]→4, G[94,255]→{1,3}; R[0,48]→{1,2}, R[49,62]→{1,5}, R[82,255]→3; ir1[0,88]→{5,7}; ir2[0,52]→5.

Conclusion? MeansMidPoint and Oblique std1/(std1+std2) are best with the Oblique version slightly better.

I wonder how these two methods would work on Netflix?

Two ways:

UTbl(User, M1,...,M17770): for (u,m), umTrainingTbl = SubUTbl(Support(m), Support(u), m)

MTbl(Movie, U1,...,U480189): for (m,u), muTrainingTbl = SubMTbl(Support(u), Support(m), u)

Page 12:

Netflix data: {mk}, k = 1..17770.

mk(u,r,d): for each movie mk, rows (uID, rating, date): u_i1 r_mk,u d_mk,u; u_i2; ...; u_in_k. Avg 5655 users/movie.

Main(m,u,r,d): rows (mID, uID, rating, date): m1 u1 r_m,u d_m,u; m1 u2; ...; m17770 u480189 r_17770,480189 d_17770,480189. Avg 209 movies/user; 100,480,507 ratings in all.

MTbl(mID, u1...u480189): rows m1..mh..m17770, entries r_mh,uk (or 0/1); ~47B cells. MPTreeSet: 3*480189 bit slices wide.

(u,m) to be predicted, from umTrainingTbl = SubUTbl(Support(m), Support(u), m).

Of course, the two supports won't be tight together like that, but they are put that way for clarity.

Lots of 0s in the vector space of umTrainingTbl. We want the largest subtable without zeros. How? SubUTbl( ∩_{n∈Sup(u)} Sup(n), Sup(u), m )?

Using coordinate-wise FAUST (not Oblique): in each coordinate n∈Sup(u), divide up all users v ∈ Sup(n)∩Sup(m) into their rating classes, rating(m,v). Then:
1. Calculate the class means and stds. Sort the means.
2. Calculate the gaps.
3. Choose the best gap and define the cutpoint using the stds.

This of course may be slow. How can we speed it up?

Coordinate FAUST: in each coordinate v∈Sup(m), divide up all movies n ∈ Sup(v)∩Sup(u) into rating classes, then do steps 1-3 above.

Gaps alone are not best (especially since the sum of the gaps is no more than 4 and there are 4 gaps). Weighting (correlation(m,n)-based) is useful (the higher the correlation, the more significant the gap??). Cutpoints are constructed for just this one prediction, rating(u,m). Does it make sense to find all of them? Should we just find, e.g., which n-class-mean(s) rating(u,n) is closest to and make those the votes?
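Steps 1-3 can be sketched as follows (the std-weighted placement of the cutpoint is my assumption, by analogy with the s1/(s1+s2) placement used in the Satlog evaluation):

```python
# 1. class means and stds, sorted by mean; 2. gaps between adjacent means;
# 3. cut in the largest gap, split in proportion to the two stds.

import statistics

def best_cutpoint(classes):
    """classes: dict label -> list of 1-D projection values."""
    stats = sorted(
        (statistics.mean(v), statistics.pstdev(v), lab)
        for lab, v in classes.items())                     # step 1
    gaps = [stats[i + 1][0] - stats[i][0]
            for i in range(len(stats) - 1)]                # step 2
    i = gaps.index(max(gaps))                              # step 3: best gap
    (m1, s1, _), (m2, s2, _) = stats[i], stats[i + 1]
    return m1 + (m2 - m1) * s1 / (s1 + s2 or 1)            # std-weighted cut

cut = best_cutpoint({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [20, 22]})
```

Here the largest gap is between classes b and c, so the cut lands between their means, closer to the tighter class.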

UserTable(uID, m1,...,m17770): rows u1..uk..u480189, entries r_mh,uk (or 1/0); ~47B cells. UPTreeSet: 3*17770 bit slices wide.

(u,m) to be predicted: form umTrainingTbl = SubUTbl(Support(m), Support(u), m).
