William M. Pottenger, Ph.D. All Rights Reserved To be or not to be IID: That is the Question Higher Order Learning William M. Pottenger, Ph.D. Rutgers University and Intuidex, Inc. [email protected]; www.dimacs.rutgers.edu/~billp [email protected]; www.intuidex.com

Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)


DESCRIPTION

Much prior work has shown the practical value of modeling random variables as IID in order to simplify statistical inference, yet prior work has also shown this assumption to be suboptimal in terms of model performance. Occam’s razor prompts us to simplify explanations, and this talk will present how a very simple transform has been leveraged to improve performance of both generative and discriminative learners, as well as unsupervised learning, in a number of application domains including differentially private community discovery.


Page 2

Dr. William M. Pottenger www.dimacs.rutgers.edu/~billp

www.intuidex.com

• Example Application Areas
– Homeland Security/Law Enforcement/Criminal Justice Information Systems
– Decision Support Systems
– Information Retrieval Systems
– High Performance Computing

• Research Funded by
– National Science Foundation
– National Institute of Justice
– Department of Homeland Security
– Army Research Lab
– Commonwealth of Pennsylvania
– Corporate Partners, e.g., Lockheed-Martin, Kodak, PNNL, Boeing, etc.

• Associate Research Professor @ Rutgers University
– DIMACS & Computer Science

• CEO of Intuidex, Inc.

• Director of Transition for DHS S&T CCI Center

• Research Scientist @ NCSA

• M.S., Ph.D. in CS at UIUC

• Research Interests
– Statistical Relational Learning: leveraging higher-order relations in graphs of data
– Parallel and Distributed Visual & Data Analytics: analytics in a parallel and/or distributed environment
– Information Extraction: automatic extraction of keywords/features from text

Page 3

What is Higher Order Information?

• Swanson (‘91) posed problem: Migraine headaches (M)

– stress associated with M

– stress leads to loss of magnesium

– calcium channel blockers prevent some M

– magnesium is a natural calcium channel blocker

– spreading cortical depression (SCD) implicated in M

– high levels of magnesium inhibit SCD

– M patients have high platelet aggregability

– magnesium can suppress platelet aggregability

• All extracted from medical journal titles

Slide reused with permission of Marti Hearst @ UCB

Page 4

Gathering Evidence

[Figure: migraine linked to stress, CCB (calcium channel blockers), PA (platelet aggregability) and SCD, with magnesium paired with each of the four]

Slide reused with permission of Marti Hearst @ UCB

Page 5

Higher Order Paths!

[Figure: migraine and magnesium connected through the intermediate terms stress, CCB, PA and SCD]

Slide reused with permission of Marti Hearst @ UCB
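The Swanson example can be sketched as a small graph search. The following is a minimal illustration, not the method used in the talk: the link list is transcribed from the bullets of the Swanson slide, and the names `LINKS`, `build_graph` and `second_order_paths` are hypothetical.

```python
from collections import defaultdict

# Term pairs that co-occur in a title, transcribed from the Swanson slide.
# CCB = calcium channel blockers, PA = platelet aggregability,
# SCD = spreading cortical depression.
LINKS = [
    ("migraine", "stress"), ("stress", "magnesium"),
    ("migraine", "CCB"), ("CCB", "magnesium"),
    ("migraine", "SCD"), ("SCD", "magnesium"),
    ("migraine", "PA"), ("PA", "magnesium"),
]


def build_graph(links):
    """Build an undirected co-occurrence graph as an adjacency map."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def second_order_paths(graph, source, target):
    """Return the intermediate terms m for which source - m - target is a path."""
    return sorted(m for m in graph[source] if target in graph[m])


graph = build_graph(LINKS)
print("direct link:", "magnesium" in graph["migraine"])
print("bridging terms:", second_order_paths(graph, "migraine", "magnesium"))
```

No title links migraine and magnesium directly, yet four intermediate terms connect them, which is the sense in which the relation is higher order.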

Page 6

Related Work: Link Mining and Collective Classification

Link-based approaches (Taskar et al., 2001; Getoor and Diehl, 2005; Lu and Getoor, 2003; Neville and Jensen, 2004) to collective classification use explicit link information within networked data

Studies (Chakrabarti et al., 1998; Neville and Jensen, 2000; Taskar et al., 2001) have shown that collective classifiers can achieve significant reductions in classification errors by performing inference about multiple data instances simultaneously

Collective classifiers are context-dependent and are not designed to classify stand-alone data instances

We propose classification methods that leverage implicit links between features in small training sets, and that maintain the ability for “context-free” classification of individual data instances

Page 7

Is there a theoretical basis for the use of higher order co-occurrence relations?

• Research agenda: study machine learning algorithms in search of a theoretical foundation for the use of higher order relations

• First algorithm: Latent Semantic Indexing (LSI)

– Widely used technique in text mining and IR based on the Singular Value Decomposition (SVD) matrix factoring algorithm

– Research question: Does LSI use higher order term co-occurrence?

– First step: study SVD


April Kontostathis Associate Professor @ Ursinus College

Page 8

Is there a theoretical basis for the use of higher order co-occurrence relations in LSI?

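Singular Value Decomposition

$A_{m \times n} = T_{m \times r}\, S_{r \times r}\, D^{T}_{r \times n}$

A: term by document; T: term by dimension; S: diagonal matrix of singular values; D^T: dimension by document

$S = \mathrm{diag}(s_1, s_2, \ldots, s_r)$ with $s_1 \ge s_2 \ge s_3 \ge \ldots \ge s_r$

r = rank of A, m = number of terms, n = number of docs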

Page 9

Is there a theoretical basis for the use of higher order co-occurrence relations in LSI?

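LSI: Truncation of Singular Values

$A_k = T_{m \times k}\, S_{k \times k}\, D^{T}_{k \times n}$, where $A_k$ (m x n) is the reduced term by document matrix

T: term by dimension; S: diagonal matrix of singular values; D^T: dimension by document

Only the k largest of the singular values $s_1 \ge s_2 \ge s_3 \ge \ldots \ge s_r$ are kept

r = rank of A, m = number of terms, n = number of docs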

Page 10

Is there a theoretical basis for the use of higher order co-occurrence relations in LSI?

Columns, left to right: human, interface, computer, user, system, response, time, EPS, Survey, trees, graph, minors

human x 1 1 0 2 0 0 1 0 0 0 0

interface 1 x 1 1 1 0 0 1 0 0 0 0

computer 1 1 x 1 1 1 1 0 1 0 0 0

user 0 1 1 x 2 2 2 1 1 0 0 0

system 2 1 1 2 x 1 1 3 1 0 0 0

response 0 0 1 2 1 x 2 0 1 0 0 0

time 0 0 1 2 1 2 x 0 1 0 0 0

EPS 1 1 0 1 3 0 0 x 0 0 0 0

Survey 0 0 1 1 1 1 1 0 x 0 1 1

trees 0 0 0 0 0 0 0 0 0 x 2 1

graph 0 0 0 0 0 0 0 0 1 2 x 2

minors 0 0 0 0 0 0 0 0 1 1 2 x

Deerwester Term by Term Matrix

Columns, left to right: human, interface, computer, user, system, response, time, EPS, Survey, trees, graph, minors

human x 0.54 0.56 0.94 1.69 0.58 0.58 0.84 0.32 -0.32 -0.34 -0.25

interface 0.54 x 0.52 0.87 1.50 0.55 0.55 0.73 0.35 -0.20 -0.19 -0.14

computer 0.56 0.52 x 1.09 1.67 0.75 0.75 0.77 0.63 0.15 0.27 0.20

user 0.94 0.87 1.09 x 2.79 1.25 1.25 1.28 1.04 0.23 0.42 0.31

system 1.69 1.50 1.67 2.79 x 1.81 1.81 2.30 1.20 -0.47 -0.39 -0.28

response 0.58 0.55 0.75 1.25 1.81 x 0.89 0.80 0.82 0.38 0.56 0.41

time 0.58 0.55 0.75 1.25 1.81 0.89 x 0.80 0.82 0.38 0.56 0.41

EPS 0.84 0.73 0.77 1.28 2.30 0.80 0.80 x 0.46 -0.41 -0.43 -0.31

Survey 0.32 0.35 0.63 1.04 1.20 0.82 0.82 0.46 x 0.88 1.17 0.85

trees -0.32 -0.20 0.15 0.23 -0.47 0.38 0.38 -0.41 0.88 x 1.96 1.43

graph -0.34 -0.19 0.27 0.42 -0.39 0.56 0.56 -0.43 1.17 1.96 x 1.81

minors -0.25 -0.14 0.20 0.31 -0.28 0.41 0.41 -0.31 0.85 1.43 1.81 x

Deerwester Term by Term Matrix, truncated to two dimensions

Page 11

• Answer is in the following theorem we proved: If the ijth element of the truncated term by term matrix, Y, is non-zero, then there exists a co-occurrence path of order ≥ 1 between terms i and j.

– Kontostathis, A. and Pottenger, W. M. (2006) A Framework for Understanding LSI Performance. Information Processing & Management, volume 42, issue 1, pages 56-73.

• We have both proven mathematically and demonstrated empirically that LSI is based on the use of higher order co-occurrence relations.

• Next step?

Is there a theoretical basis for the use of higher order co-occurrence relations in LSI?
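The effect behind the theorem can be observed on a toy example. The sketch below assumes NumPy and a hypothetical four-term, three-document corpus (not the Deerwester data); it forms the truncated term by term matrix as T_k S_k^2 T_k^T and compares it with the first-order co-occurrence matrix A A^T.

```python
import numpy as np

# Hypothetical toy corpus: rows are terms a, b, c, d; columns are documents
# d1 = {a, b}, d2 = {b, c}, d3 = {c, d}. Terms a and c never share a document,
# but both co-occur with b, so a second-order path a - b - c exists.
A = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)


def truncated_term_by_term(A, k):
    """Term by term matrix T_k S_k^2 T_k^T from the k largest singular values."""
    T, s, _ = np.linalg.svd(A, full_matrices=False)  # s is sorted largest first
    T_k = T[:, :k]
    return T_k @ np.diag(s[:k] ** 2) @ T_k.T


first_order = A @ A.T            # first-order co-occurrence counts
Y = truncated_term_by_term(A, k=2)

print("first-order a-c entry:", first_order[0, 2])
print("truncated   a-c entry:", round(float(Y[0, 2]), 4))
```

With all three singular values kept, the product reproduces A A^T exactly; after truncation the a-c entry becomes non-zero even though a and c never co-occur directly.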

Page 12

Using Higher Order Information in both Generative and Discriminative Learning

• Extend the theoretical foundation that April and I developed by studying characteristics of higher-order information in other machine learning approaches, including both generative and discriminative supervised learning as well as unsupervised approaches

– Ganiz, M. C., Lytkin, N. I. and Pottenger, W. M. (2009) Leveraging Higher Order Dependencies Between Features for Text Classification. In the Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Bled, Slovenia, September.

Nikita Lytkin, Research Scientist @ NYU Medical Center

Murat Ganiz Assistant Professor @ Dogus University

Page 13

Representation of Boolean Data by a Bipartite Graph

Page 14

Multinomial vs. Multivariate Event Model

McCallum & Nigam (1998)

Page 15

First Order Paths in a Data Graph

Page 16

Second Order Paths in a Data Graph

Page 17

Patterns of Connectivity between Features

Page 18

Probabilistic Characterization of Features by Second Order Paths

Page 19

Higher Order Naïve Bayes: A Generative Learner

Murat Ganiz Assistant Professor @ Dogus University

Page 20

Slonim & Tishby (2001) vs. HONB

Ganiz, M. C., Pottenger, W. M. and George, C. (2010) Higher Order Naïve Bayes: A Novel Non-IID Approach to Text Classification. IEEE Transactions on Knowledge and Data Engineering (TKDE).

                 multinomial features              binary features
Dataset          NB      NB_wc   improvement %     NB      HONB    improvement %
COMP (5)         0.473   0.508    7.4              0.51    0.65    26.5
SCIENCE (4)      0.65    0.725   11.5              0.6     0.84    41.6
POLITICS (3)     0.62    0.67     8.1              0.68    0.83    22.8
RELIGION (3)     0.525   0.553    5.3              0.64    0.74    15.7
Average                           8.075                            26.65

HONB achieves statistically significantly better performance than NB for four datasets based on t-test results

(Slonim & Tishby, 2001) did not report std dev or t-test results

Page 21

Supervised Second Order Transformation for Discriminative Learning

Nikita Lytkin, Research Scientist @ NYU Medical Center

Page 22

Influence of Higher-Order Paths

Page 23

Experimental Setup

Support Vector Machine (Vapnik 1998) was used to evaluate the Supervised Second Order Transformation

Multi-class classification by SVM was performed using the “one-against-one” scheme

Used RBF and linear kernels in SVM and varied soft margin cost from 10^-4 to 10^4

Training set size varied from 5% to 60%

Eight experiments performed at each sample size
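The size of this experimental grid can be sketched with the standard library alone. The snippet is illustrative only: the spacing of the cost grid (powers of ten) is an assumption, since the slide gives just the endpoints, and the names `one_against_one_pairs`, `KERNELS`, `COSTS` and `TRIALS` are hypothetical. No SVM is trained here.

```python
from itertools import combinations, product


def one_against_one_pairs(classes):
    """One binary classifier per unordered pair of classes: k * (k - 1) / 2 in total."""
    return list(combinations(classes, 2))


KERNELS = ["rbf", "linear"]
# Assumption: costs spaced by powers of ten between the stated endpoints.
COSTS = [10.0 ** exponent for exponent in range(-4, 5)]
TRIALS = 8  # eight experiments at each sample size

# Every (kernel, cost, trial) combination evaluated at one training-set size.
configurations = list(product(KERNELS, COSTS, range(TRIALS)))

print(len(one_against_one_pairs(range(5))), "pairwise classifiers for a 5-class dataset")
print(len(configurations), "runs per training-set size under the assumed grid")
```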

Page 24

Experimental Setup (continued)

Six benchmark text corpora were selected

Stop words were removed, others were stemmed

For the RELIGION, POLITICS, SCIENCE and COMP subsets of the 20 Newsgroups dataset, the top 2000 terms ranked by Information Gain were selected; 500 documents per class were sampled at random for comparison with Slonim and Tishby (2001)

Dataset # classes total # docs # terms

RELIGION 3 1500 2000

POLITICS 3 1500 2000

SCIENCE 4 2000 2000

COMP 5 2500 2000

Citeseer 6 3312 3703

Cora 6 2708 1433

Page 25

Scalability Across Training Set Sizes

Page 26

Results for Naïve Bayes, SVM, HONB and HOSVM on 20NG REL & SCI Datasets

Page 27

Results for Naïve Bayes, SVM, HONB and HOSVM on Citeseer & Cora Datasets

Page 28

Significance of Results for Naïve Bayes, SVM, HONB and HOSVM on All Datasets


HONB consistently and statistically significantly outperformed NB on all datasets (significant at <= 5% p-value)

HOSVM outperformed SVM on the RELIGION, POLITICS and SCIENCE datasets (significant at <= 5% p-value)

Although the difference between HOSVM and SVM on the COMP dataset was significant only at the 0.158 level, HOSVM outperformed SVM on seven out of eight trials by an average of 3%

Page 29

What role do higher-order relations play in supervised machine learning?

• Higher-Order Collective Classification (HOCC)

– Classifies a set of instances simultaneously and thus exploits the relationships between them; based on a record-relation graph

– Capable of both supervised event detection and unsupervised anomaly detection

• Application: Classification and Anomaly Detection of Interdomain Routing Events

– Goal: detect and categorize such events

– Menon, V. and Pottenger, W. M. (2009) A Higher Order Collective Classifier for Detecting and Classifying Network Events. In the Proceedings of the IEEE International Conference on Intelligence and Security Informatics 2009 (ISI 2009)

Vikas Menon, Software Developer @ Bridgewater Associates

Page 30

HOCC Results

• Detection of Interdomain Routing Events and Anomalies Based on Higher-Order Path Analysis

– Slammer worm attack, Witty worm attack, 2003 East Coast Blackout

• Real Time Classification of Abnormal Events

– Sliding window samples of 120 three-second instances

– 180th window = start of event

– HOCC detects events and distinguishes anomalies

[Figure: two panels, Witty (Supervised) and Witty (Unsupervised)]

Page 31

What role do higher-order relations play in unsupervised machine learning?

• Next step? Consider unsupervised learning…

– Association Rule Mining (ARM)

• ARM is one of the most widely used algorithms in data mining

– Extend ARM to higher order… Higher Order Apriori

• LHOIM (Latent Higher-Order Information Mining)

• Experiments confirm the value of Higher Order Apriori on real world e-marketplace data

Shenzhi Li, Senior Software Engineer @ Ask (Ask.com)

Page 32

LHOIM Results on 20NG Computer Dataset

• Average error rate for 1st-order (top left) 2nd-order (top right)

• Average stdev for 1st-order (bottom left) 2nd-order (bottom right)


Li, S. Z., Wu, T., and Pottenger, W. M. (2005) Distributed Higher Order Association Rule Mining Using Information Extracted from Textual Data. SIGKDD Explorations, volume 7, issue 1, pages 26-35.

Page 33

Higher Order Graph Sampling on Reuters

[Figure: four panels plotting Accuracy in % (axis 0 to 70) against Training Sample % (axis 1 to 10) for three series: Naïve Bayes Random Sampling, Higher Order Naïve Bayes Random Sampling, and Higher Order Naïve Bayes Higher Order Sampling]

Higher Order Naïve Bayes improves the accuracy by at least 10%

Higher Order Naïve Bayes with Higher Order Sampling gives even better results

Patterns can be discovered using a much smaller sample – important for online learning

Cibin George, M.S. in CS @ Rutgers

Page 34

Higher Order (Online) Latent Dirichlet Allocation

Intuitively, this formula can be interpreted as a word being assigned to a topic in proportion to its frequency of occurrence in that topic. This is, in fact, our guiding intuition: we simply replace these term frequencies with higher-order frequencies.

Nir Grinberg, Ph.D. in CS @ Rutgers

Kashyap Kolipaka, Ph.D. in CS @ Rutgers

Christie Nelson, Ph.D. at RUTCOR @ Rutgers

Page 35

Modeling Social Media for Emergency Response in Port-au-Prince, Haiti

Cluster Geolocation

Page 36

Modeling Social Media for Emergency Response in Port-au-Prince, Haiti

Cluster Geolocation with predicted resource

Page 37

Research Futures: Privacy-Enhanced Higher Order Community Partitioning

Let I(i,j) be 1 if vertices i and j are in the same community (social network), and 0 otherwise. Then Newman’s Q-Modularity is defined as:

$Q = \sum_{i,j} \left(A_{ij} - \frac{d_i d_j}{2m}\right) I(i,j) = \sum_{i,j} \left(A_{ij} - P_{ij}\right) I(i,j)$

Generalization

$Q_l = \frac{1}{l} \sum_{k=1}^{l} \frac{1}{n^{k-1}} \sum_{i,j} \left(A^{k}_{ij} - P^{k}_{ij}\right) I(i,j)$

Q-Modularity counts edges inside each community and subtracts the expected number of edges inside the same community. Higher-order Ql counts the number of paths inside each community and subtracts the expected number of paths. We propose Ql as a measure of a community split and consider a combinatorial optimization approach.
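The idea of counting paths inside communities and subtracting their expected number can be sketched numerically. What follows is only a sketch under stated assumptions, not the authors' implementation: P_ij = d_i d_j / 2m as in Newman's modularity, the order-k terms taken as k-th matrix powers of A and P, a 1/n^(k-1) scaling per order, and an average over the l orders. The function name and the two-triangle test graph are hypothetical, and NumPy is assumed.

```python
import numpy as np


def higher_order_modularity(A, labels, l):
    """Average over k = 1..l of sum_ij (A^k - P^k)_ij * I(i, j) / n^(k-1).

    Assumptions (not confirmed by the slides): A^k and P^k are matrix powers,
    P_ij = d_i * d_j / (2m), and I(i, j) = 1 when i and j share a label.
    """
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    n = A.shape[0]
    degrees = A.sum(axis=1)
    P = np.outer(degrees, degrees) / degrees.sum()      # degrees.sum() == 2m
    same = (labels[:, None] == labels[None, :]).astype(float)

    total = 0.0
    A_k, P_k = np.eye(n), np.eye(n)
    for k in range(1, l + 1):
        A_k, P_k = A_k @ A, P_k @ P                     # k-th powers
        total += ((A_k - P_k) * same).sum() / n ** (k - 1)
    return total / l


# Hypothetical test graph: two disjoint triangles, vertices 0-2 and 3-5.
triangles = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    triangles[i, j] = triangles[j, i] = 1.0

good_split = [0, 0, 0, 1, 1, 1]
bad_split = [0, 0, 1, 0, 1, 1]
print(higher_order_modularity(triangles, good_split, l=1))
print(higher_order_modularity(triangles, bad_split, l=1))
```

For l = 1 the function reduces to the unnormalized Q-Modularity sum, so the split that matches the two triangles scores higher than a split that cuts across them.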

Alex Nikolov, Ph.D. in CS @ Rutgers

Page 38

Results on Ground Truth Data

• We optimized Ql using an LP rounding based approximation algorithm for correlation clustering.

• We ran our experiments on networks with known communities, and compared the known communities to our clustering using the Adjusted Rand Index.

Dataset \ l        1        2        3        4
Karate             0.5414   0.5669   0.5669   0.5669
Political Books    0.6250   0.6463   0.6463   0.6463

Page 39

Is Ql easier to approximate?

• We approximated Ql on random G_{n,p} graphs for different values of l and p.

• We used the ratio of the value of the found solution to the value of an LP relaxation as an estimate of the approximation factor.

• It seems that Ql is harder for denser graphs (p high) but easier for higher l.

l = 1 2 3 4 5

p = 0.03 0.9678 0.9840 1.0000 1.0000 0.9986

p = 0.12 0.1828 0.4542 -0.1179 0.8447 1.0000

p = 0.60 -0.1130 0.3975 1.0000 1.0000 1.0000

Page 40

Differential Privacy

• Differential Privacy [DMNS]: A randomized function K gives ε-differential privacy if for all graphs G1, G2 differing in a single edge and all subsets S of Range(K):

$\Pr[K(G_1) \in S] \le e^{\varepsilon} \cdot \Pr[K(G_2) \in S]$

• The global sensitivity of a real valued function f is:

$GS_f = \max_{G_1, G_2} |f(G_1) - f(G_2)|$

where G1, G2 differ in a single edge.

Page 41

Sensitivity of Ql

The global sensitivity of Ql is at most 5(2l – 1)/l for any fixed clustering.

By [DMNS], given a community split, outputting Ql + Lap(5(2l – 1)/(lε)) satisfies ε-differential privacy.
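The release step is an instance of the Laplace mechanism: add noise with scale equal to the global sensitivity divided by ε. A minimal standard-library sketch follows. The sensitivity is passed in as a parameter rather than hard-coded, and the function names and the example values are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two iid exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_release(value, sensitivity, epsilon, rng=None):
    """Return value + Lap(sensitivity / epsilon)."""
    if rng is None:
        rng = random.Random()
    return value + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical usage: release a community-split score of 0.57 with an
# illustrative sensitivity bound of 1.0 and privacy parameter epsilon = 1.0.
rng = random.Random(0)
print(private_release(0.57, sensitivity=1.0, epsilon=1.0, rng=rng))
```

A smaller ε or a larger sensitivity bound widens the noise, so each released value is less accurate but better protected.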

Page 42

Differentially Private Community Discovery

• The measure of community split Ql is insensitive.

– We can output the value of a community split differentially privately

• But we would like to design an algorithm Alg, such that:

– Alg outputs a community partition with high Ql ;

– Alg satisfies ε-differential privacy

• Considered in Differentially Private Combinatorial Optimization (Gupta et al. 2009), but there is no general method.

Page 43

Higher Order Q-Learning (HOQL)

REU Ashley Edwards

In HOQL, we classify states as being in a high reward class or a low reward class. States are added to a class based on a threshold. We use HONB classification for action selection. We combine our method with greedy action selection based on the formula:

$\varepsilon = 1 - \varepsilon_{start} \cdot (1 - episode_{current} / episode_{total})$

Q-values are updated based on the traditional formula:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$

where α is the learning rate and γ is the discount factor. In these results, α = 0.91, γ = 1, and ε_start = 0.8.

Ashley Edwards, Applicant for Ph.D. in CS @ Rutgers

Edwards, A. and Pottenger, W. M. 2011. Higher Order Q-Learning. IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning. Paris, France.
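The two HOQL formulas, the ε schedule and the Q-value update, can be sketched in a few lines. This is a minimal illustration only: the HONB-based action selection is not reproduced, the dictionary-backed Q-table and the function names are hypothetical, and the default parameter values are the ones quoted on the slide.

```python
def epsilon_schedule(episode_current, episode_total, epsilon_start=0.8):
    """epsilon = 1 - epsilon_start * (1 - episode_current / episode_total)."""
    return 1.0 - epsilon_start * (1.0 - episode_current / episode_total)


def q_update(Q, state, action, reward, next_state, actions, alpha=0.91, gamma=1.0):
    """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)].

    Q is a dict keyed by (state, action); missing entries count as 0.0.
    """
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(state, action)]


# Hypothetical two-action example.
Q = {}
updated = q_update(Q, "s0", "left", reward=1.0, next_state="s1", actions=["left", "right"])
print(updated, epsilon_schedule(0, 100), epsilon_schedule(100, 100))
```

Under this schedule ε grows from 1 - ε_start at the first episode to 1 at the last one.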

Page 44

Anomaly detection through machine-learning exposed that the Chinese government is capable of “line rate” MITM attacks. Due to pipelining in modern browser implementations, “censorware” is forced to remember a 5-tuple for every attempt a user makes to view censored content.

<ipSrc, ipDst, srcPort, dstPort, proto>

Chinese government routers use fiber-optics to do censorship at “line rate.”

They lose the ability to drop packets, so every censorware router in the path must store a 5-tuple and block responses.

This raises the question: “What kinds of computational complexity bottlenecks in ‘censorware’ can we exploit?”

For example, how large a “botnet” would be required to cause Chinese censorware routers to run out of memory?

A B MITM

User attempts to restart the connection.

Government servers use SEQ-1460 attack on TCP.

Government servers get user to establish new, fake connection

User accepts new, fake connection and retransmits.

Government rejects data transmission with RST packet.

Server doesn’t understand new, fake connection. Sends RSTs.

User rejects attempt to restart the connection.

Server assumes user is adversarial. Sends RSTs and kills connection.

REU Becker Polverini Using Clustering to Detect Censorware


Polverini, A. B. and Pottenger, W. M. 2011. Using Clustering to Detect Chinese Censorware. CSIIRW ’11 Oak Ridge National Labs, TN USA

Page 45

CCICADA technology transfer efforts

• Goal: Technology transfer to DHS users and customers

• Several Tech Transfer programs @ DHS S&T:

– E2E – Engage to Excel

– Tech Solutions

– SECURE

• CCICADA is committed to support these existing programs and to innovate new approaches – what can you do?

– Publish your open-source software!

– Commercialize your software!

– Start your own company… and sell to DHS!

Page 46

www.intuidex.com ©Intuidex 2013 48

Intuidex, Inc.

Presenter: William M. Pottenger, Ph.D. [email protected]

Page 47

About Intuidex

• Data Analytics and Data Model provider

• Focused on helping Organizations discover actionable intelligence from large, varied, and complex data sources

• Provides an open, extensible analytics platform, Watchman Analytics™

• Platform and components that facilitate enhanced real-time information extraction, consolidation, fusion and discovery from disparate structured and unstructured data streams

Page 48

The problem we solve: “Big Data”

• Data volume and complexity have increased exponentially

• The number of data sources has exploded, as have data formats, schemas and types

• The most valuable data is often unstructured and fragmented

• The necessary data to drive better decisions is often scattered across multiple data silos

• Data that is useful and valuable is often incomplete and requires other data sources to validate

• Data storage systems are often proprietary with limited interoperability

• Data from different sources regarding the same entities sometimes conflicts

Page 49

Differentiation

• Academic: Commercial Technology Development
o Lab @ Rutgers University
o Director of Tech Transition for DHS S&T CCI Center
o Close cooperation with Rutgers Office of Commercialization
o Three patents allowed, fourth pending

• Strategic Partnerships
o Rutgers University and DHS S&T Center of Excellence
o PNNL-DHS S&T National Visual Analytics Center
o Law Enforcement Partners: 3M (PIPS Technology)
o Customers in Intel / Defense sectors

Page 50

Analyst Information Overload

[Figure labels: FMV, COMINT, SIGINT, HUMINT, SIGACTS, OTHER; Analyst; Applications and Visualization Platforms, e.g., TIGR]

Page 51

[Figure: Watchman Analytics™ architecture. Labels: four Data Sources, each with an Indexing Routine; High Performance Index (IxHPI™); USER; Watchman Analytics™ Visualization; Customer Visualization]

Watchman Analytics™ components:

• Entity Extraction (IxExtract™)
• Feature Selection (IxFeatures™)
• Topic Modeling (IxTopics™)
• Rule Learning (IxRules™)
• Recommender (IxRecommend™)
• Alerting (IxAlert™)
• Clustering (IxCluster™)
• Data Validation (IxValidate™)
• Trending (IxEntityTrend™)
• Link Analysis (IxLinks™)
• Data Fusion (IxRelClu™)
• Entity Resolution (IxResolve™)

Page 52

• Web-based advanced data analytics and visualization solution

• Adobe Flex RIA framework

• Component Modules

• Synchronized

Watchman Analytics™ for BOSS

Page 53

Intuidex and 3M Partnership

Intuidex, Inc., a leader and innovator in data analytics (machine learning), is the pioneer of Higher Order Learning™ technologies that deliver unprecedented accuracy and efficiency in identifying linkages, trends and patterns across disparate information systems, in real time or near real time. Intuidex analytics have been licensed by customers in the US Defense and Intelligence Agencies, US Law Enforcement Agencies and the Fortune 500 to extract latent intelligence and insights from both structured and unstructured data sources.

3M (formerly PIPS Technology) is the worldwide leader in Automated License Plate Recognition (ALPR) technology. PIPS designs, manufactures, and supports its complete line of ALPR products and services for use in law enforcement, parking, tolling, and intelligent transportation systems. With over 20,000 cameras deployed around the globe and a wide range of patents covering their technology and its application, PIPS Technology is easily recognized as the leading provider of traffic related video imaging and license plate capture technology for public safety agencies everywhere.

Page 54

APPLICATIONS OF HIGHER ORDER LEARNING™

FROM

Page 55

• Objective: determine which COMINT is likely important and requires further analysis

• Data: plain text representation of comm-hits

• 400 samples drawn from Afghanistan theater

• Classification: two classes

• Class A, Class B

• Evaluation

• Compared IxHONB™ to Naïve Bayes (NB)

• Train on 5% to 90%, test on rest

• Averages (accuracy, precision, recall, ...) across 10-folds

Military Threat Detection Applications of Intuidex’s Higher Order Learning™

Page 56

Weighted F-measure performance of NB vs. IxHONB™
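The slides do not define how the weighted F-measure is computed. One common convention, assumed here, averages the per-class F1 scores weighted by the number of true instances of each class, whereas a non-weighted (macro) F-measure gives every class equal weight. The helper names below are hypothetical and the labels are made up for illustration.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 score for each class appearing in the true or predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


def macro_f1(y_true, y_pred):
    """Per-class F1 averaged with equal weight per class (non-weighted)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Made-up labels: three true "A" instances and one true "B" instance.
y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "B", "B"]
print(per_class_f1(y_true, y_pred))
print(weighted_f1(y_true, y_pred), macro_f1(y_true, y_pred))
```

With imbalanced classes the two averages can differ noticeably, since the weighted form is dominated by the larger class.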

Page 57: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)

MIRC (Chat) Entity Extraction
• Data from MIRC chat Comm Hits (COMINT) has been helpful to GMTI analysts in
– Determining the nature of movements detected by radar (e.g., wild animals don't radio their friends for help)
– Determining whether ground targets may represent a threat
– Validating known movements by corroborating with statements of locals (if they see a vehicle WE see, then we KNOW what the “dots” are)
• Some “dots” can talk!

Tactical Ground Reporting System (TIGR)
• A TIGR user on the battlefield has limited ability to refine a search the way an analyst can
• Only has temporal and spatial filters, and relies on pre-packaged intel from various sources input to TIGR (HUMINT, SIGACT, HUMINT)

Page 58: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)

Example Actionable Information
• IxRules™ aids a user in discovering rules for multiple entity types
– IED Trigger: “On 23 February 2006, at 12:30 PM, in Ba'qubah, Diyala, Iraq, assailants detonated a probable command-initiated improvised explosive device (IED) hidden in a soup vendor's handcart near an Iraqi Army patrol in the central market, killing eight Iraqi soldiers and eight civilians, wounding four Iraqi soldiers and 11 civilians, and causing unspecified damage to the public market. The Mujahidin Shura Council in Iraq (MSC) claimed responsibility.”
– Height: “… The suspect is described as black, medium complexion, 28-30 years old, clean-shaven, approximately 6 feet 8 inches tall, weighing 180-200 pounds, with a muscular build. He was last seen wearing a black sweatshirt, black pants, and a dark blue or black knit hat. …”
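To give a feel for what an extraction rule for the Height entity might look like, here is a small regular-expression sketch. IxRules™ itself is not shown in the slides, so the pattern `HEIGHT_RULE` and the helper `extract_heights` are hypothetical stand-ins, hand-written for this one phrasing style rather than discovered from data.

```python
import re

# Hypothetical hand-written rule: "<feet> feet [<inches> inches]".
HEIGHT_RULE = re.compile(
    r"\b(?P<feet>\d)\s*(?:feet|foot|ft)\b\.?\s*"
    r"(?:(?P<inches>\d{1,2})\s*(?:inches|inch|in)\b\.?)?",
    re.IGNORECASE,
)


def extract_heights(text):
    """Return (matched text, height in inches) for each height mention."""
    results = []
    for match in HEIGHT_RULE.finditer(text):
        feet = int(match.group("feet"))
        inches = int(match.group("inches") or 0)  # inches part is optional
        results.append((match.group(0).strip(), feet * 12 + inches))
    return results


report = ("The suspect is described as 28-30 years old, clean-shaven, "
          "approximately 6 feet 8 inches tall, weighing 180-200 pounds.")
print(extract_heights(report))
```

Note that the age range and the weight in the sample sentence are not matched, because the rule requires a unit of length directly after the number. A real rule set would need to cover many more surface forms than this single pattern does.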

Page 59: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)


Tactical Ground Reporting System: TIGR

Page 60: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)

Benefits to the Warfighter
1. Fusion of high-value COMINT intel provides significantly improved situational awareness for warfighters with ‘boots on the ground’
2. Extraction / summarization of high-value COMINT, SIGACT, HUMINT from unstructured, unleveraged text sources
3. Fusion of high-value COMINT and other text-based intel with GMTI and other intel sources

Technology Transition Description
Information extraction, summarization and fusion technologies to provide warfighter with situational awareness
• Transitioned to: ESC/CIEF, used in DARPA Tactical Ground Reporting System (TIGR)
• Fielded operationally at: Afghanistan and other theaters
• Customer(s): TIGR and users, e.g., GEOINT, FSR, S2, ISR, MAI, CPTI, JIEDDO MID, CIED, RFI, NASIC, Centcom TFs

From theater: “These are exactly the sort of quick and dirty SIGINT summaries I am trying to get. … Just wanted to make sure you know how happy our ground units are to get this information in a wrap up. This daily tipper has made our supported units very happy. Thanks for the consistent help.”

Page 61: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)

Counterterrorism Applications of Intuidex’s Higher Order Learning™

• Objective: Classify confidence in perpetrator identification for incidents in NCTC Worldwide Incident Tracking System (WITS)
• Data: relational tables from WITS
– Sampled ~1,000 incidents from 80,000-record corpus
– Included some free text
• Classification: five confidence classes
– Plausible, Likely, Unknown, Unlikely, Inferred (analyst)
• Evaluation
– Compared IxHONB™ to NB and LSI-kNN
– Train on 5% to 90% of sample, test on rest
– Averages (accuracy, precision, recall, ...) across 10 folds

Page 62: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)

[Figure: F-measure (y-axis, 0 to 0.7) versus Percentage of Training Set Available for Training (x-axis, 5 to 90) for HONB, LSI-kNN and NB]

Non-weighted F-measure performance of NB, LSI-kNN and IxHONB™

Page 63: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)

Nuclear Detection
• Data was taken from a Thermo Scientific handheld Spectroscopic Personal Radiation Detector called the Interceptor™
• 302 gamma-ray spectrum files
– 20 from Tc99m, the rest from other isotopes or background
– Small positive class size
• 1024 numeric channels per spectrum
– High dimensional space
• 14 labeled, high confidence isotopes
– Potassium (40K; 1.3 billion years)
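The "small positive class" and "high dimensional space" remarks can be made concrete with a little arithmetic on the figures given on this slide. The snippet below is only a back-of-the-envelope sketch; the variable names are made up, and the "always predict negative" baseline is an illustrative point of comparison rather than anything reported in the talk.

```python
# Figures taken from the slide.
n_spectra = 302    # gamma-ray spectrum files
n_positive = 20    # spectra from Tc99m (the positive class)
n_channels = 1024  # numeric channels (features) per spectrum

n_negative = n_spectra - n_positive

# Share of the data belonging to the positive class.
positive_rate = n_positive / n_spectra

# Accuracy of a trivial classifier that always predicts "not Tc99m"
# (illustrative baseline, not a result from the talk).
majority_baseline_accuracy = n_negative / n_spectra

# Training examples available per feature dimension.
samples_per_channel = n_spectra / n_channels

print(f"positive class share:       {positive_rate:.1%}")
print(f"always-negative accuracy:   {majority_baseline_accuracy:.1%}")
print(f"spectra per channel:        {samples_per_channel:.2f}")
```

With fewer spectra than channels and only 20 positive examples, plain accuracy can look high even for a classifier that never detects Tc99m, which is why precision, recall and significance tests matter in this setting.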

Page 64: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)

Sample of Results - Accuracy

                65%     60%     55%     50%     45%     40%     35%     30%     25%     20%
Ga67 – D-B      0.0002  0       0       0       0       0       0       0       0       0
Ga67 – N-D-B    1       1       1       0.343   0.778   0.697   0.39    0.57    0.26    0.06
I131 – D-B      0       0       0       0       0       0       0       0.01    0.251   0.16
I131 – N-D-B    0       0       0.002   0.008   0       0       0       0.01    0.002   0
In111 – D-B     0.136   0.017   0.01    0.001   0       0       0       0       0       0
In111 – N-D-B   1       0.08    0.389   0.005   0.001   0.037   0       0.18    0       0.45
Tc99m – D-B     0.049   0.095   0.001   0       0       0       0       0       0       0
Tc99m – N-D-B   0       0       0       0       0       0       0       0       0       0

Key
Statistically Significant difference: NB < HONB
Not Statistically Significant

Page 65: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)

Typical Intuidex Engagement

• Client environment analysis
– Infrastructure (hardware, software)
– Data sources
– Operations (relevant and related policies)
• Requirements Specification with SMEs
– Iterate until approved
• Deploy high-performance index engine
– Install, configure, test
• Deploy indexing routines
– Develop, configure, optimize
• Deploy analytics services
– (Optional) Develop custom services to spec
– Install, configure, test

Page 66: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)

Typical Intuidex Engagement

• (Optional) Existing visualization interface
– Design interface specification for existing framework
• Ground-truth development with SMEs
• System documentation
– Usage documentation
– Administration and Configuration documentation
– Visualization interface documentation (optional)
• Deployment validation
– Quality assurance
– Load testing
• Customer acceptance

Page 67: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)

Watchman Analytics™ Functionality

• Entity Resolution
• Online Monitoring
• Data Deconfliction
• Automated Alerting
• Interactive Analysis
• Entity Extraction
• Ad-hoc Reporting
• Entity Classification
• Privacy Protection*
• Quality Assurance
• Link-based Analysis
• Embedded Analytics

* Privacy protection is a major Intuidex research area and development thrust

Page 68: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)

• Intuidex, Inc. is a hi-tech start-up incorporated by William M. Pottenger, Ph.D.

• Thought Leadership in Data Analytics

• Key Partnerships

Page 69: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)


Acknowledgements

• I am very grateful to my hardworking, intelligent and creative (current and former) students and postdocs without whom none of this would have been possible: Kunikazu Yoda, Christie Nelson, Aleksandar Nikolov, Nir Grinberg, Cibin George, Christopher Janneck, Nikita Lytkin, Shenzhi Li, Murat Ganiz, Chirag Pandya, Kashyap Kolipaka, Vikas Menon, April Kontostathis, Tianhao Wu, Jirada Kuntraruk, Jason Perry, Mark Dilsizian (and >> others).

• I also thank Rutgers University, the National Science Foundation, the Department of Homeland Security and the National Institute of Justice. This material is based upon work partially supported by the National Science Foundation under Grant Numbers 0703698 and 0712139. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or Rutgers University.

• I also gratefully acknowledge the continuing help of my Lord and Savior, Yeshua the Messiah (Jesus the Christ) in my life and work.


Page 70: Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learning Group)


Thank you!

Q&A

