
1

Monitoring Message Streams: Retrospective and Prospective Event Detection

Fred S. Roberts, DIMACS, Rutgers University

2

DIMACS is a partnership of:

• Rutgers University

• Princeton University

• AT&T Labs

• Bell Labs

• NEC Research Institute

• Telcordia Technologies

http://dimacs.rutgers.edu

[email protected]

732-445-5928

3

Motivation:

• monitoring of global satellite communications (though this may produce voice rather than text)

• sniffing and monitoring email traffic

OBJECTIVE:

Monitor streams of textualized communication to detect pattern changes and "significant" events.

4

TECHNICAL PROBLEM:

• Given a stream of text in any language.

• Decide whether "new events" are present in the flow of messages.

• Event: a new topic, or a topic with an unusual level of activity.

• Retrospective or "Supervised" Event Identification: classification into pre-existing classes.

5

More Complex Problem: Prospective Detection or “Unsupervised” Learning

1) Classes change: new classes appear, or existing classes change meaning

2) A difficult problem in statistics

3) Recent new C.S. approaches

4) Algorithm detects a new class

5) Human analyst labels it; determines its significance

6

COMPONENTS OF AUTOMATIC MESSAGE PROCESSING

(1). Compression of Text -- to meet storage and processing limitations;

(2). Representation of Text -- put in form amenable to computation and statistical analysis;

(3). Matching Scheme -- computing similarity between documents;

(4). Learning Method -- build on judged examples to determine characteristics of document cluster (“event”)

(5). Fusion Scheme -- combine methods (scores) to yield improved detection/clustering.
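As a toy sketch (not the project's system), the five components might be wired together as follows. Everything here is an illustrative stand-in: hashing for compression, token counts for representation, cosine similarity plus token overlap as two matchers, a centroid learner, and a simple score average as the fusion scheme.

```python
import math
import zlib

DIM = 64  # (1) compression: hash every token into a small fixed space

def represent(text):
    # (2) representation: hashed bag-of-words count vector
    v = [0.0] * DIM
    for tok in text.lower().split():
        v[zlib.crc32(tok.encode()) % DIM] += 1.0
    return v

def cosine(a, b):
    # (3) matching: cosine similarity between document vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def train(judged):
    # (4) learning: one centroid and one vocabulary per judged class
    cents, vocab = {}, {}
    for label, texts in judged.items():
        vecs = [represent(t) for t in texts]
        cents[label] = [sum(col) / len(vecs) for col in zip(*vecs)]
        vocab[label] = {w for t in texts for w in t.lower().split()}
    return cents, vocab

def classify(text, cents, vocab):
    # (5) fusion: average a vector-space score and a token-overlap score
    v, words = represent(text), set(text.lower().split())
    def score(label):
        overlap = len(words & vocab[label]) / len(words | vocab[label])
        return 0.5 * cosine(v, cents[label]) + 0.5 * overlap
    return max(cents, key=score)

# invented "judged examples" for two event classes
judged = {
    "weather": ["storm winds rain coast", "heavy rain flood warning"],
    "finance": ["stock market shares fall", "bank rates market rise"],
}
cents, vocab = train(judged)
```

A new message such as "rain and storm on the coast" would then be routed to the "weather" cluster by the fused score.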

7

STATE OF THE ART:

Best results to date:

Retrospective Detection: David Lewis (2001), using simple Support Vector Machines.

Prospective Detection: results reported by a group from Oracle (2001), using a change-of-basis representation that builds on natural-language knowledge.

8

WHY WE CAN DO BETTER:

• Existing methods use some or all five automatic processing components, but don't exploit the full power of the components and/or an understanding of how to apply them to text data.

• Lewis' methods used an off-the-shelf support vector machine supervised learner, but tuned it for frequency properties of the data.

• The combination dominated competing approaches in the TREC-2001 batch filtering evaluation.

9

WHY WE CAN DO BETTER II:

• Existing methods aim at fitting into available computational resources without paying attention to upfront data compression.

• We hope to do better by a combination of:

more sophisticated statistical methods

sophisticated data compression in a pre-processing stage

• AltaVista: combining data compression with naïve statistical methods leads to some success.

10

COMPRESSION:

• Reduce the dimension before statistical analysis.

• Recent results: a "one-pass" reduction of the data can cut volume significantly without significantly degrading performance (e.g., using random projections).

• This is unlike feature-extracting dimension reduction, which can lead to bad results.

We believe that sophisticated dimension reduction methods in a preprocessing stage followed by sophisticated statistical tools in a detection/filtering stage can be a very powerful approach.
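A minimal sketch of the random-projection idea, assuming a plain Gaussian projection matrix (the actual one-pass methods may differ): pairwise Euclidean distances between high-dimensional document vectors are approximately preserved after projecting to far fewer dimensions.

```python
import math
import random

def random_projection(vectors, k, seed=0):
    # Multiply each d-dimensional vector by a random Gaussian d x k matrix
    # scaled by 1/sqrt(k); Euclidean distances are approximately preserved
    # (the Johnson-Lindenstrauss flavor of one-pass dimension reduction).
    rng = random.Random(seed)
    d = len(vectors[0])
    R = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)]
         for _ in range(d)]
    return [[sum(v[i] * R[i][j] for i in range(d)) for j in range(k)]
            for v in vectors]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy check: project 500-dimensional "documents" down to 200 dimensions.
rng = random.Random(1)
docs = [[rng.random() for _ in range(500)] for _ in range(3)]
proj = random_projection(docs, k=200)
```

After projection, the distance between any two of the toy documents stays within a modest relative error of the original distance, while the statistical stage downstream only has to handle 200 coordinates instead of 500.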

11

MORE SOPHISTICATED STATISTICAL APPROACHES:

• Representations: Boolean representations; weighting schemes

• Matching Schemes: Boolean matching; nonlinear transforms of individual feature values

• Learning Methods: new kernel-based methods (nonlinear classification); more complex Bayes classifiers to assign objects to the highest-probability class

• Fusion Methods: combining scores based on ranks, linear functions, or nonparametric schemes
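For illustration, a bare-bones multinomial Bayes classifier of the "assign objects to the highest-probability class" kind; the spam/ham training messages are invented, and real Bayes classifiers for this task would be considerably richer.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Bayes classifier with add-one smoothing:
    assign a document to the class with the highest posterior probability."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # word counts per class
        self.doc_n = Counter(labels)        # documents per class
        for text, lab in zip(texts, labels):
            self.counts[lab].update(text.lower().split())
        self.vocab = {w for c in self.counts.values() for w in c}
        return self

    def predict(self, text):
        best, best_lp = None, -math.inf
        total = sum(self.doc_n.values())
        for lab, wc in self.counts.items():
            lp = math.log(self.doc_n[lab] / total)       # log prior
            denom = sum(wc.values()) + len(self.vocab)
            for w in text.lower().split():
                lp += math.log((wc[w] + 1) / denom)      # smoothed likelihood
            if lp > best_lp:
                best, best_lp = lab, lp
        return best

# invented judged examples
nb = NaiveBayes().fit(
    ["cheap pills offer", "meeting agenda today", "offer cheap meds"],
    ["spam", "ham", "spam"])
```

With this training data, a new message like "cheap offer now" gets a higher posterior under the "spam" class even though the word "now" was never seen, thanks to the smoothing.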

12

THE APPROACH:

• Identify the best combination of newer methods through careful exploration of a variety of tools.

• Address issues of effectiveness (how well the task is done) and efficiency (in computational time and space).

• Use a combination of new or modified algorithms and improved statistical methods built on the algorithmic primitives.

13

IN LATER YEARS:

• Extend the work to unsupervised learning.

• Still concentrate on new methods for the 5 components.

• Emphasize "semi-supervised learning": human analysts help focus on the features most indicative of anomaly or change; algorithms assess incoming documents for deviation on those features.

• Develop new techniques to represent data so as to highlight significant deviation:

through an appropriately defined metric

with new clustering algorithms

building on analyst-designated features
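The semi-supervised idea above might be sketched as follows; the feature names and the z-score deviation measure are illustrative assumptions, not the project's actual metric. Analysts designate the features; the algorithm flags incoming messages that deviate from historical behavior on them.

```python
import math

def feature_stats(history):
    # Mean and standard deviation of each analyst-designated feature,
    # estimated over a window of past messages.
    stats = {}
    for f in history[0]:
        vals = [msg[f] for msg in history]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[f] = (mean, math.sqrt(var) or 1.0)  # guard zero std
    return stats

def deviation(msg, stats):
    # Largest absolute z-score across the designated features; a large
    # value flags a message that deviates from historical behavior.
    return max(abs(msg[f] - m) / s for f, (m, s) in stats.items())

# Hypothetical analyst-designated features (names are illustrative only).
history = [
    {"length": 100, "rare_word_rate": 0.02},
    {"length": 110, "rare_word_rate": 0.03},
    {"length": 95,  "rare_word_rate": 0.02},
    {"length": 105, "rare_word_rate": 0.04},
]
stats = feature_stats(history)
```

A message with a sharply elevated rare-word rate then scores far above a typical message, and a threshold (say, 3 standard deviations) hands it to the analyst.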

14

DIMACS STRENGTHS:

Strong team: statisticians, computer scientists, experts in information retrieval and library science.

DAVID MADIGAN, Rutgers Statistics:

NSF project on text classification.

An expert on Bayes classifiers.

Developing extensions beyond Bayes classifiers.

(Lewis is his co-PI and a subcontractor on his NSF grant.)

15

PAUL KANTOR, Rutgers, Library Information Science and Operations Research:

Expert on combining multiple methods for classifying candidate documents.

Expert on information retrieval and interactive systems -- human input to leverage filtering and processing capabilities.

DAVID LEWIS, Private Consultant:

Best basic batch filtering methods.

Extensive experience in text classification.

16

ILYA MUCHNIK, Rutgers Computer Science:

Developed a fast statistical clustering algorithm that can deal with millions of cases in reasonable time.

Pioneered the use of kernel methods for machine learning.

MUTHU MUTHUKRISHNAN, Rutgers Computer Science:

Developed algorithms for making one pass over text documents to gain information about them.

17

MARTIN STRAUSS, AT&T Labs:

Has new methods for handling data streams whose items are read once, then discarded.

RAFAIL OSTROVSKY, Telcordia Technologies:

Developed dimension reduction methods in the hypercube.

Used these in powerful algorithms to detect patterns in streams of data.

18

ENDRE BOROS, Rutgers Operations Research:

Developed extremely useful methods for Boolean representation and rule learning.

FRED ROBERTS, Rutgers Mathematics and DIMACS:

Developed methods for combining scores in software and hardware testing.

Long-standing expertise in decision making and the social sciences.

19

DAVID GOLDSCHMIDT, Director, Institute for Defense Analyses - Center for Communications Research:

• Advisory role.

• Long-standing partnership between IDA-CCR and DIMACS.

• Will sponsor and co-organize a tutorial and workshop on the state of the art in data mining and homeland security to kick off the project.

20

S.O.W.: FIRST 12 MONTHS:

• Prepare available corpora of data on which to uniformly test different combinations of methods.

• Concentrate on supervised learning and detection.

• Systematically explore and compare combinations of compression schemes, representations, matching schemes, learning methods, and fusion schemes.

• Test combinations of methods on common data sets and exchange information among the team.

• Develop and test promising dimension reduction methods.

21

S.O.W.: YEARS 2 AND 3:

• Combine leading methods for supervised learning with promising upfront dimension reduction methods.

• Develop research-quality code for the leading identified methods for supervised learning.

• Develop the extension to unsupervised learning:

detect suspicious message clusters before an event has occurred

use generalized stress measures indicating that a significant group of interrelated messages doesn't fit into the known family of clusters

concentrate on semi-supervised learning
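One hedged reading of "generalized stress measure" is the fraction of a batch of messages that no known cluster explains; this toy sketch is an assumption for illustration, not the project's definition. A batch that forms its own tight group far from every known centroid produces high stress and warrants analyst attention.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def stress(batch, centroids, radius):
    # Fraction of messages in the batch whose nearest known cluster
    # centroid is farther than `radius`: a high value suggests the batch
    # contains a group that the known family of clusters does not explain.
    outliers = sum(1 for v in batch
                   if min(dist(v, c) for c in centroids) > radius)
    return outliers / len(batch)

# Two known clusters (toy 2-D message representations).
centroids = [[0.0, 0.0], [10.0, 10.0]]
new_batch = [[5.0, -5.0], [5.5, -4.5], [4.8, -5.2]]  # far from both clusters
known_batch = [[0.5, 0.2], [9.8, 10.1]]              # near known clusters
```

Here `stress(new_batch, centroids, 2.0)` is 1.0 (every message is unexplained), while the batch of routine messages scores 0.0.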

22

IMPACT AT 12 MONTHS:

• We will have established a state-of-the-art scheme for classifying accumulated documents in relation to known tasks/targets/themes and for building profiles to track future relevant messages.

• We are optimistic that, through end-to-end experimentation, we will discover synergies between new mathematical and statistical methods for each of the component tasks, and thus achieve significant improvements in performance on accepted measures that could not be achieved by piecemeal study of one or two component tasks.

23

IMPACT AT 3 YEARS:

• We will have produced prototype code for testing the concepts, and a rigorously precise expression of the ideas for translation into a commercial or government system.

• We will have extended our analysis to semi-supervised discovery of potentially interesting clusters of documents.

• This should allow us to identify potentially threatening events in time for cognizant agencies to prevent them from occurring.