Data Mining of E-Mails to Support Periodic & Continuous Assurance. Glen L. Gray, California State University at Northridge; Roger Debreceny, University of Hawai`i at Mānoa. 5th Symposium on Information Systems Assurance, Toronto: October 2007

Data Mining of E-Mails to Support Periodic & Continuous Assurance

Glen L. Gray
California State University at Northridge

Roger Debreceny
University of Hawai`i at Mānoa

5th Symposium on Information Systems Assurance

Toronto: October 2007

In this Presentation

Continuous monitoring of emails – why?
Technologies
  Social Network Analysis
  Text analysis
Challenges
Opportunities

Continuous Monitoring of Emails – Why?

Increased focus on forensic approaches to auditing

Increased interest in continuous assurance and monitoring of business processes

Emails = Organization’s DNA
Evidential matter on:
  Employee & management fraud (overrides)
  Compliance (e.g., HIPAA)
  Loss of intellectual property
  Corporate policies

Enron Email Archive

Released by Federal Energy Regulatory Commission

500K emails
151 Enron employees
Cleaned version at Carnegie Mellon
  www.cs.cmu.edu/~enron/
Relational DB version at USC
  www.isi.edu/~adibi/Enron/Enron_Dataset_Report.pdf

Email Mining Targets

[Diagram: Email Data Mining, with nodes for Key Word Queries, Deception Clues, Volume & Velocity, Social Network Analysis, Content Analysis, and Log Analysis]

Content Analysis

Key Word Queries

Yes, people do say self-incriminating things in their emails
  Fraud
  Corporate dysfunction
Overwhelming false positives
  Need “smart” compound queries
Good continuous auditing (CA) candidate
  Already scanning for spam, porn, etc.
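
A minimal sketch of what a "smart" compound query might look like, compared with a plain key word match. The term lists, the function names, and the eight-word proximity window are illustrative assumptions, not anything taken from the presentation or from a commercial product.

```python
import re

# Hypothetical term lists; a real deployment would tune these per organization.
RISK_TERMS = {"hide", "backdate", "shred"}
CONTEXT_TERMS = {"invoice", "revenue", "audit", "documents"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def simple_hit(text):
    """Plain key word query: flag any email containing a risk term."""
    return any(word in RISK_TERMS for word in tokenize(text))


def compound_hit(text, window=8):
    """Compound query: flag only when a risk term occurs within
    `window` words of a context term."""
    words = tokenize(text)
    risk_positions = [i for i, w in enumerate(words) if w in RISK_TERMS]
    context_positions = [i for i, w in enumerate(words) if w in CONTEXT_TERMS]
    return any(abs(r - c) <= window
               for r in risk_positions
               for c in context_positions)


harmless = "Let's hide the surprise party from Bob until Friday."
suspicious = "We need to hide this invoice before the audit starts."
print(simple_hit(harmless), compound_hit(harmless))
print(simple_hit(suspicious), compound_hit(suspicious))
```

The plain query flags both messages, while the compound query flags only the second, which is one way to cut down the false positives noted above.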

Sender Deception -- Content

Deceptive emails include:
  Fewer first-person pronouns to dissociate themselves from their own words
  Fewer exclusive words, such as but and except, to indicate a less complex story
  More negative emotion words because of the sender’s underlying feeling of guilt
  More action verbs to, again, indicate a less complex story
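
The four cues above could be measured as simple word-category rates, as in the sketch below. Only "but" and "except" come from the slide; every other word in the lists is a hypothetical placeholder, and the function name `cue_rates` is an assumption. A real study would use a validated dictionary and compare the rates against a sender's own baseline.

```python
import re

# Illustrative word lists. Only "but" and "except" come from the slide;
# the remaining entries are hypothetical placeholders.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
EXCLUSIVE = {"but", "except", "without", "unless"}
NEGATIVE_EMOTION = {"hate", "worried", "afraid", "angry", "guilty"}
ACTION_VERBS = {"go", "send", "move", "take", "run"}


def cue_rates(text):
    """Return the share of words in `text` that fall into each cue category."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty text

    def rate(vocabulary):
        return sum(word in vocabulary for word in words) / total

    return {
        "first_person": rate(FIRST_PERSON),
        "exclusive": rate(EXCLUSIVE),
        "negative_emotion": rate(NEGATIVE_EMOTION),
        "action_verbs": rate(ACTION_VERBS),
    }


rates = cue_rates("I sent it but I am worried")
print(rates)
```

Lower first-person and exclusive rates, together with higher negative-emotion and action-verb rates, would point in the direction the slide describes; the rates alone are not evidence of deception.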

Sender Deception -- Identification

Writeprint features
  Lexical -- characters & words
    Function words
    Root words
  Syntactic -- sentences
  Structural -- paragraphs
  Content-specific

Sender Deception -- Identification

Number of potential features unlimited
  Optimum number can vary by context and language
Developing user profiles and comparing new emails to profiles would be challenging for real-time CA
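
To make the profile idea concrete, here is a deliberately tiny sketch: a handful of lexical, syntactic, and structural features per email, a profile built as the mean feature vector of a sender's known emails, and a distance from a new email to that profile. The feature choice, the five function words, and the names `writeprint`, `build_profile`, and `distance` are all assumptions for illustration; published writeprint work uses far larger feature sets and scaled features.

```python
import math
import re

# Illustrative subset of function words; real feature sets are much larger.
FUNCTION_WORDS = ("the", "of", "and", "to", "in")


def writeprint(text):
    """Return a small feature vector for one email body."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    n_words = len(words) or 1
    lowered = [w.lower() for w in words]
    features = [
        sum(len(w) for w in words) / n_words,   # lexical: mean word length
        len(words) / (len(sentences) or 1),     # syntactic: words per sentence
        float(len(paragraphs)),                 # structural: paragraph count
    ]
    # Lexical: relative frequency of each function word.
    features += [lowered.count(fw) / n_words for fw in FUNCTION_WORDS]
    return features


def build_profile(texts):
    """Profile = mean feature vector over a sender's known emails."""
    vectors = [writeprint(t) for t in texts]
    return [sum(column) / len(column) for column in zip(*vectors)]


def distance(vector_a, vector_b):
    """Euclidean distance; features are left unscaled in this sketch."""
    return math.dist(vector_a, vector_b)


known = ["The cat sat. The dog ran.", "The report is in the folder. Send it to me."]
profile = build_profile(known)
print(distance(profile, writeprint("Extraordinarily verbose correspondence follows.")))
```

Even this toy version has to re-tokenize every new email and compare it against a stored profile per sender, which hints at why the slide calls the approach challenging for real-time CA.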

Temporal/Log Analysis

Volume & Velocity

Volume = number of emails a person sends and/or receives over a period of time.

Velocity = how quickly the volume changes.
Many external factors (e.g., vacations, seasonal activities) affect these numbers
  Need “rolling histogram”
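
One possible reading of the two definitions, sketched with the standard library only. Volume is counted per day, a rolling window smooths it, and velocity is the change between consecutive rolling totals. Treating the "rolling histogram" as a rolling window of daily counts is an interpretation, and the function names and the three-day window are assumptions.

```python
from collections import Counter
from datetime import date, timedelta


def daily_volume(sent_dates, start, days):
    """Volume: number of emails per day over `days` days beginning at `start`."""
    counts = Counter(sent_dates)
    return [counts[start + timedelta(days=i)] for i in range(days)]


def rolling_volume(volumes, window):
    """Total volume over the trailing `window` days, once the window is full."""
    return [sum(volumes[i - window + 1:i + 1])
            for i in range(window - 1, len(volumes))]


def velocity(series):
    """Velocity: change in volume between consecutive periods."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


start = date(2007, 10, 1)
sent = [start, start, start + timedelta(days=1)]
print(daily_volume(sent, start, 3))

volumes = [2, 2, 2, 8, 9]
rolled = rolling_volume(volumes, window=3)
print(rolled, velocity(rolled))
```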

Volume & Velocity

Key issue -- determining the optimum time intervals to sample the data

Continuous monitoring cannot be continuous in terms of sampling in real time

Comparing hourly, daily, and even weekly volumes and velocities will result in many false positives

Optimum time interval could vary by job title
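
A toy illustration of why the sampling interval matters. The same send times are bucketed hourly and daily, and any bucket above the mean plus two standard deviations is flagged. The data, the threshold rule, and the function names are assumptions made for this sketch: a hypothetical employee sends eight emails in one burst at the same hour on each of four days.

```python
from statistics import mean, pstdev


def bucket_counts(hour_offsets, width, total_hours):
    """Count emails in consecutive buckets that are `width` hours wide."""
    counts = [0] * (total_hours // width)
    for hour in hour_offsets:
        counts[hour // width] += 1
    return counts


def flag_spikes(counts, k=2.0):
    """Return indexes of buckets whose count exceeds mean + k * std deviation."""
    threshold = mean(counts) + k * pstdev(counts)
    return [i for i, count in enumerate(counts) if count > threshold]


# Hypothetical pattern: 8 emails at hour 9 of each day, for 4 days (96 hours).
offsets = [day * 24 + 9 for day in range(4) for _ in range(8)]

hourly_flags = flag_spikes(bucket_counts(offsets, 1, 96))
daily_flags = flag_spikes(bucket_counts(offsets, 24, 96))
print(hourly_flags, daily_flags)
```

With hourly buckets the routine morning burst is flagged every day, while daily buckets show a perfectly steady sender. Other patterns can just as easily mislead at the daily or weekly level, which is the point of the slide: the interval has to suit the person and the job.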

Social Network Analysis

Social Network Analysis

Social relationships as an undirected graph
Importance of understanding relationships within the flow of email exchanges

Social Network Analysis in Emails

Emails semi-structured data
  sender
  primary recipient(s)
  copied recipient(s)
  date
  subject line
Social groups and cliques
  CA = who doesn’t belong?
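
The header fields listed above are enough to build an undirected graph and ask a first version of the "who doesn't belong?" question. The sketch below uses plain dictionaries; the field names (`sender`, `to`, `cc`), the people, and the function names are hypothetical, and real social network analysis would add clique detection and edge weighting by date.

```python
from collections import defaultdict


def build_graph(emails):
    """Undirected graph: graph[a][b] = number of emails linking a and b."""
    graph = defaultdict(lambda: defaultdict(int))
    for message in emails:
        sender = message["sender"]
        for other in message["to"] + message["cc"]:
            if other != sender:
                graph[sender][other] += 1
                graph[other][sender] += 1
    return graph


def outside_contacts(graph, group):
    """For a known group, count how many members each outsider is linked to."""
    contacts = {}
    for member in group:
        for other in graph.get(member, {}):
            if other not in group:
                contacts[other] = contacts.get(other, 0) + 1
    return contacts


# Hypothetical header data (date and subject line omitted for brevity).
emails = [
    {"sender": "ann", "to": ["bob"], "cc": ["cy"]},
    {"sender": "bob", "to": ["cy"], "cc": []},
    {"sender": "cy", "to": ["dee"], "cc": []},
]
graph = build_graph(emails)
print(outside_contacts(graph, {"ann", "bob", "cy"}))
```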

Thread Analysis – This?

[Diagram: an email thread laid out along a time axis, with nodes labeled S, R, and C]

Thread Analysis – Or this?

[Diagram: an alternative layout of an email thread along a time axis, with nodes labeled S, R, and C]

Integrating Content Analysis and Social Network Analysis

[Diagram: Email Data Mining, with nodes for Key Word Queries, Deception Clues, Volume & Velocity, Social Network Analysis, Content Analysis, and Log Analysis]

Challenges of Email Mining

Textual
  Inconsistent use of abbreviations
  Misspelled words
  Smileys etc. etc.
  Replies, replies, and more replies…
Inability to identify:
  Identities of email participants
    [email protected]
  Roles and responsibilities
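
Some of the textual challenges can be reduced with a cleaning pass before any mining. The sketch below drops quoted reply text, removes a few common smileys, and expands abbreviations. The abbreviation table, the smiley pattern, and the assumption that replies are marked by ">" prefixes or an "-----Original Message-----" line are all illustrative; misspellings and participant identities are not addressed here.

```python
import re

# Hypothetical abbreviation table; a real one would be organization-specific.
ABBREVIATIONS = {"fyi": "for your information", "asap": "as soon as possible"}

# Crude pattern for smileys such as :) ;-) :D and :-(
SMILEY = re.compile(r"[:;]-?[)(DP]")

# Assumed marker that separates a reply from the earlier quoted message.
REPLY_MARKER = re.compile(r"^-+\s*Original Message\s*-+$", re.IGNORECASE)


def clean_body(body):
    """Keep only the newest text of an email and normalize it lightly."""
    kept = []
    for line in body.splitlines():
        if REPLY_MARKER.match(line.strip()):
            break  # everything below is the earlier message
        if line.lstrip().startswith(">"):
            continue  # skip lines quoted from an earlier message
        kept.append(line)
    text = SMILEY.sub("", " ".join(kept))
    words = [ABBREVIATIONS.get(word.lower(), word) for word in text.split()]
    return " ".join(words)


raw = (
    "FYI the numbers look fine :-)\n"
    "> old quoted line\n"
    "-----Original Message-----\n"
    "From: someone\n"
    "Please review"
)
print(clean_body(raw))
```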

What Do the Enron Emails Show?

People do say the darnedest things
What did he know and when did he know it?
Verified numerous bodies of email data mining research
  Content analysis
  Social network analysis

Tools

Content monitoring
  eSoft Corporation’s ThreatWall
  Symantec’s Mail Security 8x00 Series
  Vericept Corporation’s Vericept Content 360°
  Reconnex Corporation’s iGuard Appliance
  InBoxer, Inc. Anti-Risk Appliance
Social networks
  Microsoft SNARF
  Heer’s Vizster

Research Opportunities

Research Questions

Role of email monitoring in overall CA environment?

Join SNA with examination of textual patterns.
Link SNA with control environment
Frauds/control overrides footprint?
What email cleaning is required for CA purposes?
Privacy and policy issues?
Lessons from existing commercial products?

Your Questions

Thank You

[email protected]

[email protected]