ACM SAC 2010: E-mail Authorship Verification for Forensic Investigation. Farkhund Iqbal (Concordia University, Canada), Liaquat A. Khan (NUST, Pakistan), Benjamin C. M. Fung (Concordia University, Canada), Mourad Debbabi (Concordia University, Canada)

ACM SAC 2010

E-mail Authorship Verification for Forensic Investigation

Farkhund Iqbal

Concordia University

Iqbal_f@ciise.concordia.ca

Liaquat A. Khan

NUST, Pakistan

[email protected]

Benjamin C. M. Fung

Concordia University

[email protected]

Mourad Debbabi Concordia University

[email protected]

Agenda

Introduction

Motivation

Problem Definition

Related Work

Proposed Approach

Experimental Results

Conclusion

Motivation

From Fingerprint to Wordprint/Writeprint

Style markers and structural traits, patterns of vocabulary usage, common grammatical and spelling mistakes

The approach has been used in courts in the US, England (Court of Criminal Appeal), Ireland (Central Criminal Court), Northern Ireland, and Australia [H. Chen 2003].

Authorship Analysis

Attribution or identification

Verification or similarity detection

Characterization or profiling

[Figure: fingerprint and writeprint; the writeprint is derived from written works]

Authorship Analysis

Application domains:

Historic authorial disputes

Plagiarism detection

Legacy code

Cyberforensic investigation

Motivation

Anonymity abuse cybercrimes:

Identity theft and masquerade

Phishing and spamming

Child pornography

Drug trafficking

Terrorism

Infrastructure crimes: denial-of-service attacks

Forensic analysis of e-mails, with a focus on authorship analysis to collect evidence for prosecuting criminals in a court of law, is one way to reduce cybercrime [Teng 2004].

Online document

Content characteristics:

Short in size and limited in vocabulary

Informal and interactive communication

Spelling and grammatical errors

Symbolic and paralanguage

Large candidate set, more sample work

Additional information: time stamp, path, attachment, structural features

Problem Definition

To verify whether suspect S is or is not the author of a given malicious e-mail µ

Assumption #1: The investigator has access to e-mails previously written by suspect S

Assumption #2: The investigator has access to e-mails {E1,…,En} collected from a sample population U = {u1,…,un}

The task is to:

extract stylometric features

develop two models: a suspect model and a cohort/universal background model (UBM)

classify e-mail µ using the two models

[Figure: anonymous e-mail µ, suspect S, and the sample population feed a decision: verified?]

Related Work

Similarity detection [Abbasi and Chen 2008]

Application to detect abuse of reputation systems in online marketplaces (ensemble SVM)

Similarity detection for plagiarism detection [Van Halteren 2004]

Two-class classification problem [Koppel et al. 2007]

Application to authorial disputes over literary works

Proposed Approach

Feature Extraction

Lexical (word/character-based) features: word length, vocabulary richness, digit/caps distribution

Syntactic features (style markers): punctuation and function words ('of', 'anyone', 'to')

Structural and layout features: sentence length, paragraph length, presence of a greeting/signature, types of separators between paragraphs

Content-specific features: domain-specific keywords, special characters

Idiosyncratic features: spelling and grammatical mistakes
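As a rough illustration of the feature categories above, the following sketch computes a handful of lexical, syntactic, and structural features for one e-mail body. The function name `extract_features`, the feature names, the tokenisation regex, and the greeting heuristic are all assumptions made for this example; the slides do not specify how the authors implemented extraction, and only the three function words they mention are used.

```python
import re
import string

# Illustrative function-word list: the slides name only these three.
FUNCTION_WORDS = ("of", "anyone", "to")


def extract_features(email_text):
    """Return a small dict of stylometric features for one e-mail body."""
    words = re.findall(r"[A-Za-z']+", email_text)
    lower_words = [w.lower() for w in words]
    n_words = len(words) or 1        # guard against division by zero
    n_chars = len(email_text) or 1

    features = {
        # Lexical features
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "vocabulary_richness": len(set(lower_words)) / n_words,  # type-token ratio
        "digit_ratio": sum(c.isdigit() for c in email_text) / n_chars,
        "caps_ratio": sum(c.isupper() for c in email_text) / n_chars,
        # Syntactic features
        "punctuation_ratio": sum(c in string.punctuation for c in email_text) / n_chars,
    }
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = lower_words.count(fw) / n_words

    # Structural features: paragraphs are separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", email_text) if p.strip()]
    features["paragraph_count"] = len(paragraphs)
    stripped = email_text.strip()
    first_line = stripped.splitlines()[0].lower() if stripped else ""
    # Crude greeting heuristic (assumption): first line starts with a common salutation.
    features["has_greeting"] = int(first_line.startswith(("hi", "hello", "dear")))
    return features


sample = "Hi Bob,\n\nSend 2 copies to anyone.\n\nThanks"
print(extract_features(sample))
```

Content-specific and idiosyncratic features are left out here because they depend on a domain keyword list and a spelling resource that the slides do not describe.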

Model Development

Model type: universal background model, cohort model

Verification by classification

Verification by regression

Training & validation: 10-fold cross-validation

Model application: classification score, regression score
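The 10-fold cross-validation step can be sketched as follows. This is a generic illustration, not the authors' setup: the experiments rely on WEKA, which handles fold splitting internally, and the helper name `k_fold_indices` and its `seed` parameter are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with stride k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Each e-mail is used for testing exactly once and for training k - 1 times.
for train_idx, test_idx in k_fold_indices(25, k=10):
    print(len(train_idx), len(test_idx))
```

In each round a model would be trained on the e-mails in `train_idx` and scored on those in `test_idx`; the scores collected over all rounds are what the evaluation metrics are computed from.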

Evaluation Metrics

Two types of error can occur during evaluation

False Positive

declaring innocent as guilty

False Negative

declaring guilty as innocent

DET (Detection Error Trade-off) curve: plots the false positive rate against the false negative rate
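The points of a DET curve can be obtained by sweeping a decision threshold over the verification scores. The sketch below assumes that a higher score means "more likely written by the suspect" and that an e-mail is attributed to the suspect when its score is at or above the threshold; the function name `det_points` and the toy scores are made up for illustration.

```python
def det_points(target_scores, impostor_scores):
    """Return (threshold, false_positive_rate, false_negative_rate) triples.

    target_scores: scores of e-mails truly written by the suspect.
    impostor_scores: scores of e-mails written by someone else.
    """
    points = []
    for t in sorted(set(target_scores) | set(impostor_scores)):
        # False positive: an innocent author's e-mail is accepted.
        fpr = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False negative: the suspect's own e-mail is rejected.
        fnr = sum(s < t for s in target_scores) / len(target_scores)
        points.append((t, fpr, fnr))
    return points


for t, fpr, fnr in det_points([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]):
    print(f"threshold={t:.1f}  FP rate={fpr:.2f}  FN rate={fnr:.2f}")
```

Raising the threshold lowers the false positive rate and raises the false negative rate, which is the trade-off the curve visualises.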

Evaluation Metrics

Two evaluation metrics are borrowed from the speech processing community (NIST SRE)

Equal Error Rate (EER): the point on the DET curve where the probability of false alarm equals the probability of false rejection

Minimum Detection Cost Function (minDCF): the minimum, over all decision thresholds, of

DCF = 0.1 × False Rejection Rate + 0.99 × False Acceptance Rate
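A minimal sketch of how both metrics could be computed from two lists of scores, assuming higher scores favour the suspect and acceptance when the score is at or above the threshold. The function names and the toy scores are illustrative. The weights 0.1 and 0.99 are taken from the slide; they match the usual NIST SRE cost parameters (miss cost 10, false-alarm cost 1, target prior 0.01). The EER is approximated at the observed threshold where the two error rates are closest, without interpolation.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at one threshold."""
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return frr, far


def eer_and_min_dcf(target_scores, impostor_scores):
    """Approximate the EER and compute minDCF over the observed thresholds."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects everything
    best_gap, eer, min_dcf = None, None, None
    for t in thresholds:
        frr, far = error_rates(target_scores, impostor_scores, t)
        gap = abs(frr - far)
        if best_gap is None or gap < best_gap:
            # EER estimate: where false rejection and false acceptance are closest.
            best_gap, eer = gap, (frr + far) / 2
        dcf = 0.1 * frr + 0.99 * far  # cost function from the slide
        if min_dcf is None or dcf < min_dcf:
            min_dcf = dcf
    return eer, min_dcf


eer, min_dcf = eer_and_min_dcf([0.9, 0.8, 0.6, 0.4], [0.7, 0.5, 0.3, 0.1])
print(f"EER = {eer:.2%}, minDCF = {min_dcf:.4f}")
```

Because false acceptances are weighted almost ten times more heavily than false rejections, the threshold that minimises DCF is usually stricter than the one at the EER point.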

Experimental Evaluation

Classifiers: AdaBoost, DMNB, Bayes Net

Classifiers implemented in WEKA [Witten, I.H. and Frank, E. 2005]

Experimental Evaluation

Regression functions: Linear Regression, SVM-SMO Regression, SVM with RBF

Regression functions implemented in WEKA [Witten, I.H. and Frank, E. 2005]

Comparative study

Values of EER and minDCF for different functions

Verification   Classification                    Regression
               AdaBoost   DMNB     Bayes Net     Lin. Regres.   SVM-SMO   SVM-RBF
EER (%)        22.4       20.1     19.4          22.3           19.3      17.1
minDCF         0.0836     0.0858   0.0693        0.0921         0.0840    0.07

Conclusion

Applied classifiers and regression functions, evaluated with metrics from NIST SRE

EER of 17% on real-life e-mails (Enron e-mail corpus)

An EER of 17% is not convincing enough for forensic investigation

Corpus issues

Stylistic variation is hard to capture

Feature Contributions

Lexical features such as vocabulary richness and word length distribution are not very effective on their own.

The combination of word-based and syntactic features contributes significantly.

Structural features are extremely important in e-mail.

Content-specific features are only effective in specific applications.

Idiosyncratic features need a comprehensive thesaurus to be maintained.

Optimization of the feature space

References

J. Burrows. An ocean where each kind: statistical analysis and some major determinants of literary style. Computers and the Humanities August 1989;23(4–5):309–21.

O. De Vel. Mining e-mail authorship. Paper presented at the Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2000.

I. Holmes. The evolution of stylometry in humanities. Literary and Linguistic Computing 1998;13(3):111–7.

F. Iqbal, R. Hadjidj, B. C. M. Fung, and M. Debbabi. A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 2008. Elsevier.

B.C.M. Fung, K. Wang, M. Ester. Hierarchical document clustering using frequent itemsets. In: Proceedings of the third SIAM international conference on data mining (SDM); May 2003. p. 59–70

I. Holmes, R. S. Forsyth. The Federalist revisited: new directions in authorship attribution. Literary and Linguistic Computing 1995;10(2):111–27.

G.-F. Teng, M.-S. Lai, J.-B. Ma, and Y. Li. E-mail authorship mining based on SVM for computer forensic. In Proc. of the 3rd International Conference on Machine Learning and Cybernetics, Shanghai, China, August 2004.

J. Tweedie, R. H. Baayen. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities 1998;32:323–52.

G. Yule. On sentence length as a statistical characteristic of style in prose. Biometrika 1938;30:363–90.

G. Yule. The statistical study of literary vocabulary. Cambridge, UK: Cambridge University Press; 1944.

R. Zheng, J. Li, H. Chen, Z. Huang. A framework for authorship identification of online messages: writing-style features and classification techniques. Journal of the American Society for Information Science and Technology 2006;57(3):378–93.