
Detection of Financial Statement Fraud and Feature Selection Using Data Mining Techniques

Advisor: Prof. 徐立群
Students: 吳泰霖 (R16031145), 顏伶安 (R16034193)
Jan 9, 2015


Briefly speaking…

Data mining

Computer audit department

Support

Which technique has the strongest ability to predict financial statement fraud?

Fraud costs US businesses more than $400 billion annually (Kirkos et al.).

1. Introduction


2. Literature Review

decision trees

neural networks (NN)

Bayesian belief networks

case based reasoning

fuzzy rule-based reasoning

hybrid methods

Aamodt and Plaza [1] and Kotsiantis et al.

Further, Deshmukh and Talluru

Calderon and Cheh

support vector machines

Cecchini et al. Huang et al.

Sohl and Venkatachalam

Cerullo and Cerullo

Koskivaara

Feroz et al.

Aamodt and Plaza and Kotsiantis et al.

Pacheco et al.


3. Methodology

• Dataset: 35 financial items for 202 companies from China (101 fraudulent and 101 non-fraudulent).

• Dataset processing: log transformation, then ten-fold cross-validation to improve the reliability of the results.
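The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the study's code: the slide does not say which logarithm was used, so `log1p` (which assumes non-negative inputs) and the helper names `log_transform` and `ten_fold_indices` are assumptions.

```python
import math
import random

def log_transform(values):
    # Assumption: log(1 + x), which is defined for zero-valued items.
    # The exact transform used in the study is not given on the slide.
    return [math.log1p(v) for v in values]

def ten_fold_indices(n_samples, n_folds=10, seed=0):
    # Shuffle the sample indices once, then deal them into folds.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

folds = ten_fold_indices(202)
for k, test_idx in enumerate(folds):
    # Every fold serves once as the test set; the other nine form the training set.
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
```

Averaging a metric over the ten test folds gives a steadier estimate than a single train/test split, which is the reliability gain the slide refers to.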


• Without feature selection

• With feature selection


Support Vector Machines (SVM)

• The training examples that are closest to the maximum margin hyperplane are called support vectors.

• All other training examples are irrelevant for defining the binary class boundaries.

• Financial applications: credit rating, time series prediction, and insurance claim fraud detection.
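The role of support vectors is easiest to see in one dimension, where the maximum-margin boundary is the midpoint between the two closest points of opposite classes. The sketch below assumes linearly separable data; the function name and the sample values are made up for illustration.

```python
def max_margin_threshold_1d(negatives, positives):
    # Assumes every negative value lies below every positive value.
    left = max(negatives)    # support vector of the negative class
    right = min(positives)   # support vector of the positive class
    if left >= right:
        raise ValueError("classes are not linearly separable in 1D")
    threshold = (left + right) / 2.0
    margin = (right - left) / 2.0
    return threshold, margin, (left, right)

neg = [0.5, 1.0, 2.0]
pos = [4.0, 5.0, 7.5]
threshold, margin, support = max_margin_threshold_1d(neg, pos)

# Dropping points that are not support vectors leaves the boundary unchanged.
reduced = max_margin_threshold_1d([1.0, 2.0], [4.0, 5.0])
```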


Genetic Programming (GP)

• An extension of genetic algorithms (GA).

• Evolutionary computation: reproduction, crossover, and mutation are applied repeatedly to find the best solution to the problem.
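A toy version of the reproduction, crossover, and mutation loop is sketched below. It evolves small expression trees toward a made-up target function rather than a fraud classifier; the operator set, depth cap, population size, and rates are all illustrative assumptions, not the study's settings.

```python
import random

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
MAX_DEPTH = 6  # assumption: cap tree depth so programs stay small

def random_tree(rng, depth):
    # Leaves are the variable "x" or a small constant.
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else float(rng.randint(0, 2))
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def tree_depth(tree):
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(tree_depth(tree[1]), tree_depth(tree[2]))

def evaluate(tree, x):
    if not isinstance(tree, tuple):
        return x if tree == "x" else tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def error(tree, samples):
    # Sum of squared errors against the target values (lower is better).
    return sum((evaluate(tree, x) - y) ** 2 for x, y in samples)

def random_subtree(rng, tree):
    # Walk down the tree and stop at a random node.
    while isinstance(tree, tuple) and rng.random() < 0.5:
        tree = tree[rng.randint(1, 2)]
    return tree

def replace_random(rng, tree, new):
    # Copy of tree with one randomly chosen subtree replaced by new.
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return new
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, replace_random(rng, left, new), right)
    return (op, left, replace_random(rng, right, new))

def evolve(samples, generations=30, size=40, seed=1):
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda t: error(t, samples))
        history.append(error(population[0], samples))
        survivors = population[: size // 2]  # reproduction: best half is copied over
        children = []
        while len(survivors) + len(children) < size:
            a, b = rng.choice(survivors), rng.choice(survivors)
            child = replace_random(rng, a, random_subtree(rng, b))  # crossover
            if rng.random() < 0.2:
                child = replace_random(rng, child, random_tree(rng, 2))  # mutation
            if tree_depth(child) > MAX_DEPTH:
                child = a  # reject oversized offspring
            children.append(child)
        population = survivors + children
    population.sort(key=lambda t: error(t, samples))
    history.append(error(population[0], samples))
    return population[0], history

samples = [(x, x * x + x) for x in range(-3, 4)]  # toy target: x^2 + x
best, history = evolve(samples)
```

Because the best half always survives, the best error in `history` can never get worse from one generation to the next.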


Multi-Layer Feedforward Neural Network (MLFF)

• MLFF is one of the most common NN structures: it is simple and effective, and it is used in a wide range of machine learning applications.
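A forward pass through such a network can be sketched in a few lines. The sigmoid activation, layer sizes, and weights below are arbitrary assumptions chosen for illustration; the slide does not describe the architecture used in the study.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    # One fully connected layer: each row of weights feeds one neuron.
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, network):
    # Signals flow strictly forward, layer by layer, with no feedback loops.
    activation = inputs
    for weights, biases in network:
        activation = layer(activation, weights, biases)
    return activation

# Hypothetical network: 3 inputs, one hidden layer of 2 neurons, 1 output neuron.
network = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
score = forward([0.2, 0.4, 0.6], network)[0]
```

In practice the weights would be learned (for example by backpropagation) rather than fixed by hand.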


Group Method of Data Handling (GMDH)

• An inductive learning algorithm for modeling complex systems.

• It is a self-organizing approach that tests increasingly complicated models and evaluates them using an external criterion on separate parts of the data sample.
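The selection idea, fitting candidates of increasing complexity on one part of the data and judging them on another, can be sketched as below. This is not a full GMDH (which builds layers of pairwise polynomial models); it only illustrates the external criterion, with two made-up candidates and toy data.

```python
def fit_constant(xs, ys):
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    # Closed-form least squares for a single input variable.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def external_error(model, xs, ys):
    # Mean squared error on data the model was NOT fitted on.
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def select_model(train, check):
    candidates = [("constant", fit_constant), ("linear", fit_linear)]
    scored = []
    for name, fit in candidates:
        model = fit(*train)                                   # fit on one part
        scored.append((external_error(model, *check), name))  # judge on another
    return min(scored)[1]

train = ([0.0, 1.0, 2.0, 3.0], [0.1, 2.1, 3.9, 6.0])
check = ([0.5, 1.5, 2.5], [1.0, 3.0, 5.0])
chosen = select_model(train, check)
```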


Logistic Regression (LR)

• The response variable is dichotomous (binary), i.e., y_i = 0 or 1 for all i = 1, ..., n.

• LR is a commonly used approach for performing binary classification.
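As a minimal sketch, LR models P(y_i = 1) with the logistic function and predicts class 1 when that probability is at least 0.5. The single-feature gradient-descent fit, learning rate, and toy data below are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=200):
    # Gradient descent on the mean log-loss for a single feature.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    # Binary decision: class 1 when the estimated probability is at least 0.5.
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b = fit_logistic(xs, ys)
labels = [predict(w, b, x) for x in xs]
```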


Probabilistic Neural Network (PNN)

• It is a pattern classification network based on the classical Bayes classifier, which is statistically an optimal classifier that seeks to minimize the risk of misclassification.

Input layer

Pattern layer

Summation layer

Output layer
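The four layers map onto a short classifier, sketched here with Gaussian kernels. It assumes equal class priors and equal misclassification costs; the smoothing parameter `sigma` and the two-feature training points are made up for illustration.

```python
import math

def pnn_classify(x, training, sigma=0.5):
    # Input layer: x is the feature vector to classify.
    # training maps each class label to a list of feature vectors.
    scores = {}
    for label, patterns in training.items():
        # Pattern layer: one Gaussian kernel per training example.
        kernels = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * sigma ** 2))
            for p in patterns
        ]
        # Summation layer: average kernel response per class.
        scores[label] = sum(kernels) / len(kernels)
    # Output layer: pick the class with the largest estimated density.
    return max(scores, key=scores.get)

training = {
    "fraud": [[1.0, 1.0], [1.2, 0.8]],
    "non-fraud": [[-1.0, -1.0], [-0.8, -1.2]],
}
label = pnn_classify([0.9, 1.1], training)
```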


4. Feature Selection

• In this study, we employed a feature selection phase using the simple t-statistic technique.

For a given feature, μ1 and μ2 are the sample means for fraudulent and non-fraudulent companies respectively, σ1 and σ2 are the corresponding sample standard deviations, and n1 and n2 are the corresponding numbers of samples.

(For given features)
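The formula itself did not survive on this slide. Given the symbols defined above, the usual two-sample t-statistic would read as follows; the absolute value is an assumption, made because features are ranked by the size of the statistic.

```latex
t = \frac{\lvert \mu_1 - \mu_2 \rvert}
         {\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}
```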


• The t-statistic is computed for each feature, and the 18 features with the highest values are selected.

• A high t-statistic value means the feature discriminates strongly between fraudulent and non-fraudulent companies.
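The ranking step can be sketched as below, using the standard unequal-variance two-sample t-statistic. The feature names and values are hypothetical, and the real study would use k=18 (or 10) over the 35 financial items.

```python
import math

def t_statistic(sample1, sample2):
    n1, n2 = len(sample1), len(sample2)
    mu1, mu2 = sum(sample1) / n1, sum(sample2) / n2
    # Sample variances (squared standard deviations).
    var1 = sum((v - mu1) ** 2 for v in sample1) / (n1 - 1)
    var2 = sum((v - mu2) ** 2 for v in sample2) / (n2 - 1)
    return abs(mu1 - mu2) / math.sqrt(var1 / n1 + var2 / n2)

def top_features(fraud, non_fraud, k):
    # fraud and non_fraud map feature name -> list of values, one per company.
    scores = {name: t_statistic(fraud[name], non_fraud[name]) for name in fraud}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical feature names and values, for illustration only.
fraud = {"net_profit": [1.0, 1.2, 0.9, 1.1], "inventory": [5.0, 7.0, 6.0, 8.0]}
non_fraud = {"net_profit": [3.0, 3.1, 2.9, 3.2], "inventory": [6.0, 7.0, 5.0, 8.0]}
ranked = top_features(fraud, non_fraud, k=1)
```

Here `net_profit` separates the two groups while `inventory` has identical means in both, so only the former is kept.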


Top 18 items selected by t-statistic based feature selection.


5. Results and Discussion

35 financial items for 202 companies, evaluated in three settings:

1. Without feature selection

2. With feature selection: top 18 items

3. With feature selection: top 10 items

Which classifier is best?


Result 1: without feature selection


Explanations

• Sensitivity (true positive rate) is the proportion of actual fraudulent companies that a model correctly predicts as fraudulent.

• Specificity (true negative rate) is the proportion of actual non-fraudulent companies that a model correctly predicts as non-fraudulent.
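Both measures follow directly from the confusion counts. A short sketch with made-up labels (1 = fraudulent, 0 = non-fraudulent):

```python
def sensitivity_specificity(actual, predicted):
    # Labels: 1 = fraudulent, 0 = non-fraudulent.
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # fraud caught
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # fraud missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # clean, passed
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # clean, flagged
    return tp / (tp + fn), tn / (tn + fp)

actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(actual, predicted)
# sens = 3/4 = 0.75, spec = 2/4 = 0.5
```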


Result 2: Top 18 by feature selection

Results based on AUC indicate that GP yielded the highest accuracy, followed by PNN, which was only marginally less accurate. From this we infer that the selected feature subset has high discriminatory power and that the left-over features contribute very little to financial fraud detection.
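For reference, AUC can be computed as the probability that a randomly chosen fraudulent company receives a higher score than a randomly chosen non-fraudulent one. The slide does not say how the study computed AUC, so this pairwise definition and the scores below are assumptions for illustration.

```python
def auc(scores_pos, scores_neg):
    # Fraction of (fraudulent, non-fraudulent) pairs ranked correctly;
    # ties count as half a correct pair.
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical classifier scores (higher = more likely fraudulent).
value = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
# 8 of the 9 pairs are ranked correctly, so value = 8/9.
```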


Result 3: Top 10 by feature selection


Top performers: PNN without feature selection; GP (top 18) and PNN (top 10) with feature selection.

t-test: do the top performers differ significantly from the remaining classifiers?


Result 4: PNN vs. other classifiers (without feature selection)

The t-statistic exceeds the critical value of 1.73 in all five comparisons (✓✓✓✓✓).

PNN is the best of all.

To find out whether the difference in average AUCs is statistically significant, we conducted a t-test between the top performer and each of the remaining classifiers.
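Such a comparison can be sketched as a two-sample t-statistic over per-fold AUCs, checked against the critical value of 1.73 quoted on this slide. The AUC values below are invented for illustration. Reading 1.73 as the critical value for 18 degrees of freedom (two samples of ten folds each) is an assumption; the slide does not state the degrees of freedom or significance level.

```python
import math

def two_sample_t(a, b):
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    return abs(ma - mb) / math.sqrt(va / na + vb / nb)

CRITICAL_VALUE = 1.73  # taken from the slide

# Hypothetical per-fold AUCs from ten-fold cross-validation.
top = [0.98, 0.97, 0.99, 0.98, 0.97, 0.99, 0.98, 0.98, 0.97, 0.99]
other = [0.75, 0.78, 0.72, 0.80, 0.74, 0.77, 0.73, 0.79, 0.76, 0.75]

t = two_sample_t(top, other)
significant = t > CRITICAL_VALUE
```

A check mark on the slide then corresponds to `significant` being true for that pairing.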


Result 5: GP vs. other classifiers (Top 18 by feature selection)

✓✓✓

PNN is almost as good as GP


Result 6: PNN vs. other classifiers (Top 10 by feature selection)

✓✓

PNN, GP, GMDH are good


Conclusion and future research directions


The End