elfinhe
Naïve Bayesian Anti-Spam based on Shallow Syntactic Parsing
Software Engineering Laboratory
Department of Computer Science
Sun Yat-Sen University
April 22, 2010
About 40 minutes
Outline
  Background
  Naïve Bayesian Anti-Spam
  Shallow Syntactic Parsing
  My Approach
  Evaluation
  Future Work
Background
Junk E-mail (Spam)
  Wastes users' time
  Fills up file server storage space
  Disrupts a company's daily work
Definition
  Unwanted
  Commercial
  Depends on the user's own criteria
Naïve Bayesian Anti-Spam
[Diagram: Data Sets -> Scan / Parse -> Token Sets -> Classifier -> Judge, with a "User Judge & Update" step]
Naïve Bayesian Anti-Spam (cont')
An automated method [1] that can
  learn from the data in a user's mail repository
  adapt to changes over time
  be personalized to an individual user's mail
[1] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. In: Proc. of the AAAI Workshop on Learning for Text Categorization, 1998, pp. 55-62.
Naïve Bayesian Anti-Spam (cont')
Hypothesis
  Independence assumption
[Figure: a naïve Bayesian network and a more complex Bayesian network]
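The independence assumption is usually stated as conditional independence of the features given the class. As a brief sketch in the deck's notation (X_i for the i-th token, C for the class), and not necessarily the exact form shown on the original slide:

```latex
P(X_1, X_2, \ldots, X_n \mid C) = \prod_{i=1}^{n} P(X_i \mid C)
```

Under this assumption each token contributes its own factor, which is what lets the classifier learn one probability per token instead of a joint table over all token combinations.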
Naïve Bayesian Anti-Spam (cont')
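Learning Procedure

$$P(C_{spam} \mid X_t) = \frac{P(C_{spam})\,P(X_t \mid C_{spam})}{\sum_{i=1}^{|C|} P(C_i)\,P(X_t \mid C_i)}$$

Judge Procedure

$$P(C_{spam} \mid X_1, X_2, \ldots, X_n) = \frac{\prod_{i=1}^{n} P(C_{spam} \mid X_i)}{\prod_{i=1}^{n} P(C_{spam} \mid X_i) + \prod_{i=1}^{n} \bigl(1 - P(C_{spam} \mid X_i)\bigr)}$$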
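The learning and judge procedures can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: there are only two classes (spam and ham), the prior defaults to 0.5, tokens unseen in both classes are skipped, and per-token probabilities are clamped by a hypothetical `eps` so that the products never reach exactly 0 or 1. All function names are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def token_probs(mails: Iterable[List[str]]) -> Dict[str, float]:
    """Relative frequency of each token within one class's mail."""
    counts = Counter(tok for mail in mails for tok in mail)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def spam_given_token(tok: str, p_spam: Dict[str, float], p_ham: Dict[str, float],
                     prior_spam: float = 0.5) -> Optional[float]:
    """Learning step: Bayes' rule for one token over the classes {spam, ham}."""
    num = prior_spam * p_spam.get(tok, 0.0)
    den = num + (1.0 - prior_spam) * p_ham.get(tok, 0.0)
    return None if den == 0 else num / den  # None: token never seen (assumption: skip it)


def judge(tokens: List[str], p_spam: Dict[str, float], p_ham: Dict[str, float],
          eps: float = 1e-6) -> float:
    """Judge step: combine the per-token spam probabilities of one mail."""
    prod_spam, prod_ham = 1.0, 1.0
    for tok in tokens:
        p = spam_given_token(tok, p_spam, p_ham)
        if p is None:
            continue
        p = min(max(p, eps), 1.0 - eps)  # assumed clamping, avoids a 0/0 result
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


# Tiny made-up corpora, for illustration only.
p_spam = token_probs([["cheap", "invoice", "offer"]])
p_ham = token_probs([["meeting", "invoice"]])
print(judge(["cheap", "invoice"], p_spam, p_ham))
```

A mail with no known tokens scores 0.5 in this sketch, which amounts to "no evidence either way"; a real filter would pick its own default and decision threshold.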
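An Example

Spam Data Set: 法轮功 (Falun Gong)
  Token counts: 法:1 轮:1 功:1
  P(法) = 0.3, P(轮) = 0.3, P(功) = 0.3
Ham Data Set: 法律 (law)
  Token counts: 法:1 律:1
  P(法) = 0.5, P(律) = 0.5

Per-token spam probabilities
  P(Spam | 法) = 0.3 / (0.3 + 0.5) = 0.375
  P(Spam | 轮) = 0.3 / (0.3 + 0) = 1
  P(Spam | 功) = 0.3 / (0.3 + 0) = 1
  P(Spam | 律) = 0 / (0 + 0.5) = 0

A New Mail: 功律
  P(Spam | 功, 律)
    = P(Spam | 功) * P(Spam | 律) / (P(Spam | 功) * P(Spam | 律) + (1 - P(Spam | 功)) * (1 - P(Spam | 律)))
    = (1 * 0) / (1 * 0 + 0 * 1)
  The numerator is 0, so the mail is scored as 0 here. Note that the denominator is also 0 with these unsmoothed estimates, so the ratio is strictly 0/0.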
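The arithmetic of this example can be checked with a short sketch. It uses the rounded token frequencies from the slide as given, and the helper names are illustrative. The numerator and denominator of the combined score are returned separately because both come out as zero for the mail 功律 when one token has probability exactly 1 and the other exactly 0.

```python
# Token frequencies as printed on the slide (rounded values, taken as given).
spam_freq = {"法": 0.3, "轮": 0.3, "功": 0.3}
ham_freq = {"法": 0.5, "律": 0.5}


def p_spam_given(token: str) -> float:
    """P(Spam | token) from the two frequencies; token must occur in some data set."""
    s = spam_freq.get(token, 0.0)
    h = ham_freq.get(token, 0.0)
    return s / (s + h)


def combine_parts(tokens: str):
    """Return (numerator, denominator) of the judge formula for a mail."""
    prod_spam, prod_ham = 1.0, 1.0
    for t in tokens:
        p = p_spam_given(t)
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam, prod_spam + prod_ham


per_token = {t: p_spam_given(t) for t in "法轮功律"}
numerator, denominator = combine_parts("功律")
print(per_token)
print(numerator, denominator)
```

Because the denominator vanishes as well, a practical filter would typically smooth or clamp the per-token estimates before combining them; that step is an addition to the example, not something shown on the slide.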
Shallow Syntactic Parsing
Syntactic Features [2]
  POS
  Chunk
  Dependency Relations [3]
  Predicate-argument Structure

  Named Entities
  WordNet Senses
  Class-Specific Related Words
[2] X. Li, D. Roth, and K. Small. The Role of Semantic Information in Learning Question Classifiers. Proceedings of the International Joint Conference, 2004.
[3] K. Hacioglu. Semantic Role Labeling Using Dependency Trees. Proceedings of the 20th International Conference on Computational Linguistics, 2004.
Shallow Syntactic Parsing
[Figure: an example of dependency relations]
My Approach
Motivation
  Raise precision (reduce the misclassification of ham as spam)
  General parsing is inefficient
  Few attempts have been made to study syntactic information in the context of classification [2]
Domain-Specific Properties
  Phrases (e.g., 出售发票, "invoices for sale") [1]
  Dependency Relations [3]
My Approach (cont')
Feature Space
  Words
  Phrases
  Dependency relations within a sentence
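My Approach (cont')
Read a token
Determine the token's part of speech
  If it is a verb:
    Save feature (current verb, nearest noun)
    Save feature (adjacent verb, current verb)
    Save feature (adjacent adverb, current verb)
  If it is a noun:
    Save feature (nearest verb, current noun)
    Save feature (adjacent adjective, current noun)
  Otherwise:
    Save feature (current word)
Update the record of previous words
Read the next token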
My Approach (cont'): An Example
Origin: 现有一部分普通发票代开 ("Some ordinary invoices are now available to be issued on your behalf")
Token: 现/tg 有/vyou 一部分/m 普通/a 发票/n 代开/vt
Naïve Features: (现/tg), (有/vyou), (一部分/m), (普通/a), (发票/n), (代开/vt)
Syntactic Features: (有/vyou, 发票/n), (普通/a, 发票/n), (发票/n, 代开/vt)
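The single-pass extraction described in the flowchart can be sketched as follows. This is an illustrative reading, not the author's code. It assumes that POS tags beginning with `v`, `n`, `a`, and `d` mark verbs, nouns, adjectives, and adverbs, that "nearest" means the most recent earlier token of that class, that "adjacent" means the immediately preceding token, and that each pair is emitted in sentence order, as in the example above.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def extract_features(tokens: List[Token]):
    """Return (syntactic pairs, single-word features) in one left-to-right pass."""
    pairs: List[Tuple[Token, Token]] = []
    words: List[Token] = []
    prev: Optional[Token] = None       # immediately preceding token
    last_verb: Optional[Token] = None  # most recent verb seen so far
    last_noun: Optional[Token] = None  # most recent noun seen so far

    for tok in tokens:
        tag = tok[1]
        if tag.startswith("v"):                       # current token is a verb
            if last_noun is not None:
                pairs.append((last_noun, tok))        # (nearest noun, current verb)
            if prev is not None and prev[1].startswith(("v", "d")):
                pairs.append((prev, tok))             # adjacent verb or adverb
            last_verb = tok
        elif tag.startswith("n"):                     # current token is a noun
            if last_verb is not None:
                pairs.append((last_verb, tok))        # (nearest verb, current noun)
            if prev is not None and prev[1].startswith("a"):
                pairs.append((prev, tok))             # (adjacent adjective, current noun)
            last_noun = tok
        else:
            words.append(tok)                         # keep the word itself
        prev = tok                                    # update the record of past words
    return pairs, words


tokens = [("现", "tg"), ("有", "vyou"), ("一部分", "m"),
          ("普通", "a"), ("发票", "n"), ("代开", "vt")]
pairs, words = extract_features(tokens)
for left, right in pairs:
    print(f"({left[0]}/{left[1]}, {right[0]}/{right[1]})")
```

Because only the previous token and the most recent verb and noun are remembered, the pass is linear in sentence length, which fits the motivation of avoiding a full parse.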
Evaluation
              Fact Spam    Fact Ham
Judge Spam    A            B
Judge Ham     C            D

Recall:

$$R = \frac{A}{N_s} = \frac{A}{A + C}$$

Precision:

$$P = \frac{A}{A + B}$$

Weighted Error Rate [4]:

$$WErr = \frac{\lambda B + C}{\lambda N_l + N_s}, \quad \text{where } \lambda = 999,\; N_s = A + C,\; N_l = B + D$$

[4] I. Androutsopoulos, J. Koutsias, and K. V. Chandrinos. An Evaluation of Naïve Bayesian Anti-Spam Filtering. Proc. of the workshop, 2000.
Evaluation (cont')
Syntactic Parsing:
              Fact Spam    Fact Ham
Judge Spam    A = 11647    B = 0
Judge Ham     C = 3662     D = 4043

Naïve Parsing:
              Fact Spam    Fact Ham
Judge Spam    A = 14378    B = 13
Judge Ham     C = 930      D = 4042

                              Syntactic Parsing    Naïve Parsing
Recall                        76.08%               93.9%
Precision                     100%                 99.91%
Weighted Error Rate           0.09%                0.34%
Time Cost (s / 1000 mails)    3.81                 2.26
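The recall, precision, and weighted error rate figures can be re-derived from the two confusion matrices with a few lines of Python. This is a small verification sketch: the function names are illustrative, λ = 999 follows the evaluation definition, and the pairing of each matrix with its method is inferred from the metrics it reproduces.

```python
LAMBDA = 999  # weight on ham misjudged as spam


def recall(a: int, b: int, c: int, d: int) -> float:
    """Share of actual spam that was judged spam: A / (A + C)."""
    return a / (a + c)


def precision(a: int, b: int, c: int, d: int) -> float:
    """Share of mail judged spam that really was spam: A / (A + B)."""
    return a / (a + b)


def weighted_error(a: int, b: int, c: int, d: int, lam: int = LAMBDA) -> float:
    """WErr = (lam * B + C) / (lam * N_l + N_s), N_s = A + C, N_l = B + D."""
    n_s = a + c
    n_l = b + d
    return (lam * b + c) / (lam * n_l + n_s)


# Confusion matrices (A, B, C, D) as reported in the evaluation.
syntactic = (11647, 0, 3662, 4043)
naive = (14378, 13, 930, 4042)

for name, m in (("Syntactic", syntactic), ("Naive", naive)):
    print(f"{name}: recall={recall(*m):.2%} "
          f"precision={precision(*m):.2%} WErr={weighted_error(*m):.2%}")
```

With λ = 999, a single ham message judged as spam costs as much as 999 missed spam messages, which is why the method with zero false positives has the lower weighted error rate despite its lower recall.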
Evaluation (cont')
Advantage
  Higher Precision
  Acceptable Speed
  Lower Weighted Error Rate
Disadvantage
  Lower Recall
Applicable Scenario
  Server-side Anti-Spam
Future Work
  MapReduce
  Other Syntactic Features
Thank you!