Naïve Bayesian Anti-Spam based on Shallow Syntactic Parsing

Tao He [email protected]
Software Engineering Laboratory, Department of Computer Science
Sun Yat-Sen University
April 22, 2010
About 40 minutes

Semantic Parsing in Bayesian Anti Spam



Page 2: Semantic Parsing in Bayesian Anti Spam

Outline

  • Background
  • Naïve Bayesian Anti-Spam
  • Shallow Syntactic Parsing
  • My Approach
  • Evaluation
  • Future Work

Page 3: Semantic Parsing in Bayesian Anti Spam

Background

Junk E-mail (Spam)
  • Wastes users' time
  • Fills up file-server storage space
  • Disrupts a company's daily work

Definition
  • Unwanted
  • Commercial
  • Depends on the user's own judgment

Page 4: Semantic Parsing in Bayesian Anti Spam

Naïve Bayesian Anti-Spam

[Figure: workflow diagram with the labels Data Sets, Scan, Parse, Token Sets, Classifier, Judge, and User Judge & Update]

Page 5: Semantic Parsing in Bayesian Anti Spam

Naïve Bayesian Anti-Spam (cont')

An automated method [1] that
  • learns from the data in a user's mail repository
  • adapts to changes over time
  • is personalized to one user's mail

[1] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian Approach to Filtering Junk E-mail. In: Proc. of the AAAI Workshop on Learning for Text Categorization, 1998, pp. 55-62.

Page 6: Semantic Parsing in Bayesian Anti Spam

Naïve Bayesian Anti-Spam (cont')

Hypothesis
  • Independence assumption

[Figure: two network diagrams, labeled "Naïve Bayesian Network" and "A more complex Bayesian Network"]

Page 7: Semantic Parsing in Bayesian Anti Spam

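Naïve Bayesian Anti-Spam (cont')

Learning Procedure

\[
P(C_{spam} \mid X_t) = \frac{P(C_{spam})\,P(X_t \mid C_{spam})}{\sum_{i=1}^{|C|} P(C_i)\,P(X_t \mid C_i)}
\]

Judge Procedure

\[
P(C_{spam} \mid X_1, X_2, \ldots, X_n) = \frac{\prod_{i=1}^{n} P(C_{spam} \mid X_i)}{\prod_{i=1}^{n} P(C_{spam} \mid X_i) + \prod_{i=1}^{n} \bigl(1 - P(C_{spam} \mid X_i)\bigr)}
\]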
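A short derivation may help connect the two procedures. It is only a sketch under two assumptions that the slides do not state explicitly: the tokens are conditionally independent given the class, and the two classes have equal priors. The symbols C_ham (the non-spam class) and p_i = P(C_spam | X_i) are shorthand introduced here for illustration.

```latex
\begin{align*}
P(C_{spam} \mid X_1,\ldots,X_n)
  &= \frac{P(C_{spam}) \prod_{i=1}^{n} P(X_i \mid C_{spam})}
          {P(C_{spam}) \prod_{i=1}^{n} P(X_i \mid C_{spam})
           + P(C_{ham}) \prod_{i=1}^{n} P(X_i \mid C_{ham})} \\
  % Bayes' rule per token: P(X_i | C) = P(C | X_i) P(X_i) / P(C),
  % with P(C_ham | X_i) = 1 - p_i; the factors P(X_i) cancel.
  &= \frac{P(C_{spam})^{1-n} \prod_{i=1}^{n} p_i}
          {P(C_{spam})^{1-n} \prod_{i=1}^{n} p_i
           + P(C_{ham})^{1-n} \prod_{i=1}^{n} (1 - p_i)} \\
  % Equal priors: P(C_spam) = P(C_ham), so the prior factors cancel too.
  &= \frac{\prod_{i=1}^{n} p_i}{\prod_{i=1}^{n} p_i + \prod_{i=1}^{n} (1 - p_i)}
\end{align*}
```

If the priors are not equal, the prior factors in the middle line do not cancel and have to be kept.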

Page 8: Semantic Parsing in Bayesian Anti Spam

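An Example

Spam Data Set: 法轮功 ("Falun Gong")
  • Token counts: 法:1, 轮:1, 功:1
  • P(法) = 0.3, P(轮) = 0.3, P(功) = 0.3

Ham Data Set: 法律 ("law")
  • Token counts: 法:1, 律:1
  • P(法) = 0.5, P(律) = 0.5

Per-token spam probabilities
  • P(Spam | 法) = 0.3 / (0.3 + 0.5) = 0.375
  • P(Spam | 轮) = 0.3 / (0.3 + 0) = 1
  • P(Spam | 功) = 0.3 / (0.3 + 0) = 1
  • P(Spam | 律) = 0 / (0 + 0.5) = 0

A New Mail: 功律

P(Spam | 功, 律) = P(Spam | 功) * P(Spam | 律) / [P(Spam | 功) * P(Spam | 律) + (1 - P(Spam | 功)) * (1 - P(Spam | 律))] = 0

(The slide reports 0. With P(Spam | 功) = 1 and P(Spam | 律) = 0, both the numerator and the denominator are 0, so the value is strictly undefined unless the estimates are smoothed.)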
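The example above can be reproduced with a few lines of Python. This is a minimal sketch, not the presenter's implementation: it assumes character-level tokens, equal class priors, a neutral score of 0.5 for unseen tokens, and clamping of per-token probabilities to [0.01, 0.99]. None of these choices appear on the slides, and the function names are hypothetical. The sketch also uses exact frequencies, so 法 comes out as 0.4 rather than the 0.375 obtained from the slide's rounded 0.3.

```python
from collections import Counter


def token_probs(mails):
    """Relative frequency of each character token within one data set."""
    counts = Counter(ch for mail in mails for ch in mail)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def spam_given_token(tok, p_spam, p_ham):
    """P(Spam | token), assuming equal priors for spam and ham."""
    s = p_spam.get(tok, 0.0)
    h = p_ham.get(tok, 0.0)
    if s + h == 0.0:
        return 0.5  # unseen token: treated as neutral (assumption)
    return s / (s + h)


def combine(probs, lo=0.01, hi=0.99):
    """Combine per-token probabilities with the judge formula.

    Clamping to [lo, hi] is an assumption added here so that the
    denominator can never be zero.
    """
    spam_score = 1.0
    ham_score = 1.0
    for p in probs:
        p = min(max(p, lo), hi)
        spam_score *= p
        ham_score *= 1.0 - p
    return spam_score / (spam_score + ham_score)


p_spam = token_probs(["法轮功"])  # spam data set from the slide
p_ham = token_probs(["法律"])     # ham data set from the slide
new_mail = "功律"
per_token = [spam_given_token(t, p_spam, p_ham) for t in new_mail]
score = combine(per_token)
print(per_token, score)
```

With the clamp in place, one strongly spam token and one strongly ham token cancel out, so the new mail scores about 0.5 instead of hitting a division by zero.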

Page 9: Semantic Parsing in Bayesian Anti Spam

Shallow Syntactic Parsing

  • Syntactic features [2]: POS, chunk
  • Dependency relations [3]: predicate-argument structure
  • Named entities
  • WordNet senses
  • Class-specific related words

[2] X. Li, D. Roth, and K. Small. The Role of Semantic Information in Learning Question Classifiers. Proceedings of the International Joint Conference, 2004.
[3] K. Hacioglu. Semantic Role Labeling Using Dependency Trees. Proceedings of the 20th International Conference on Computational Linguistics, 2004.

Page 10: Semantic Parsing in Bayesian Anti Spam

Shallow Syntactic Parsing

[Figure: an example of dependency relations]

Page 11: Semantic Parsing in Bayesian Anti Spam

My Approach: Motivation

  • Raise precision (reduce the number of ham mails misjudged as spam)
  • General parsing is inefficient
  • Few attempts have been made to study syntactic information in the context of classification [2]

Domain-Specific Properties
  • Phrases (e.g., 出售发票, "invoices for sale") [1]
  • Dependency relations [3]

Page 12: Semantic Parsing in Bayesian Anti Spam

My Approach (cont')

Feature Space
  • Words
  • Phrases
  • Dependency relations within a sentence

Page 13: Semantic Parsing in Bayesian Anti Spam

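My Approach (cont')

  1. Read a token.
  2. Determine the token's part of speech.
  3. If it is a verb:
       • save feature (current verb, nearest noun)
       • save feature (adjacent verb, current verb)
       • save feature (adjacent adverb, current verb)
     If it is a noun:
       • save feature (nearest verb, current noun)
       • save feature (adjacent adjective, current noun)
     Otherwise:
       • save feature (current word)
  4. Update the record of previous words.
  5. Read the next word and repeat.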

Page 14: Semantic Parsing in Bayesian Anti Spam

My Approach (cont'): An Example

Origin
  现有一部分普通发票代开 ("We currently have some ordinary invoices that can be issued on your behalf")

Token
  现/tg 有/vyou 一部分/m 普通/a 发票/n 代开/vt

Naïve Features
  (现/tg), (有/vyou), (一部分/m), (普通/a), (发票/n), (代开/vt)

Syntactic Features
  (有/vyou, 发票/n), (普通/a, 发票/n), (发票/n, 代开/vt)
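The pairing rules from the flowchart can be sketched as a single pass over a tagged sentence. This is an illustrative reading, not the presenter's code. It assumes that POS tags starting with `v`, `n`, `a` and `d` mark verbs, nouns, adjectives and adverbs (the `d` prefix does not occur in the example and is a guess), that "adjacent" means the immediately preceding token, that "nearest" means the most recent one seen, and that pairs are emitted in text order as in the example above. The function name is hypothetical.

```python
def syntactic_features(tagged):
    """Return pair features for one sentence of (word, pos) tuples."""
    features = []
    prev = None        # immediately preceding token ("adjacent")
    last_verb = None   # most recent verb seen so far ("nearest verb")
    last_noun = None   # most recent noun seen so far ("nearest noun")

    def label(tok):
        return f"{tok[0]}/{tok[1]}"

    for cur in tagged:
        pos = cur[1]
        if pos.startswith("v"):
            # Verb: pair with the nearest noun, then with an adjacent
            # verb or adverb if there is one.
            if last_noun is not None:
                features.append((label(last_noun), label(cur)))
            if prev is not None and prev[1].startswith(("v", "d")):
                features.append((label(prev), label(cur)))
            last_verb = cur
        elif pos.startswith("n"):
            # Noun: pair with the nearest verb, then with an adjacent
            # adjective if there is one.
            if last_verb is not None:
                features.append((label(last_verb), label(cur)))
            if prev is not None and prev[1].startswith("a"):
                features.append((label(prev), label(cur)))
            last_noun = cur
        # Refresh the record of past words before reading the next one.
        prev = cur
    return features


sentence = [("现", "tg"), ("有", "vyou"), ("一部分", "m"),
            ("普通", "a"), ("发票", "n"), ("代开", "vt")]
pairs = syntactic_features(sentence)
print(pairs)
```

Run on the tokenized sentence from the slide, the sketch yields the same three pairs listed under Syntactic Features. The single-word (naïve) features are simply the tokens themselves and are not repeated here.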

Page 15: Semantic Parsing in Bayesian Anti Spam

Evaluation

              Fact Spam    Fact Ham
Judge Spam    A            B
Judge Ham     C            D

Recall:

\[
R = \frac{A}{A + C} = \frac{A}{N_s}
\]

Precision:

\[
P = \frac{A}{A + B}
\]

Weighted Error Rate [4]:

\[
WErr = \frac{\lambda B + C}{\lambda N_l + N_s}, \quad \text{where } \lambda = 999
\]

N_s = A + C, N_l = B + D

[4] I. Androutsopoulos, J. Koutsias, and K. V. Chandrinos. An Evaluation of Naïve Bayesian Anti-Spam Filtering. Proc. of the workshop, 2000.

Page 16: Semantic Parsing in Bayesian Anti Spam

Evaluation (cont')

Syntactic Parsing:

              Fact Spam    Fact Ham
Judge Spam    A = 11647    B = 0
Judge Ham     C = 3662     D = 4043

Naïve Parsing:

              Fact Spam    Fact Ham
Judge Spam    A = 14378    B = 13
Judge Ham     C = 930      D = 4042

                              Syntactic Parsing    Naïve Parsing
Recall                        76.08%               93.9%
Precision                     100%                 99.91%
Weighted Error Rate           0.09%                0.34%
Time Cost (s / 1000 mails)    3.81                 2.26
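The figures in the summary table can be recomputed from the two confusion matrices. The sketch below assumes the weighted error rate takes the form WErr = (λB + C) / (λN_l + N_s) with λ = 999; the function name `evaluate` is hypothetical.

```python
def evaluate(a, b, c, d, lam=999):
    """Recall, precision and weighted error rate from a confusion matrix.

    a: spam judged spam, b: ham judged spam,
    c: spam judged ham,  d: ham judged ham.
    """
    n_s = a + c  # number of actual spam mails
    n_l = b + d  # number of actual ham (legitimate) mails
    recall = a / n_s
    precision = a / (a + b)
    werr = (lam * b + c) / (lam * n_l + n_s)
    return recall, precision, werr


syntactic = evaluate(11647, 0, 3662, 4043)
naive = evaluate(14378, 13, 930, 4042)

for name, (r, p, w) in [("Syntactic", syntactic), ("Naive", naive)]:
    print(f"{name}: recall={r:.2%} precision={p:.2%} werr={w:.2%}")
```

Because λ weights each ham mail misjudged as spam 999 times as heavily as a missed spam, the 13 false positives of naïve parsing dominate its weighted error rate even though its recall is higher.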

Page 17: Semantic Parsing in Bayesian Anti Spam

Evaluation (cont')

Advantage
  • Higher precision
  • Acceptable speed
  • Lower weighted error rate

Disadvantage
  • Lower recall

Applicable Setting
  • Server-side anti-spam

Page 18: Semantic Parsing in Bayesian Anti Spam

Future Work

  • MapReduce
  • Other syntactic features

Page 19: Semantic Parsing in Bayesian Anti Spam


Thank you!