SVM based Spam Filtering in SEWM2007

Pan Weike, Lu Guanzhong, Xu Congfu

panweike@zju.edu.cn, oillgz@gmail.com, xucongfu@zju.edu.cn

College of Computer Science, Zhejiang University

March 11, 2007


Chinese Anti-spam Framework

Corpus (training/testing data)

Pre-processing: charset decoder, HTML parser, charset translation (e.g. BIG5 -> GB2312), reorganization (subject, body, etc.)

Feature extraction: tokenization (Tianwang Chinese segmentation), statistics -> term list (without TF, CHI, IG, etc.), Vector Space Model (TF*IDF, subject:body = different weights)

Support Vector Regression (libSVM)

Score (spam)


Outline

Email Pre-processing

Feature Extraction

Support Vector Regression


Email Pre-processing

Original Mails

Raw Process: decode subject and content; extract charset (charset file)

HTML Parse: delete HTML tags, scripts, and CSS

Charset Change: convert GBK and Big5 to GB2312

Extract Information: extract subject and body

Processed Mail
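The HTML Parse step can be illustrated with a short sketch. It uses Python's standard `html.parser` module, which is an assumption made for illustration; the slides do not say which parser the authors used. The names `TextExtractor` and `strip_html` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(html_text):
    """Return the visible text of an HTML body with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Dropping tags but keeping their text, while discarding script and CSS content entirely, matches the "Delete HTML tag, script, CSS" box on the slide as far as it can be read.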

Some problems:

An email may contain more than 2 charset types.

The charset information of some emails is missing.

An efficient approach to obtain the accurate charset information of each email is needed.
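One simple way to cope with a missing or wrong charset label is to try the declared charset first and then fall back to a list of likely ones. The sketch below is an assumption for illustration, not the authors' method; the function name `decode_with_fallback` and the candidate order are hypothetical.

```python
def decode_with_fallback(raw, declared=None,
                         candidates=("gb2312", "gbk", "big5", "utf-8")):
    """Decode raw bytes, returning (text, charset_used).

    The declared charset (if any) is tried first, then each candidate in
    order. Strict decoding is only a rough heuristic: bytes in one Chinese
    encoding can decode without error under another and yield wrong text.
    """
    tried = []
    for name in ([declared] if declared else []) + list(candidates):
        if name in tried:
            continue
        tried.append(name)
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong charset for these bytes, or a charset name Python
            # does not know; move on to the next candidate.
            continue
    # Last resort (hypothetical choice): replace undecodable bytes.
    return raw.decode("gb2312", errors="replace"), "gb2312"
```

Because the result is a Python `str`, no separate GB2312 target encoding is needed in this sketch. An email with several charset types would need this applied to each part separately.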


Feature Extraction

Tokenization: Tianwang Chinese segmentation algorithm, http://net.pku.edu.cn/~webg/src/ChSeg/

Without feature selection: TF, CHI, IG, etc.

VSM: TF*IDF, subject:body = 3:1
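The VSM step can be sketched as follows, assuming the emails are already segmented into tokens. The slides do not give the exact formula, so two choices here are assumptions: the 3:1 ratio is applied by counting each subject token three times as heavily as a body token, and IDF is taken as log(N / df). The function names are hypothetical.

```python
import math
from collections import Counter


def weighted_tf(subject_tokens, body_tokens, subject_weight=3.0, body_weight=1.0):
    """Term frequencies in which subject tokens count 3x body tokens."""
    tf = Counter()
    for token in subject_tokens:
        tf[token] += subject_weight
    for token in body_tokens:
        tf[token] += body_weight
    return tf


def tfidf_vectors(docs):
    """docs is a list of (subject_tokens, body_tokens) pairs.

    Returns (vocab, vectors): vocab maps each term to a 1-based index,
    and each vector is a sparse dict {index: tf * idf}.
    """
    tfs = [weighted_tf(subject, body) for subject, body in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each document counts once per term
    vocab = {term: i + 1 for i, term in enumerate(sorted(df))}
    n_docs = len(docs)
    vectors = []
    for tf in tfs:
        vectors.append({vocab[t]: w * math.log(n_docs / df[t]) for t, w in tf.items()})
    return vocab, vectors
```

Every term is kept in the vocabulary, in line with the "without feature selection" note. The 1-based indices are chosen so the vectors can be written out directly in libSVM's sparse format.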


Support Vector Regression

SVR toolbox: libSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/
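libSVM's command-line tools read training data as sparse text lines of the form `<target> <index>:<value> ...`, with indices in ascending order. The sketch below writes TF*IDF vectors in that format. The target values (for example 1 for spam and 0 for ham) are an assumption, since the slides do not state them, and the helper names are hypothetical.

```python
def to_libsvm_line(target, vector):
    """Format one example as '<target> <index>:<value> ...'.

    vector is a sparse dict {index: value}; zero values are omitted and
    indices are emitted in ascending order.
    """
    features = " ".join(
        f"{index}:{value:.6g}"
        for index, value in sorted(vector.items())
        if value != 0
    )
    return f"{target:g} {features}".rstrip()


def write_libsvm_file(path, targets, vectors):
    """Write one line per email to a libSVM-format training file."""
    with open(path, "w", encoding="ascii") as handle:
        for target, vector in zip(targets, vectors):
            handle.write(to_libsvm_line(target, vector) + "\n")
```

A file written this way could then be passed to libSVM's `svm-train` with `-s 3`, which selects epsilon-SVR; the kernel and other parameters the authors used are not given in the slides.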


Thanks for your attention!