SVM based Spam Filtering in SEWM2007
Pan Weike, Lu Guanzhong, Xu Congfu
College of Computer Science, Zhejiang University
March 11, 2007
Chinese Anti-spam Framework
corpus (training/testing data)
→ pre-processing: charset decoder, charset translation (e.g. BIG5 → GB2312), HTML parser
→ subject, body, etc.
→ feature extraction: tokenization (Tianwang Chinese segmentation); statistics → term list (without TF, CHI, IG, etc.); Vector Space Model (TF*IDF, subject:body = different weights)
→ Support Vector Regression (libSVM)
→ reorganization score (spam)
Outline
Email Pre-processing
Feature Extraction
Support Vector Regression
Email Pre-processing
Original Mails → Processed Mail, through these stages:
Raw Process: decode subject and content; extract charset (charset file)
Html Parse: delete HTML tags, scripts, and CSS
Charset Change: convert GBK and Big5 to GB2312
Extract Information: extract subject and body
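The Html Parse stage can be sketched with Python's standard `html.parser` module. This is only an illustration of removing tags, scripts, and CSS; the slides do not say which parser the authors used, and the names `TextExtractor` and `strip_html` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the whitespace runs left behind by removed tags.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Note that this sketch joins text chunks with a space, so inline markup inside a Chinese phrase would introduce a space at the tag boundary; whether that matters depends on the segmenter used afterwards.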
Some problems:
An email may contain more than two charset types.
The charset information of some emails is missing.
An efficient approach to obtaining accurate charset information for each email is needed.
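One way to cope with mixed or missing charset information is to decode each header and body part separately, trying the declared charset first and then a fixed fallback list. The sketch below uses Python's standard `email` package. The fallback order in `FALLBACK_CHARSETS` and the helper names are assumptions for illustration, not the authors' method, and a wrong fallback can still decode without error and yield garbled text.

```python
import email
from email.header import decode_header

# Assumed fallback order; the slides do not specify one.
FALLBACK_CHARSETS = ("gb18030", "big5", "utf-8")


def decode_bytes(raw, declared=None):
    """Decode bytes with the declared charset first, then the fallbacks."""
    candidates = ([declared] if declared else []) + list(FALLBACK_CHARSETS)
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown charset name or undecodable bytes
    return raw.decode("utf-8", errors="replace")


def decode_subject(msg):
    """Decode every encoded word of the Subject header independently."""
    parts = []
    for value, charset in decode_header(msg.get("Subject", "")):
        if isinstance(value, bytes):
            parts.append(decode_bytes(value, charset))
        else:
            parts.append(value)
    return "".join(parts)


def decode_body(msg):
    """Decode each text part with its own charset, so parts may differ."""
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        texts.append(decode_bytes(payload, part.get_content_charset()))
    return "\n".join(texts)
```

A message would be loaded with `email.message_from_bytes(raw)` and then passed to `decode_subject` and `decode_body`.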
Feature Extraction
Tokenization: Tianwang Chinese segmentation algorithm, http://net.pku.edu.cn/~webg/src/ChSeg/
Without feature selection (TF, CHI, IG, etc.)
VSM: TF*IDF, subject:body = 3:1
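The TF*IDF weighting with a 3:1 subject-to-body ratio could look roughly like the sketch below. It assumes the ratio is applied to term counts (a subject occurrence counts three times a body occurrence) and that IDF is log(N/df); the slides state neither detail, and the function names are hypothetical. Tokens are assumed to come from the segmenter already.

```python
import math
from collections import Counter

SUBJECT_WEIGHT = 3.0  # subject:body = 3:1, as stated on the slide
BODY_WEIGHT = 1.0


def weighted_tf(subject_tokens, body_tokens):
    """Term frequency where subject occurrences count three times."""
    tf = Counter()
    for token in subject_tokens:
        tf[token] += SUBJECT_WEIGHT
    for token in body_tokens:
        tf[token] += BODY_WEIGHT
    return tf


def build_idf(docs_tf):
    """IDF as log(N / document frequency); an assumed variant."""
    n_docs = len(docs_tf)
    df = Counter()
    for tf in docs_tf:
        df.update(tf.keys())  # each document counts a term once
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vector(tf, idf):
    """Sparse TF*IDF vector; terms unseen in training are dropped."""
    return {term: weight * idf[term] for term, weight in tf.items() if term in idf}
```

For example, `weighted_tf(["免费"], ["免费", "发票"])` gives "免费" a count of 4.0 and "发票" a count of 1.0 before the IDF factor is applied.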
Support Vector Regression
SVR toolbox: libSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/
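libSVM's command-line tools read training data as text lines of the form `<target> <index>:<value> ...` with feature indices in ascending order. The sketch below writes sparse term vectors in that format. The helper names are hypothetical, and the choice of regression targets (for instance 1 for spam and 0 for non-spam) is an assumption; the slides do not say which targets were used.

```python
def build_term_index(vectors):
    """Assign each term a 1-based feature index, in sorted term order."""
    terms = sorted({term for vec in vectors for term in vec})
    return {term: i for i, term in enumerate(terms, start=1)}


def to_libsvm_line(target, vec, term_index):
    """Format one example as '<target> <index>:<value> ...'."""
    pairs = sorted(
        (term_index[term], value)
        for term, value in vec.items()
        if term in term_index and value != 0  # sparse format omits zeros
    )
    features = " ".join(f"{i}:{v:.6g}" for i, v in pairs)
    return f"{target:g} {features}".rstrip()


def write_libsvm_file(path, targets, vectors, term_index):
    """Write one line per example to a libSVM-format text file."""
    with open(path, "w", encoding="ascii") as handle:
        for target, vec in zip(targets, vectors):
            handle.write(to_libsvm_line(target, vec, term_index) + "\n")
```

A file written this way can be passed to libSVM's `svm-train` with `-s 3`, which selects epsilon-SVR; the remaining parameters would need tuning on the training corpus.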
Thanks for your attention!