[IEEE 2007 IEEE International Conference on Research, Innovation and Vision for the Future - Hanoi, Vietnam (2007.03.5-2007.03.9)] 2007 IEEE International Conference on Research, Innovation

An Efficient Method for Filtering Image-Based Spam E-mail

Ngo Phuong Nhung, Tu Minh Phuong

Posts and Telecommunications Institute of Technology, Vietnam

maxnhong85@gmail.com, phuongtm@fpt.com.vn

1-4244-0695-1/07/$25.00 ©2007 IEEE

Abstract - Spam e-mail with advertisement text embedded in images presents a great challenge to anti-spam filters. In this paper, we describe a fast method to detect image-based spam e-mail. Using simple edge-based features, the method computes a vector of similarity scores between an image and a set of templates. This similarity vector is then used with support vector machines to separate spam images from other common categories of images. Our method does not require computationally expensive OCR or even text extraction from images. Empirical results show that the method is fast and has good classification accuracy.

Keywords - Image-based spam, anti-spam filtering, SVM, classification

I. INTRODUCTION

The increasing number of Internet users and the low cost of e-mail make this form of communication very attractive for direct marketers. With the availability of enormous e-mail address lists, it is possible for advertisers on the Internet to send messages to millions of recipients at essentially no cost. As a consequence, the volume of unsolicited commercial e-mail ("spam") has grown tremendously in the past few years. The vast amount of spam being sent has been considered a serious problem threatening the utility of e-mail and the Internet in general.

In addressing this growing problem, many solutions to spam reduction have been proposed. Among these solutions are automated methods for filtering spam from legitimate e-mail. Using hand-crafted rules or machine learning techniques, anti-spam filters analyze the text content of e-mail to detect spam. Some anti-spam filters using machine learning algorithms like Naive Bayes or Support Vector Machines (SVM) are reported to achieve spam detection accuracy of up to 99% [9].

To circumvent such systems, spammers have invented many techniques. One example is to embed advertising text in images sent with spam. While the contents of such messages are normally viewed by spam receivers, they are shielded from text-based anti-spam filters. By some estimates, up to 25% of spam being sent today contains imagery and this number is expected to increase [1]. Therefore, it is desirable to develop systems that can detect and filter image-based spam.

A possible way to detect image-based spam is to use a pipeline of an optical character recognition (OCR) system, which extracts and recognizes embedded text, followed by a text classifier that separates advertising text from legitimate content. While this solution promises to detect spam with a certain level of accuracy, the existing OCR algorithms are computationally intensive and thus cannot operate on heavily loaded e-mail servers.

To overcome the disadvantages of OCR-based solutions, in a recent paper, Aradhye et al. [1] described an image-based anti-spam filter that does not require full text recognition. Their method starts by extracting regions with overlaid text from images. Based on the text regions and other image elements, the method creates several simple features that are indicative of spam. Images represented by the extracted features are then classified into spam and non-spam using SVM. Since the method does not include the text recognition step, it is much faster than systems with full OCR. However, the extraction of text regions is nontrivial and still requires considerable computational resources.

In this paper, we propose a fast method to detect spam images. The proposed method does not try to extract embedded text from an image. Instead, it uses an edge-based feature vector, which can be computed efficiently, to represent major shape properties of the image. Since most spam images contain large proportions of text (Figure 1), they should have shape representations similar to those of other text-intensive images. Our method uses the edge-based feature to compute a vector of similarity measures from an image to a small set of gold standards - images with different proportions of overlaid text. These similarity vectors then serve as input for SVM training and classification. The method is fast because it does not use computationally expensive image processing and text recognition steps. On a collection of images, the method separates spam images from non-spam ones with high accuracy.

Figure 1. Examples of spam images: (a) image containing only text; (b) image with photographic elements.

Figure 2. Examples of non-spam images: (a) natural scene photo; (b) greetings e-card.

Related work.

The problem of image-based spam detection is a special case of image categorization, which has been extensively studied in the context of many important applications. Depending on application requirements and the nature of images to categorize, generic image categorization methods can use different image features or their combinations to distinguish between two or more classes. In a work by Hu and Bagga [3], the authors relied on the correlation between image functional categories and several features, such as whether the images are graphic or photographic and whether the images have text elements. They used frequency domain analysis of image intensity and DCT coefficients to decide about the presence of these features. The learning and classification steps were done with SVMs. Gavilan et al. [2] represented an image in terms of blobs - image regions lighter or darker than the background - which can be extracted by color quantization. The blob representations of the images are then used to train neural networks to distinguish among natural, artificial, portrait, or text images. Another interesting application is indoor vs. outdoor classification [10].

Besides spam images, many other categories of images and videos contain text. Since overlaid text contains important information about image content, the extraction of text blocks and recognition of their content have attracted research interest in the image processing and multimedia analysis communities. Previous work on text extraction used different characteristics of text regions such as their color [6], the frequent occurrence of vertical edges, or wavelets and spatial variance of texture [13] to locate blocks of text.

A recent work on image-based spam categorization [1] makes use of text extraction to create image features. In combination with SVM learning, this method is reported to achieve 80% and higher accuracy in detecting spam while not requiring text content recognition.

II. ALGORITHM

Our proposed spam detection method uses SVM, a discriminative learning algorithm, combined with a vector representation of images to distinguish between spam and non-spam images. The algorithm consists of three steps outlined next.

1) Feature extraction and normalization using Edge Directions (ED) or Edge Orientation Autocorrelogram (EOAC). This step summarizes shape properties of an image in terms of edge orientation and correlation. Depending on which feature is used, the result of this step is an edge direction histogram or an EOAC matrix.

2) Calculation of similarity scores between the image and a small set of templates or sample images containing only text. This step allows representing the image as a vector of similarity scores with respect to the templates.

3) Training and classification with SVM. The vector representations of images as computed in step 2 are used to train SVM and subsequently to classify each new image as spam or non-spam.

In the following sections, we provide a detailed description for each of these steps.

A. Edge directions and edge orientation autocorrelogram

In image categorization and image retrieval applications, it is important to choose appropriate features to represent images. In recent years, researchers have proposed a number of image features, each of which is good at characterizing certain aspects of images. Since spam images are text intensive, and text elements have special shape characteristics which make them different from background scenes or other elements, it is desirable to use features able to capture such characteristics. At the same time, for the method to be practical, a feature of choice must be fast to compute.

Although other features like color-based features, texture-based features, blobs, and graphic-photographic features have been used with success for other image categorization tasks [2,3,8,10], edge-based features are very indicative of text-intensive spam images while remaining simple to compute. In this work, we have chosen two edge-based features, namely Edge Directions (ED) [4,5] and Edge Orientation Autocorrelogram (EOAC) [7]. The ED feature is simply computed as a histogram of edge angles and summarizes global shape information. EOAC is an extension of ED which provides a proper way to express correlation between text elements over small distances, and is inexpensive to compute at the same time.

For the self-completeness of the paper, we give a brief description of ED and EOAC; the reader is referred to the original papers [4,7] for more details. Note that the ED histogram of an image can be obtained as an intermediate result when computing the image's EOAC.

The ED histogram of an image is computed in three steps, and the EOAC is computed by adding two more steps as follows.

1. Edge detection: The Sobel operator is used to generate two edge components $G_x$ and $G_y$, from which edge amplitude and edge orientation are computed as follows:

$|G| = \sqrt{G_x^2 + G_y^2}$

$\angle G = \tan^{-1}(G_y / G_x)$

To perform this step we first transformed color images to gray-scale. (Footnote 1: In the original version of ED, edges are detected by using the Canny operator.)

2. Finding prominent edges: only edges with amplitude higher than a predefined threshold $T_1$ are extracted. In our experiment we used $T_1 = 25$ as in the original paper.

3. Edge orientation quantization: Edges are quantized into k segments $\angle G_1, \angle G_2, \ldots, \angle G_k$; each segment is equal to five degrees. The result of this step is the ED histogram.

4. Determining distance set: This step constructs a distance set D, the members of which are the distances from the current edge. This set is used to compute the correlations in the next step. We used the set of four members as in the original paper, $D = \{1, 3, 5, 7\}$.

5. Computing EOAC matrix: The EOAC matrix is a two-dimensional array with k rows and |D| columns. The <i, j> element of this matrix contains the number of similar edges with the orientation $\angle G_i$ that are separated by the j-th distance in D. Two edges are defined to be similar if the absolute differences of their orientations and of their amplitudes are less than an angle threshold and an amplitude threshold, respectively.

Figure 3 shows three images [(a), (b), and (c)], their respective EOAC graphs [(d), (e), and (f)], and their ED histograms [(g), (h), and (i)].

Since the positions of objects in an image have no effect on edge amplitude and orientation, the ED and EOAC features are translation invariant. This is a desired characteristic in our case because text can be located anywhere within an image. To also achieve scaling invariance, the ED histogram and the EOAC matrix should be normalized. Specifically, the ED histogram is normalized with respect to the number of edge points of the image, and the EOAC matrix is normalized by dividing the population of each EOAC bin by the sum of the populations of all EOAC bins.

B. Calculation of similarity scores

If the proportion of overlaid text within an image is large enough, the contribution of the text to the image's ED and EOAC will dominate that of the other image elements. As a consequence, two images containing large amounts of text tend to share similarities in their shape representations. The algorithm proposed herein exploits this observation to distinguish text-intensive images from others. Specifically, in this step, the algorithm computes similarity scores of the image with respect to a small set of n sample images or templates. Here we define a template as a specially constructed image that contains only text. The proportions of text as well as text characteristics are chosen to vary among different templates so that the set covers a large variety of text overlaid in images.

Figure 3 shows a non-spam image (a), a spam image (b), and a template (c) as well as their respective EOAC graphs [(d)-(f)] and ED histograms [(g)-(i)]. As shown in the figure, the EOAC and ED representations of the spam image share some similarities with those of the template while the representations of the non-spam image look very different.
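The five feature-extraction steps of Section II.A can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the 360-degree orientation range (72 bins of five degrees), the restriction of neighbor pairs to horizontal and vertical offsets, and the `angle_tol` and `amp_tol` threshold values are all assumptions made for illustration; the paper does not state them.

```python
import math

def sobel(gray, r, c):
    """3x3 Sobel response (gx, gy) at pixel (r, c); borders are clamped."""
    h, w = len(gray), len(gray[0])

    def px(y, x):
        return gray[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    gx = (px(r - 1, c + 1) + 2 * px(r, c + 1) + px(r + 1, c + 1)
          - px(r - 1, c - 1) - 2 * px(r, c - 1) - px(r + 1, c - 1))
    gy = (px(r + 1, c - 1) + 2 * px(r + 1, c) + px(r + 1, c + 1)
          - px(r - 1, c - 1) - 2 * px(r - 1, c) - px(r - 1, c + 1))
    return gx, gy

def edge_features(gray, t1=25.0, bin_deg=5.0, distances=(1, 3, 5, 7),
                  angle_tol=5.0, amp_tol=50.0):
    """Return (ed, eoac): normalized ED histogram and EOAC matrix.

    gray is a list of rows of gray-scale intensities.
    angle_tol and amp_tol are hypothetical similarity thresholds.
    """
    h, w = len(gray), len(gray[0])
    k = int(round(360.0 / bin_deg))  # assumed full 360-degree range
    edges = {}  # (r, c) -> (amplitude, angle in degrees, orientation bin)
    for r in range(h):
        for c in range(w):
            gx, gy = sobel(gray, r, c)          # step 1: edge detection
            amp = math.hypot(gx, gy)
            if amp > t1:                        # step 2: prominent edges only
                ang = math.degrees(math.atan2(gy, gx)) % 360.0
                edges[(r, c)] = (amp, ang, min(int(ang // bin_deg), k - 1))

    # Step 3: ED histogram, normalized by the number of edge points.
    ed = [0.0] * k
    for _, _, b in edges.values():
        ed[b] += 1.0
    if edges:
        ed = [v / len(edges) for v in ed]

    # Steps 4-5: EOAC matrix with k rows and len(distances) columns.
    eoac = [[0.0] * len(distances) for _ in range(k)]
    for (r, c), (amp, ang, b) in edges.items():
        for j, d in enumerate(distances):
            # Assumption: only horizontal and vertical neighbors at distance d.
            for other in ((r, c + d), (r + d, c)):
                if other not in edges:
                    continue
                amp2, ang2, _ = edges[other]
                diff = abs(ang - ang2)
                diff = min(diff, 360.0 - diff)  # circular angle difference
                if diff < angle_tol and abs(amp - amp2) < amp_tol:
                    eoac[b][j] += 1.0

    # Normalize each bin by the sum of all bins.
    total = sum(map(sum, eoac))
    if total > 0:
        eoac = [[v / total for v in row] for row in eoac]
    return ed, eoac
```

For an image consisting of a single vertical step edge, all edge points fall into the first orientation bin, and moving the edge sideways leaves both features unchanged, which matches the translation invariance noted above.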


Figure 3. An example of shape representation using EOAC and ED: (a)-(c) show three images, where (a) is non-spam, (b) is spam, and (c) is a template; (d)-(f) show their respective EOAC graphs; and (g)-(i) show their respective ED histograms.
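The similarity step of Section II.B can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, features are assumed to be stored as lists of rows (an ED histogram can be passed as a single row), and the scaling to unit length follows the normalization reported for SVM training in the experiments section.

```python
import math

def l1_distance(x, y):
    """L1 distance between two equally shaped matrices given as lists of rows."""
    return sum(abs(a - b) for rx, ry in zip(x, y) for a, b in zip(rx, ry))

def similarity_vector(image_feature, template_features):
    """Map an image to one L1 score per template, scaled to unit length."""
    scores = [l1_distance(image_feature, t) for t in template_features]
    norm = math.sqrt(sum(s * s for s in scores))
    # An all-zero score vector is returned unchanged to avoid division by zero.
    return [s / norm for s in scores] if norm > 0 else scores
```

With n templates the result has length n, so the size of the SVM input depends only on the number of templates and not on the image size.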

To measure the similarity between an image with EOAC X and a template with EOAC Y we use the L1 distance computed as follows:

$L_1(X, Y) = \sum_{i=1}^{k} \sum_{j=1}^{d} |X_{ij} - Y_{ij}|$

where k is the number of orientation segments and d = |D| is the number of distances. For the ED feature, the similarity is computed similarly.

At the end of this step, each image is represented by a vector of length n, the elements of which are the similarity scores with respect to the templates. In the machine learning community, the idea of representing an object via its similarity to a set of other objects is known as the empirical feature map [11]. The merit of the empirical feature map is that it provides a general way to map from similarity scores to a vector representation, from which a proper kernel can be constructed to use with support vector machines.

C. Support vector machine learning and classification

With the vector of similarity scores computed above, we train SVM - a machine learning algorithm that shows good performance in a number of classification tasks - to differentiate between the two classes: spam and non-spam. The SVM algorithm relies on two main ideas. First, the algorithm maps the given training sets of n-dimensional vectors with positive and negative examples into a (possibly) high-dimensional feature space. Then, in the feature space, the algorithm seeks to locate a hyperplane with two properties: 1) it separates the positive from the negative examples; 2) it maintains a maximum margin from any example in the training set. The criterion of choosing the hyperplane with maximum margin contributes most to the power of SVM. It has been shown by theory and in practice that such a hyperplane gives the best generalization when classifying unseen examples [12]. Having found such a hyperplane, the SVM predicts the label of a new example by mapping it into the feature space and determining on which side of the hyperplane the example is located. The algorithm can also be extended to deal with outliers in the training set and with multi-class classification.

The mapping from the input space to the feature space is done by using a so-called kernel function. The SVM can take as input vectors or kernel matrices with pre-computed values of the kernel function for each pair of examples. In this work, we used similarity vectors as input to the SVM and tried several kernel functions defined on input vectors.

III. EXPERIMENTS AND RESULTS

Dataset. Unlike the situation with text-based anti-spam filtering experiments, for which a number of public benchmark datasets are available, to the best of our knowledge there is no public benchmark dataset for image-based spam. To create a dataset for our experiments, we collected images from spam messages arriving at an e-mail server. We used images from a spam message only if the message does not contain text in its body and hence the image content provides the major source of information to make the spam/non-spam decision. Images that exist only for formatting or for purposes that are irrelevant to message content were not included.

The spam part of the dataset contains 411 images; about half of them contain only text without complex background (Figure 1a). The other images contain graphic and photographic elements with different levels of complexity as in Figure 1b (the level of complexity here is defined as the number of edges in a unit area).

The non-spam images in our dataset were collected from several sources. We asked our friends and colleagues to donate e-cards they received over e-mail. The e-card collection was augmented by free e-cards downloaded from different websites with [...] text. This resulted in 28[...] images, all of which contain text. We further randomly selected 300 images from the CorelDraw collection. Finally, following the work presented in [1], another 723 images were collected by querying the Google-images search engine with keywords "nature photo", "portrait" and "baby" and [...] from what the engine returned.

Evaluation methodology. We assessed the method by using 10-fold cross-validation. The dataset was randomly divided into 10 folds of equal size so that the proportion of spam and non-spam images in each fold is (almost) the same. One fold is left as the test set and the other folds are used for training. The experiment was repeated 10 times with different folds being the test sets; the classification accuracy is calculated by averaging over the 10 runs.

In our evaluation, we used spam recall and non-spam recall defined as follows:

$\text{spam recall} = \frac{\#\text{ of spam correctly classified}}{\#\text{ of all spam}}$

$\text{non-spam recall} = \frac{\#\text{ of non-spam correctly classified}}{\#\text{ of all non-spam}}$

These are two measures that were used in the TREC 2005 Spam Track: spam recall is the proportion of spam messages blocked by the filter (complement of spam misclassification rate), and non-spam recall is the proportion of legitimate messages that passed the filter (complement of ham misclassification rate). Note that in general, blocking a legitimate message is considered worse than allowing a spam message to pass the filter. Thus, obtaining high non-spam recall has higher priority than obtaining high spam recall.

SVM training. We used WEKA (www.cs.waikato.ac.nz/ml/weka) - a collection of open source machine learning software - to conduct our experiments. All the similarity vectors were normalized to unit length. In all the experiments we used SVM with soft margin and parameter C = 1. We tried different kernel functions and found that the linear kernel (which means no mapping) and the RBF kernel give the best results. In what follows, only results obtained with the linear kernel are reported.

Template set construction. An important step of the algorithm is constructing the set of templates (sample images). Since the algorithm makes sense only when an image contains a dominating amount of text, the templates were constructed so that text regions cover more than 50% of each template. Specifically, we used black-white templates with text areas covering from 50% to 90% of the whole image with an interval of 10%. Letters were chosen uniformly so that all the letters of the English alphabet appear with the same frequency. A more sophisticated way is to generate letters with the frequencies with which they appear in normal text, but in this work we did not use it. We tried several commonly used font families like Times New Roman, Arial, and Tahoma, and their italic and bold face variants. Since EOAC is invariant with respect to scaling, the choice of font size is not critical. In our experiment, we used font size = 10. To avoid unexpected effects when comparing images of different sizes, we used ten sets of templates, each of which consists of templates of the same size. The sizes were chosen from 80x60 to 800x600 pixels with an interval of 80x60. When computing the similarity vector for a given image, the template set with the size closest to the image's size is used.

Results. In the first experiment, we compared the performance of two versions of the method - with ED and EOAC used as image features. We also examined how the use of templates with different font families affected the spam detection accuracy. In addition to three fonts that are commonly used on the Internet, namely Times New Roman, Arial, and Tahoma, we used two other template sets constructed with Gothic and Lucida fonts. The spam recall and non-spam recall obtained when using ED and EOAC with the five fonts are summarized in Table 1.

TABLE 1
SPAM RECALL AND NON-SPAM RECALL FOR DIFFERENT EDGE-BASED FEATURES AND DIFFERENT FONT FAMILIES

Feature  Measure          Times New Roman  Tahoma  Arial  Gothic  Lucida
ED       Spam recall      0.73             0.75    0.74   0.74    0.79
ED       Non-spam recall  0.88             0.88    0.88   0.88    0.87
EOAC     Spam recall      0.80             0.83    0.80   0.79    0.86
EOAC     Non-spam recall  0.87             0.87    0.88   0.87    0.84

The results show that, on average, the two edge-based features have nearly the same non-spam recall, but EOAC gives significantly higher spam detection accuracy. At the same time, computing EOAC is more expensive than computing ED. On a PC with a Pentium IV and 512 MB RAM, the computation of EOAC and ED for 1000 images of size 800x600 takes 260 seconds and 125 seconds respectively.

As expected, the proposed method is not sensitive to the choice of font families used in the template construction step. Except for small fluctuations in the cases of Tahoma and Lucida, the different fonts give nearly the same classification accuracy. A possible explanation is that the small differences in shapes of the letters from different font families are smoothed during the edge orientation quantization step.

Figure 4. Spam and non-spam recalls for different image categories (horizontal axis: spam recall; legend: Nature photo, Portrait, E-cards).

The next experiment was designed to evaluate the performance of the method on different categories of non-spam images. The method was run with EOAC as the image feature and the template set constructed from the Tahoma font family. In Figure 4, non-spam recall for each image category is plotted against spam recall. The results show that the method can accurately distinguish spam images from "nature photo" images - both spam and non-spam recalls are higher than 93%. At the same time, e-cards proved to be the most difficult to distinguish from spam images. Images of this category always contain some amount of overlaid text, which makes them similar to spam.

IV. CONCLUSION

We have described a new method for detecting spam e-mail with content embedded in images. Given an image, our method first extracts an edge-based feature, which summarizes the global information of the image. It then computes a vector of similarity scores between the image and a set of templates containing only text. This vector representation of the image is used as input for support vector machine training and classification. The use of an edge-based feature allows capturing regularities in the shapes of text-intensive spam images without requiring costly computations. Empirical tests show that our method achieves overall accuracy of 80% and higher in classifying spam from different categories of images while remaining fast. Given the complexity of the problem, these results are encouraging and the proposed method can be used as a starting step for the construction of image-based anti-spam filters.

ACKNOWLEDGMENT

This work was supported by the Ministry of Science and Technology of Vietnam under a grant for fundamental research.

REFERENCES

[1] H. B. Aradhye, G. K. Myers, and J. A. Herson, "Image Analysis for Efficient Categorization of Image-based Spam E-mail", Proc. of the Eighth International Conference on Document Analysis and Recognition (ICDAR'05), Seoul, Korea 2005, pp. 914-918.
[2] D. Gavilan, H. Takahashi, and M. Nakajima, "Image Categorization Using Color Blobs in a Mobile Environment," Computer Graphics Forum (EG 2003), 22(3), 2003, pp. 427-432.
[3] J. Hu, and A. Bagga, "Categorizing Images in Web Documents," IEEE Multimedia, 11(1), 2004, pp. 22-30.
[4] A.K. Jain and A. Vailaya, "Image retrieval using color and shape", Pattern Recognition 29 (8) (1996) 1233-1244.
[5] A.K. Jain and A. Vailaya, "Shape-based retrieval: a case study with trademark image database", Pattern Recognition 31 (9) (1998) 1369-1390.
[6] R. Lienhart, and W. Effelsberg, "Automatic Text Segmentation and Text Recognition in Video Indexing," ACM/Springer Multimedia Systems, Vol. 8, 2000, pp. 69-81.
[7] F. Mahmoudi, J. Shanbehzadeh, A. Eftekhari-Moghadam, H. Soltanian-Zadeh, "Image retrieval based on shape similarity by edge orientation autocorrelogram", Pattern Recognition 36 (2003) pp. 1725-1736.
[8] W.K. Pratt, Digital image processing. 3rd edition. John Wiley & Sons 2001.
[9] M. Sahami, S. Dumais, D. Heckerman and E. Horvitz. "A Bayesian Approach to Filtering Junk E-Mail". Proceedings of AAAI-98 Workshop on Learning for Text Categorization, 1998.
[10] M. Szummer, and R.W. Picard, "Indoor-Outdoor Image Classification," Proc. IEEE Intl. Workshop on Content-Based Access of Image and Video Databases, 1998, pp. 42-51.
[11] K. Tsuda, "Support vector classification with asymmetric kernel function". Proc. of 7-th European symposium on Artificial Neural Networks, 1999, pp. 183-188.
[12] V.N. Vapnik. Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control, Wiley, New York, 1999.
[13] H. Li, D. Doermann, and O. Kia, "Automatic Text Detection and Tracking in Digital Video", IEEE Transactions on Image Processing, 9(1), 2000, pp. 147-156.
