Identifying Image Spam Authorship with a Variable Bin-width Histogram-based Projective Clustering

Song Gao, Chengcui Zhang, Wei Bang Chen

Department of Computer and Information Sciences

The University of Alabama at Birmingham

{gaos, zhang, wbc0522}@cis.uab.edu

In this paper we present a two-phase spam image clustering framework. The proposed framework performs a histogram-based projective clustering on visual features in the first phase, followed by a text-based clustering in the second phase. This study makes several contributions. First, we address the complex nature of spam image obfuscation techniques. Second, a multi-clue framework is developed to profile spam images from common spamming sources, which provides evidence for tracking spam gangs. Third, projective clustering eliminates the need to choose among distance metrics for clustering analysis, while systematically exploring the subspaces that correspond to clusters.

1. Introduction and Motivation

“Image spam is a kind of email spam where the message text of the spam is presented as a picture in an image file” – Wikipedia.

Image spam accounted for more than 30% of all spam emails in 2006.

Look similar, but essentially not! Wavy images fail to be detected by text recognition algorithms, such as optical character recognition (OCR).

Challenges: Current anti-spam work relies on filtering techniques, such as text classification and image classification. Disadvantage: these techniques CANNOT tell the origins of spam.

Goal: Provide scientific evidence of the origins of spam. Assist in tracking down the common sources of spam based on spam image clustering.

[Figure: example spam images in Group 1, Group 2, and Group 3]

2. Multi-clue Framework

A histogram-based clustering framework:

1. Image preprocessing
   a. Wavy correction.
   b. Spam image segmentation – foreground and background.

2. Feature extraction
   • Color features: 6-bit color-code histogram.
   • Texture features: histogram of gradient direction, with each bin representing k degrees among 360 degrees.
   • Layout features: proportion of the foreground object pixels in each 9-grid cell.
   • Text contents: recognized by performing OCR.

3. Two-phase clustering
   a. Histogram-based projective clustering on visual features.
   b. Text-based clustering on extracted text information.
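The three visual features in step 2 can be sketched in a few lines of plain Python. This is an illustration only, not the authors' implementation. It assumes that the 6-bit color code keeps the two most significant bits of each RGB channel, that gradients are taken by central differences on a grayscale image, and that the layout feature is the fraction of each cell of a 3×3 grid covered by foreground pixels. All function names are hypothetical.

```python
import math

def color_code_histogram(pixels):
    """Normalized 64-bin histogram of 6-bit color codes.

    pixels: iterable of (r, g, b) tuples with channel values in 0..255.
    Assumption: the code is the top 2 bits of R, G and B concatenated.
    """
    hist = [0] * 64
    n = 0
    for r, g, b in pixels:
        code = ((r >> 6) << 4) | ((g >> 6) << 2) | (b >> 6)
        hist[code] += 1
        n += 1
    return [c / n for c in hist] if n else hist

def gradient_direction_histogram(gray, k=45):
    """Normalized histogram of gradient directions, one bin per k degrees.

    gray: 2-D list of intensities. Assumes k divides 360.
    Pixels with zero gradient are skipped.
    """
    bins = [0] * (360 // k)
    h, w = len(gray), len(gray[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]
            gy = gray[y + 1][x] - gray[y - 1][x]
            if gx == 0 and gy == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 360
            bins[int(angle // k) % len(bins)] += 1
    total = sum(bins)
    return [b / total for b in bins] if total else bins

def layout_features(mask):
    """Foreground fraction in each cell of a 3x3 grid, in row-major order.

    mask: 2-D list of 0/1 flags (1 = foreground pixel).
    """
    h, w = len(mask), len(mask[0])
    feats = []
    for gy in range(3):
        for gx in range(3):
            y0, y1 = gy * h // 3, (gy + 1) * h // 3
            x0, x1 = gx * w // 3, (gx + 1) * w // 3
            cell = [mask[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feats.append(sum(cell) / len(cell) if cell else 0.0)
    return feats
```

Concatenating the three outputs would give one visual feature vector per image; the bin size k and the grid interpretation are choices the poster leaves open.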

3. Wavy Image Correction

To extract the embedded text from wavy images, each vertical line must be realigned to its correct position. Two approaches are proposed to find the guideline on which the realignment is based: an edge-based method and a color-matching method.

4. Projective Clustering

[Figure: data objects in a 3-D space (axes X, Y, Z) projected onto lower-dimensional subspaces, with the signature of an example object O1]

A histogram-based projective clustering algorithm, REVBH (Relative Entropy on Variable Bin-width Histogram):

1. Constructing a variable bin-width histogram for each k-dimensional subspace (e.g. k=2).
2. Detecting dense areas iteratively in each histogram by using our proposed density threshold.
3. Converting each object into a signature that describes how that data object is projected into different subspaces.
4. Merging similar object signature entries.
5. Assigning data objects to corresponding clusters.
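Steps 3 to 5 can be illustrated with a small sketch. It assumes the dense areas of each 2-D subspace are already available as axis-aligned boxes, and it only groups objects whose signatures are identical; the poster does not specify the similarity rule used to merge signatures, so that part is left out. The data layout and function names are hypothetical.

```python
from collections import defaultdict

def build_signature(point, dense_regions):
    """Return a tuple with one entry per subspace.

    dense_regions maps a pair of dimension indices, e.g. (0, 1), to a list
    of boxes; each box is ((lo_a, hi_a), (lo_b, hi_b)) with half-open ranges.
    An entry is the 1-based index of the first dense box containing the
    projected point, or 0 if the point falls in no dense box.
    """
    sig = []
    for dims in sorted(dense_regions):
        region_id = 0
        for idx, box in enumerate(dense_regions[dims], start=1):
            if all(lo <= point[d] < hi for d, (lo, hi) in zip(dims, box)):
                region_id = idx
                break
        sig.append(region_id)
    return tuple(sig)

def group_by_signature(points, dense_regions):
    """Assign point indices to clusters keyed by their signature."""
    clusters = defaultdict(list)
    for i, p in enumerate(points):
        clusters[build_signature(p, dense_regions)].append(i)
    return dict(clusters)

# Example with two subspaces of a 3-D feature space.
regions = {
    (0, 1): [((0, 1), (0, 1)), ((5, 6), (5, 6))],
    (0, 2): [((0, 1), (0, 1))],
}
points = [(0.5, 0.5, 0.5), (0.2, 0.3, 0.9), (5.5, 5.5, 9.0), (3, 3, 3)]
clusters = group_by_signature(points, regions)
```

Because objects are compared through their signatures rather than through a distance in the full feature space, no distance metric has to be chosen at this stage.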

Partition on one dimension by using original histogram and equalized histogram.

The bin-width of each sub-range along one dimension is determined by using Freedman and Diaconis’s rule or Scott’s rule:

$$h = \max\{2 \times \mathrm{IQR} \times n^{-1/3},\ 3.5 \times \sigma \times n^{-1/3}\}$$
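As a worked illustration of the bin-width rule, the sketch below computes h for a list of values with Python's `statistics` module. It assumes IQR is the interquartile range, σ is the sample standard deviation, and n is the number of values in the sub-range; the quartile interpolation method is the module's default, which may differ from the one used by the authors.

```python
import statistics

def bin_width(values):
    """Larger of the Freedman-Diaconis and Scott bin widths."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    sigma = statistics.stdev(values)  # sample standard deviation
    scale = n ** (-1 / 3)
    return max(2 * iqr * scale, 3.5 * sigma * scale)

width = bin_width([1, 2, 3, 4, 5, 6, 7, 8])
```

Taking the maximum of the two rules keeps the width positive when the interquartile range collapses to zero, for example when most values in the sub-range are identical.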

Dense bins are detected in terms of a relative entropy metric. $h_r(x)$ and $H_r(X)$ represent the relative entropy of a single bin and of its corresponding k-dimensional histogram, respectively:

$$H_r(X) = \sum_{i=1}^{T} p(x_i) \log_2 \frac{p(x_i)}{q(x_i)}, \quad x_i \in X$$

$$h_r(x) = p(x) \log_2\bigl(T \times p(x)\bigr), \quad x \in X$$

$$h_{r\_low}(x) \le \frac{1}{T} H_r(X) \le h_{r\_high}(x)$$
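The relative entropy quantities can be computed directly from bin counts. The sketch below assumes that T is the number of bins and that the reference distribution is uniform over the bins, so that the per-bin value is p(x) log2(T p(x)) and the histogram value is the sum over bins. Flagging a bin as dense when its value exceeds the per-bin average (1/T)Hr(X) is only one possible reading of the threshold condition, not a confirmed detail of REVBH.

```python
import math

def bin_relative_entropy(p, T):
    """Relative entropy contribution of one bin with probability p."""
    return p * math.log2(T * p) if p > 0 else 0.0

def dense_bins(counts):
    """Return (H, dense) for a histogram given as a list of bin counts.

    H is the relative entropy of the histogram against a uniform
    distribution; dense lists the indices of bins whose own value
    exceeds the average H / T (an assumed decision rule).
    """
    T = len(counts)
    total = sum(counts)
    if T == 0 or total == 0:
        return 0.0, []
    h = [bin_relative_entropy(c / total, T) for c in counts]
    H = sum(h)
    threshold = H / T
    return H, [i for i, value in enumerate(h) if value > threshold]
```

A perfectly uniform histogram has zero relative entropy and yields no dense bins under this rule, while a histogram concentrated in a few bins flags exactly those bins.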

Edge-based method: Curved lines that were originally horizontal lines in the undistorted image serve as a guideline for image correction.

Color-matching method: This approach finds the best color match between two adjacent vertical lines by fixing one line and slightly shifting the other line upward or downward.
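The color-matching idea can be sketched on a grayscale image stored as a list of columns. This is a simplified illustration under stated assumptions: the match cost is the mean absolute intensity difference over the overlapping pixels, shifts are searched within a small window, ties prefer the smaller shift, and vacated pixels are filled with a background value. The function names and the `max_shift` and `fill` parameters are hypothetical.

```python
def best_shift(fixed, moving, max_shift=3):
    """Vertical shift of `moving` that best matches the column `fixed`.

    A positive shift moves the column content downward.
    """
    n = len(fixed)
    best, best_cost = 0, float("inf")
    # Try small shifts first so that ties keep the smaller displacement.
    for s in sorted(range(-max_shift, max_shift + 1), key=abs):
        pairs = [(fixed[y], moving[y - s]) for y in range(n) if 0 <= y - s < n]
        if not pairs:
            continue
        cost = sum(abs(a - b) for a, b in pairs) / len(pairs)
        if cost < best_cost:
            best, best_cost = s, cost
    return best

def shift_column(col, s, fill=255):
    """Shift a column by s pixels, filling vacated pixels with `fill`."""
    n = len(col)
    return [col[y - s] if 0 <= y - s < n else fill for y in range(n)]

def correct_wavy(columns, max_shift=3, fill=255):
    """Align each column to its already corrected left neighbor."""
    corrected = [list(columns[0])]
    for col in columns[1:]:
        s = best_shift(corrected[-1], col, max_shift)
        corrected.append(shift_column(col, s, fill))
    return corrected

# A dark stroke at rows 2-3, displaced two rows down in the second column.
left = [255, 255, 0, 0, 255, 255, 255, 255]
right = [255, 255, 255, 255, 0, 0, 255, 255]
aligned = correct_wavy([left, right])
```

Matching each column against its corrected neighbor lets the alignment propagate across the image, although errors can accumulate from left to right.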

5. Experimental Results

• 2100 spam images, including 37 wavy images.
• 476 classes labeled manually.
• All feature values are normalized to z-scores.
• Clustering results are evaluated by V-measure and the number of produced clusters.

Effectiveness of wavy image correction

Dataset                       V1-measure   Cluster # (class #: 476)
With corrected wavy images    0.9315       471
With original wavy images     0.9248       498
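For readers unfamiliar with the evaluation metric, the sketch below computes the V-measure as the weighted harmonic mean of homogeneity and completeness, both derived from conditional entropies of the class and cluster labels. It assumes that "V1" refers to the weight beta = 1; the function names are hypothetical, and the code does not reproduce the values in the table.

```python
import math
from collections import Counter

def _entropy(counts, n):
    """Shannon entropy (natural log) of a label distribution."""
    return -sum(c / n * math.log(c / n) for c in counts if c)

def v_measure(classes, clusters, beta=1.0):
    """V-measure of a clustering against manually labeled classes."""
    n = len(classes)
    class_counts = Counter(classes)
    cluster_counts = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    h_c = _entropy(class_counts.values(), n)
    h_k = _entropy(cluster_counts.values(), n)
    # Conditional entropies H(C|K) and H(K|C) from the joint counts.
    h_c_given_k = -sum(v / n * math.log(v / cluster_counts[k])
                       for (c, k), v in joint.items())
    h_k_given_c = -sum(v / n * math.log(v / class_counts[c])
                       for (c, k), v in joint.items())
    homogeneity = 1.0 if h_c == 0 else max(0.0, 1.0 - h_c_given_k / h_c)
    completeness = 1.0 if h_k == 0 else max(0.0, 1.0 - h_k_given_c / h_k)
    denom = beta * homogeneity + completeness
    if denom == 0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / denom
```

A score of 1 means every cluster contains a single class and every class falls in a single cluster; the score does not depend on how the cluster labels are named.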

Performance comparison between proposed approach and hierarchical clustering

[Figure: a) Original image; b) Foreground mask after segmentation; c) Resized illustration mask for layout feature extraction]

