1

Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover’s Distance (EMD)

Speaker: Po-Jiu Wang, Institute of Information Science, Academia Sinica
Author: Anthony Y. Fu, Department of Computer Science, City University of Hong Kong
IEEE 2006


2

Outline

What is phishing
Various phishing techniques
Previous anti-phishing works
Evaluating webpage distance with EMD
  What is EMD, and its advantage
  Color and its coordinate distance with EMD
Conclusion and tentative work to do

3

What is phishing

Phishing is a criminal trick for stealing personal information by getting people to visit a fake webpage.

How are people lured to the fake page?
Through phishing email, BBS, chat rooms, etc.
Spoofing: free gifts, identity confirmation, etc.

4

Various phishing techniques

The most straightforward way for a phisher to deceive people is to make links and webpages look like the real ones.

5

Various phishing techniques (Link based phishing obfuscation)

Link-based phishing obfuscation can be carried out in the following four ways:

Adding suffix to domain name of URL. E.g., revise www.citybank.com to www.citybank.com.us.ebanking;

Using actual link different from visible link. E.g., the HTML line: <a href="http://www.citibank.com.us.ebanking"> www.citibank.com</a>;

6

Various Phishing Techniques (Link based phishing obfuscation 1)

Using a bug in a real webpage to redirect to other webpages. E.g., a bug in the eBay website: http://cgi.ebay.com/ws/eBayISAPI.dll?MfcISAPICommand=RedirectToDomain&DomainUrl=PHISHINGLINK can direct you to any specified PHISHINGLINK;

And replacing similar characters in the real link. E.g., replace “I”s (uppercase “i”) with “l” (lowercase of “L”) or “1” (Arabic number one), such as WWW.CITIBANK.COM to WWW.C1TlBANK.COM.
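The link tricks listed above can be checked mechanically. The following is a minimal sketch, not a method from the talk: it flags anchors whose visible text names a different host than the actual `href`, and compares host names after mapping look-alike characters to one form. The names `LinkCollector`, `host_of`, `mismatched_links`, `skeleton`, and the small `LOOKALIKES` table are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(url_or_text):
    """Return the lowercase host; bare hosts such as 'www.x.com' are accepted."""
    text = url_or_text.strip()
    if "://" not in text:
        text = "http://" + text
    return (urlparse(text).hostname or "").lower()


def mismatched_links(html):
    """Return anchors whose visible text looks like a host but differs from the href host."""
    collector = LinkCollector()
    collector.feed(html)
    return [
        (href, text)
        for href, text in collector.links
        if "." in text and " " not in text and host_of(href) != host_of(text)
    ]


# Hypothetical, deliberately tiny table of look-alike characters.
LOOKALIKES = str.maketrans({"1": "i", "l": "i", "0": "o"})


def skeleton(host):
    """Map look-alike characters to one form so spoofed hosts compare equal."""
    return host.lower().translate(LOOKALIKES)
```

Two hosts that differ as written but share the same `skeleton` are candidates for the character-replacement trick; a real checker would need a much larger look-alike table.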

7

Various Phishing Techniques (webpage based obfuscation)

Webpage-based obfuscation can be carried out in the following three basic ways:

Using a webpage downloaded from the real website to make the phishing webpage look and behave exactly like the real one;

8

Various Phishing Techniques (webpage based obfuscation 1)

Using a script or browser add-in to cover the address bar, tricking users into believing they have entered the correct website;

And using visual content (e.g., images, Flash, video) rather than HTML to avoid HTML-based phishing detection.

9

Previous Anti-Phishing Works

Anti-spamming: Phishing email is spam. Phishers harvest email addresses and broadcast to the potential victims.

Human aided: Banks employ a group of people to monitor phishing activities, e.g., HSBC.

10

Previous Anti-Phishing Works (1)

Duplicate document detection approaches, which focus on plain text documents and use pure text features in the similarity measure.

11

Motivation

Phishing Web pages always have high visual similarity with the real Web pages.

An effective approach called image-based EMD is proposed to calculate the visual similarity of Web pages.

12

Evaluating webpage distance with EMD

EMD is the Earth Mover’s Distance; it is based on the well-known transportation problem.

Suppose we have m producers $P = \{(p_1, w_{p_1}), (p_2, w_{p_2}), \ldots, (p_m, w_{p_m})\}$
and n customers $C = \{(c_1, w_{c_1}), (c_2, w_{c_2}), \ldots, (c_n, w_{c_n})\}$.
A distance matrix $D = [d_{ij}]$ is given.

13

Evaluating webpage distance with EMD (transportation fee)

The task is to find a flow matrix $F = [f_{ij}]$ whose entry $f_{ij}$ gives the amount of product to be moved from producer $p_i$ to customer $c_j$.

14

Evaluating webpage distance with EMD (total cost of transportation fee)

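The total cost of transportation can be represented as:

$$COST(P, C, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}$$

subject to:

$$f_{ij} \ge 0, \quad 1 \le i \le m, \; 1 \le j \le n$$

$$\sum_{j=1}^{n} f_{ij} \le w_{p_i}, \quad 1 \le i \le m$$

$$\sum_{i=1}^{m} f_{ij} \le w_{c_j}, \quad 1 \le j \le n$$

$$\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\left(\sum_{i=1}^{m} w_{p_i}, \; \sum_{j=1}^{n} w_{c_j}\right)$$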
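As a small illustration of the cost and its constraints, the sketch below evaluates the total cost of a hand-written flow and checks the four constraints. The function names, the tolerance `tol`, and the toy numbers are assumptions made for this example only.

```python
def transport_cost(flow, dist):
    """Total cost: sum over i, j of f_ij * d_ij."""
    return sum(f * d for f_row, d_row in zip(flow, dist) for f, d in zip(f_row, d_row))


def is_feasible(flow, wp, wc, tol=1e-9):
    """Check the four transportation constraints for a flow matrix."""
    m, n = len(wp), len(wc)
    nonneg = all(f >= -tol for row in flow for f in row)
    # Each producer i ships at most its weight w_pi.
    rows_ok = all(sum(flow[i]) <= wp[i] + tol for i in range(m))
    # Each customer j receives at most its weight w_cj.
    cols_ok = all(sum(flow[i][j] for i in range(m)) <= wc[j] + tol for j in range(n))
    # The total flow equals the smaller of the two total weights.
    total = sum(sum(row) for row in flow)
    total_ok = abs(total - min(sum(wp), sum(wc))) <= tol
    return nonneg and rows_ok and cols_ok and total_ok


# Toy instance: two producers, two customers.
wp = [0.6, 0.4]
wc = [0.5, 0.5]
dist = [[0.0, 1.0], [1.0, 0.0]]
flow = [[0.5, 0.1], [0.0, 0.4]]
```

Here only the 0.1 units moved from the first producer to the second customer travel a nonzero distance, so they are the only contribution to the cost.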

15

Evaluating webpage distance with EMD (final equation of EMD)

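The EMD can be represented as:

$$EMD(P, C, D) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}$$

where $F = [f_{ij}]$ is the flow that minimizes $COST(P, C, F)$ under the constraints.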
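To make the definition concrete, here is a minimal pure-Python sketch that finds a cost-minimizing flow and returns the EMD. The slides do not say which solver was used; this example swaps in a successive-shortest-path min-cost flow (Bellman-Ford on the residual graph), which is adequate only for small signatures. The function name `emd` and the tolerance `eps` are assumptions.

```python
def emd(wp, wc, dist, eps=1e-12):
    """Earth Mover's Distance between weights wp (producers) and wc (customers)."""
    m, n = len(wp), len(wc)
    source, sink, size = m + n, m + n + 1, m + n + 2
    # Each edge is [to, residual capacity, cost]; edges k and k ^ 1 form a pair.
    edges, adj = [], [[] for _ in range(size)]

    def add_edge(u, v, cap, cost):
        adj[u].append(len(edges))
        edges.append([v, cap, cost])
        adj[v].append(len(edges))
        edges.append([u, 0.0, -cost])

    for i in range(m):
        add_edge(source, i, wp[i], 0.0)
        for j in range(n):
            add_edge(i, m + j, float("inf"), dist[i][j])
    for j in range(n):
        add_edge(m + j, sink, wc[j], 0.0)

    total_flow = total_cost = 0.0
    while True:
        # Bellman-Ford: cheapest path from source to sink in the residual graph.
        best = [float("inf")] * size
        via = [-1] * size
        best[source] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if best[u] == float("inf"):
                    continue
                for k in adj[u]:
                    v, cap, cost = edges[k]
                    if cap > eps and best[u] + cost < best[v] - eps:
                        best[v], via[v], changed = best[u] + cost, k, True
            if not changed:
                break
        if via[sink] == -1:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck capacity along the path, then push that much flow.
        push, v = float("inf"), sink
        while v != source:
            push = min(push, edges[via[v]][1])
            v = edges[via[v] ^ 1][0]
        v = sink
        while v != source:
            k = via[v]
            edges[k][1] -= push
            edges[k ^ 1][1] += push
            v = edges[k ^ 1][0]
        total_flow += push
        total_cost += push * best[sink]
    return total_cost / total_flow if total_flow > 0 else 0.0
```

Because the middle edges have unlimited capacity, the loop stops exactly when the total flow reaches the smaller of the two total weights, which is the last constraint of the transportation problem. Unequal total weights therefore yield a partial match.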

16

Advantage of EMD

Represent problems involving multi-featured signatures

Allow for partial matches in a very natural way

Fit for cognitive distance evaluation

17

Color and its coordinate distance with EMD (Preprocess image data)

Preprocess image data: compress the images to 10*10 pixels.

Experiments show that compressing the image size heavily reduces the calculation time without reducing precision and recall.
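A minimal sketch of the size compression step, assuming nearest-neighbor sampling (the slides do not name the resampling method) and an image stored as a list of rows of (A, R, G, B) tuples. `resize_nearest` is a hypothetical helper name.

```python
def resize_nearest(pixels, new_w, new_h):
    """Shrink an image (list of rows of ARGB tuples) by nearest-neighbor sampling."""
    old_h, old_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * old_h // new_h][x * old_w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]
```

Calling `resize_nearest(image, 10, 10)` would give the 10*10 thumbnail mentioned above; a production system would more likely use an averaging filter from an imaging library.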

18

The calculation of the distance of pixel color and coordinate

Get the signatures of webpage1 and webpage2 using pixel color and coordinate.
Calculate D = [dij], where dij = Distance(Color(pixeli), Color(pixelj), Coordinate(pixeli), Coordinate(pixelj)).
EMDColorAndCordinate = EMDDist(Signature1, Signature2, D)

19

The improved color space

The color of each pixel in the resized images is represented using the ARGB (alpha, red, green, and blue) scheme with 4 bytes (32 bits). A degraded color space is needed; its coarseness is set by the Color Degrading Factor (CDF).

Thus, the size of the degraded color space is $(2^8/CDF)^4$.
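As a rough illustration, the sketch below degrades a color by integer-dividing each 8-bit channel by CDF. Integer division is an assumption about how the degrading is done, and the two function names are hypothetical.

```python
def degrade(argb, cdf=32):
    """Map an (A, R, G, B) tuple of 8-bit channels to its degraded color."""
    return tuple(channel // cdf for channel in argb)


def degraded_space_size(cdf=32):
    """Number of degraded colors, (2^8 / CDF)^4; exact when CDF divides 256."""
    return (2 ** 8 // cdf) ** 4
```

With CDF = 32 each channel keeps 8 levels, so the 32-bit color space shrinks to 8^4 = 4096 degraded colors.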

20

The centroid of degraded color space

The centroid of each degraded color is calculated using:

$$C_{dc} = \frac{\sum_{i=1}^{N_{dc}} C_{dc,i}}{N_{dc}}$$

where
$C_{dc}$ is the centroid of degraded color dc,
$C_{dc,i}$ is the coordinates of the ith pixel that has degraded color dc,
$N_{dc}$ is the total number of pixels that have degraded color dc.
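The centroid formula can be sketched as follows, assuming the image is a list of rows of (A, R, G, B) tuples, coordinates are (x, y) pixel indices, and degrading is integer division by CDF. The function name `signature` and the returned layout are assumptions for illustration.

```python
from collections import defaultdict


def signature(pixels, cdf=32):
    """Return {degraded color: (centroid (x, y), pixel count N_dc)}."""
    sums = defaultdict(lambda: [0, 0, 0])  # running sum of x, sum of y, count
    for y, row in enumerate(pixels):
        for x, argb in enumerate(row):
            dc = tuple(channel // cdf for channel in argb)
            acc = sums[dc]
            acc[0] += x
            acc[1] += y
            acc[2] += 1
    return {dc: ((sx / n, sy / n), n) for dc, (sx, sy, n) in sums.items()}


# Toy 2x2 image: three opaque red pixels and one opaque blue pixel.
RED, BLUE = (255, 255, 0, 0), (255, 0, 0, 255)
toy = [[RED, RED], [BLUE, RED]]
```

The pixel count is kept alongside each centroid because the formula divides by $N_{dc}$, and a count is a natural candidate for the weight of a signature entry.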

21

Computing visual similarity from EMD

First, the normalized Euclidean distance of the degraded ARGB colors is calculated, and then the normalized Euclidean distance of centroids is calculated.

22

The maximum color distance

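Suppose feature $i$ is $\langle dc_i, C_{dc_i} \rangle$, where $dc_i = \langle dA_i, dR_i, dG_i, dB_i \rangle$, and feature $j$ is $\langle dc_j, C_{dc_j} \rangle$, where $dc_j = \langle dA_j, dR_j, dG_j, dB_j \rangle$. The maximum color distance $MD_{color}$ is the largest distance possible between two degraded colors.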

23

The normalized color distance

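The normalized color distance $ND_{color}$ is defined as the Euclidean distance between two degraded colors divided by the maximum color distance $MD_{color}$:

$$ND_{color}(dc_i, dc_j) = \frac{\lVert dc_i - dc_j \rVert}{MD_{color}}$$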

24

The normalized centroid distance

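The maximum centroid distance is $MD_{centroid} = \sqrt{w^2 + h^2}$, where w and h are the width and height of the resized images, respectively. The normalized centroid distance $ND_{centroid}$ is defined as the Euclidean distance between two centroids divided by $MD_{centroid}$:

$$ND_{centroid}(C_{dc_i}, C_{dc_j}) = \frac{\lVert C_{dc_i} - C_{dc_j} \rVert}{MD_{centroid}}$$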

25

Final equation of EMD

The two distances are added up with weights p and q, respectively, to form the feature distance, where p + q = 1:

$$d_{ij} = p \cdot ND_{color} + q \cdot ND_{centroid}$$
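A minimal sketch of the weighted feature distance, assuming each normalized distance is the Euclidean distance divided by its maximum. The maximum color distance is passed in as `md_color` because its formula is not preserved in these slides; `feature_distance` is a hypothetical name.

```python
import math


def feature_distance(dc_i, c_i, dc_j, c_j, md_color, w, h, p=0.5, q=0.5):
    """Weighted sum of normalized color distance and normalized centroid distance."""
    nd_color = math.dist(dc_i, dc_j) / md_color
    # The maximum centroid distance is the image diagonal sqrt(w^2 + h^2).
    nd_centroid = math.dist(c_i, c_j) / math.hypot(w, h)
    return p * nd_color + q * nd_centroid
```

With p + q = 1 and both terms normalized, the result stays in [0, 1] as long as `md_color` really is an upper bound on the color distance.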

26

Computing EMD-based visual similarity of two images

$\alpha \in (0, +\infty)$ is the amplifier of visual similarity.

27

An improved adjusted threshold for classification

A special threshold for each given protected web page is used to classify a web page as a phishing web page or a normal one.

$T_i \; (1 \le i \le N_{protected})$ denotes the threshold of the ith protected Web page:

$$T_i = \arg\min_{t} \big(MissClassification(t)\big), \quad t \in VSS_i$$

28

Two types of misclassifications

False alarm: the visual similarity is larger than or equal to t but, in fact, the web page is not a phishing Web page (false positive).

Missing: the visual similarity is less than t but, in fact, the web page is a phishing one (false negative).

VSSi correlates to two accessory parameters: the number of false alarms and the number of false negatives.
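One way to read the threshold selection is sketched below. It assumes that the candidate thresholds are the similarity values observed for one protected page, and that the misclassification count is the number of false alarms plus the number of missed phishing pages; neither detail is spelled out in the slides. `train_threshold` is a hypothetical name.

```python
def train_threshold(scores, labels):
    """Pick the candidate threshold with the fewest misclassifications.

    scores: visual similarities of training pages to one protected page.
    labels: True where the training page is a phishing page.
    """

    def misclassifications(t):
        false_alarms = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        misses = sum(1 for s, y in zip(scores, labels) if s < t and y)
        return false_alarms + misses

    # Ties are resolved in favor of the lowest candidate threshold.
    return min(sorted(set(scores)), key=misclassifications)
```

Preferring the lowest threshold among ties leans toward fewer missed phishing pages at the price of possible extra false alarms.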

29

The way to classify phishing page

When a suspected web page comes, its visual similarity vector is computed, which can be represented as

$$VS = \langle vs_1, vs_2, \ldots, vs_{N_{protected}} \rangle$$

and the classification result is obtained using the following equation:

$$IsPhishing(VS) = \begin{cases} 1 & \text{if } \max(VS - T) \ge 0 \\ 0 & \text{if } \max(VS - T) < 0 \end{cases}$$
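The classification rule can be sketched in a few lines, reading it as "phishing if any similarity reaches its page's threshold", i.e. max(VS - T) >= 0. The function name and the two example similarity vectors are made up for illustration; the thresholds are the trained values reported later in the talk.

```python
def is_phishing(vs, thresholds):
    """Return 1 if some similarity vs_i reaches its threshold T_i, else 0."""
    return 1 if max(v - t for v, t in zip(vs, thresholds)) >= 0 else 0


# Trained thresholds for the 8 protected pages, as listed in the talk.
T = [0.8469, 0.9434, 0.9493, 0.7385, 0.9323, 0.9573, 0.8541, 0.9255]

# Hypothetical similarity vectors for two suspected pages.
close_to_ebay1 = [0.30, 0.95, 0.40, 0.20, 0.50, 0.10, 0.30, 0.60]
unrelated_page = [0.30] * 8
```

The first vector exceeds the second threshold (0.95 against 0.9434), so it is classified as phishing; the second stays below every threshold.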

30

Experiment configuration of phishing detection performance

10,272 homepages are selected from the web.
9 phishing web pages are used, which target 8 real protected web pages.
The 10,272 + 9 web pages are mixed together to form the Suspected Webpage Set.
1,000 web pages randomly selected from the 10,272 are combined with the 9 phishing web pages to form the Training Webpage Set.

31

Train a threshold vector

We use the Training Webpage Set to train a threshold vector.

Protected Webpage                  Threshold (T)
real-Bank of Oklahoma - Online     0.8469
real-ebay1                         0.9434
real-eBay2                         0.9493
real-ICBC(Asia)                    0.7385
real-Key Bank                      0.9323
real-us bank                       0.9573
real-Washington Mutual             0.8541
real-Wells Fargo Sign On           0.9255

32

Classification precision, phishing recall, and false alarm list( = 0, 9281 Suspected Web Pages)

33

Reduce false negative possibilities!

Classification precision, phishing recall, and false alarm list( = 0.005, 9281 Suspected Web Pages)

34

Phishing detection performance of image-based EMD

There are 65 false alarms

35

Phishing detection performance of HTML/DOM-based EMD

There are 849 false alarms

36

Phishing detection performance of similarity assessment-based EMD

There are 697 false alarms

37

Experiment results

The threshold vector T is used to classify a suspected webpage.

In order to reduce false negative possibilities, a sacrifice is necessary under the 0.005 setting.

The parameters are empirically set to w = h = 100, $\alpha$ = 0.5, |Ss| = 20, p = q = 0.5, and CDF = 32 in our experiments by tuning.

38

The number of ground truth web pages for each protected web page

39

The configuration of tuning the parameters

Take $N_{sample} = 5, 10, \ldots, 50$ as the sample number for each protected web page.

If a web page in the $N_{sample}$ collected web pages is in the corresponding ground truth group, it is counted as a correctly detected similar web page.
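The counting rule can be sketched as below, assuming the collected pages are the N_sample pages ranked most similar to a protected page (the slides do not say how they are collected). `correct_at` and the page identifiers are hypothetical.

```python
def correct_at(ranked_pages, ground_truth, n_sample):
    """Count pages among the first n_sample that belong to the ground truth group."""
    return sum(1 for page in ranked_pages[:n_sample] if page in ground_truth)


# Hypothetical ranking for one protected page, most similar first.
ranked = ["a", "b", "c", "d", "e", "f"]
truth = {"a", "c", "f"}
```

Repeating the count for each sample number (5, 10, and so on up to 50) and for each parameter setting gives the curves that the tuning compares.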

40

Tuning the parameters (w and h)

We have four configuration options (w = h = 10, $10\sqrt{10}$, 100, and $100\sqrt{10}$) to tune w and h.

41

Tuning the parameters (p and q)

Eleven configuration options (p : q = 0 : 1; 0.1 : 0.9; 0.2 : 0.8; ...; 0.9 : 0.1; 1 : 0) are used to tune p and q.

42

Tuning the parameters (sample color number)

Six configuration options (|Ss| = 5, 10, 15, 20, 25, and 30) are used to tune |Ss|.

43

Tuning the parameters (CDF)

Eight configuration options (CDF = 8, 16, 24, 32, 40, 48, 56, and 64) are used to tune CDF.

44

The architecture of the built anti-phishing system

45

Conclusions

This approach works at the pixel level of Web pages rather than at the text level.

Experiments show that our method can achieve satisfactory classification precision and phishing recall.

The time efficiency of computation is also acceptable for online phishing detection.

46

Tentative works

Continue with more phishing examples and even larger scale datasets.

The method cannot detect phishing pages that are not visually similar to the protected pages.

Keep working on developing a client-side application.

47

Thanks for your attention.