
CHAPTER 2

LITERATURE SURVEY

2.1 INTRODUCTION

Phishing is one of the luring techniques used by phishing artists with the intention of exploiting the personal details of unsuspecting users. A phishing website is a mock website that is similar in appearance to a legitimate one but different in destination. Criminals dupe computer users into revealing sensitive information including bank accounts, social security numbers, credit card numbers, passwords, and more (Tommie 2005). Phishing attacks cost millions of dollars in losses to business organizations and end users (Geer 2005). Unsuspecting users submit their data thinking that these websites come from trusted financial institutions. Only specialists can identify these types of phishing websites immediately, but not all web users are specialists in computer engineering, and hence they become victims by providing their personal details to the phishing artists (Santhana and Vijaya 2011).

Phishing is continuously growing since it is easy to copy an entire website using the HTML source code. By making slight changes in the source code, it is possible to direct the victim to a phishing website. Phishers use a lot of techniques to lure the unsuspecting web user. They send generic greetings asking customers to check their account immediately. They also send threatening messages of account cancellation, directing the users to update their account immediately to avoid cancellation. Data mining techniques can improve the assessment of phishing attacks (Xi Chen et al 2011).


Phishing is a serious threat to both users and enterprises, and numerous anti-phishing techniques have been developed. In general, the techniques are classified as either list-based or heuristics-based technologies. List-based techniques maintain a blacklist, a whitelist, or both. A lot of anti-phishing methods use a blacklist to prevent users from accessing phishing sites. These techniques follow a quantitative approach for evaluating the phishing possibility of a given website using refined security risk elements for the domain and web page. The design and implementation of a website risk assessment system for anti-phishing is also included (Young-Gab 2012). Heuristic-based mechanisms employ several criteria to determine whether a website is a phishing site or not (Chun-Ying et al 2011). A CAPTCHA authentication application designed in an economical mode protects the security-unconscious user by enabling safe online banking authentication, thereby addressing online banking threats. Overcoming the general ignorance of security warnings, as well as ensuring safe online banking authentication even on a compromised host, are the prime challenges of a secure online banking system. The proposed hardware solutions are not feasible for home users due to their exorbitant cost (Leung 2013). Several commercial digital forensics software suites are available for examining digital media related to computer crimes. Although these tools provide examiners with extensive capabilities for forensic examinations, they can have significant drawbacks in terms of training, initial costs of the tool, and yearly maintenance upgrades. Alternatively, there are Free and Open Source Software (FOSS) tools with equivalent functionality that examiners can use to perform most of the same tasks possible with commercial applications (Philili et al 2007).

2.2 DIRECTIONS OF THE WORK

Detecting and preventing phishing websites has always been an important area of research. Various phishing detection techniques provide abundant and essential ways for effectively detecting attacks and protecting the confidential information of individuals and organizations. The URL plays an important role in phishing: it is the major element through which websites are reached, and through hyperlinks pages are redirected to the next page. Page redirection is the vulnerable concept in phishing, i.e., through a hyperlink the pages are redirected to either the legitimate site or the phishing site. Numerous phishing sites are added each day. This has motivated many researchers to switch their focus to finding phishing sites. Many of the phishing detection techniques that were applied to websites were used by the research community (either with necessary modifications or with new proposals) to protect individuals and organizations from great loss.

Our research work also aims to explore ways to avoid phishing in e-mail. Based on the research articles, it is understood that secret information is collected through e-mail by tempting users with attractive advertisements. This has motivated a literature study covering phishing detection and prevention in client-side and server-side techniques. The rest of the sections discuss the major works carried out relating to URL verification, parse tree validation, behavioral response, one-time password mechanisms, watermarking mechanisms, preventing phishing through session hijacking, and e-mail phishing.

2.2.1 Blacklist and Whitelist

Blacklists and whitelists are probably the most straightforward solution for anti-phishing. A whitelist contains URLs of known legitimate sites while a blacklist contains those of known phishing sites. Many current anti-phishing technologies rely on the combination of a whitelist and a blacklist. Representative blacklist- or whitelist-based systems include PhishTank, SiteChecker, Google Safe Browsing (www.google.com/safebrowsing), FirePhish and CallingID Link Advisor (www.callingid.com). These anti-phishing solutions are usually deployed as toolbars or extensions of web browsers to remind the users whether they are browsing a safe website. Blacklists suffer from a window of vulnerability between the time a phishing site is launched and the site's addition to the blacklist. A blacklist of phishing sites also requires frequent updating but still cannot include new phishing sites in a timely manner. Similarly, a whitelist also needs to update its content on a large scale; unfortunately, it cannot include all legitimate sites (Liu et al 2010).
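
A minimal sketch of list-based checking is shown below; the example lists and host names are illustrative assumptions, and real deployments would query maintained services such as Google Safe Browsing rather than local sets.

```python
# Minimal sketch of list-based URL filtering with illustrative local lists;
# production systems query maintained blacklist/whitelist services instead.
from urllib.parse import urlparse

WHITELIST = {"www.paypal.com", "signin.ebay.com"}    # known legitimate hosts
BLACKLIST = {"paypa1-verify.example.net"}            # known phishing hosts

def classify(url: str) -> str:
    """Return 'legitimate', 'phishing', or 'unknown' for a URL."""
    host = urlparse(url).netloc.lower()
    if host in WHITELIST:
        return "legitimate"
    if host in BLACKLIST:
        return "phishing"
    # Window of vulnerability: a newly launched phishing site falls
    # through to here until the blacklist is updated.
    return "unknown"

print(classify("https://signin.ebay.com/ws/eBayISAPI.dll"))  # legitimate
```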

IP addresses or domain names are used to block websites, but it is not possible to block the whole domain. Blacklists are updated at different speeds and vary in coverage: 47% to 83% of phishing sites appear on blacklists 12 hours after the initial test (Lorrie et al 2006). Some systems check whether a webpage is a phishing page or a legitimate one based on its content, HTTP transactions and search engine results. The URL is checked against the blacklist to see whether it is present or not (Mingxing et al 2011), and the page is classified based on the decision.

Automatic phishing classifiers are built and currently used to evaluate phishing websites and maintain the blacklist. This blacklist contains each page's URL and hosting information. To evaluate each page, the classifier considers features regarding the page's URL, content and hosting information (Colin 2010). The request and response pairs sent to the blacklist server often include lookup strings and responses for domains such as microsoft.com.

The “Blacklist” feature set consists of binary variables for membership in six blacklists (and one whitelist) run by Spamhaus. These lists combined provide at best a 20% error rate. When the blacklists are combined with the SpamAssassin Botnet plugin features in the “Blacklist+Botnet” feature set, the error rate improves to 12%. These small feature sets are reasonably effective, but including features that result in very large feature sets significantly improves classification accuracy (Justin et al 2009). The main trouble with crawling and blacklisting is that the anti-phishing organizations are caught in a race against the attackers. Unfortunately, there is always a window of vulnerability during which users are susceptible to attacks. Furthermore, the methods are only as effective as the quality of the lists that are maintained (Angelo et al 2007). The NetCraft anti-phishing toolbar (www.toolbar.netcraft.com) prevents phishing attacks by utilizing a centralized blacklist of current phishing URLs. Other examples include Websense, McAfee's anti-phishing filter, the NetCraft anti-phishing system, Cloudmark SafetyBar, and Microsoft Phishing Filter (Pan and Ding 2006). To combine the blacklist and heuristic approaches, a hierarchical blacklist-enhanced phish detection framework was developed. The key insight behind this detection algorithm is to detect phish in a probabilistic fashion with very high accuracy (Guang et al 2010).

Although checking against a blacklist and a whitelist is an efficient method, checking only the blacklist is not sufficient, because not all phishing URLs are available in the blacklist. Moreover, if there is any problem in reaching the blacklist (such as a connection error), the system cannot give the correct result.

2.2.2 Visual Similarity

The look and feel of a website gives the conviction to the victims that they are visiting a legitimate website. The three metrics used to measure visual similarity are layout similarity, block level similarity and overall style similarity. Webpage segmentation forms the base to define these metrics. Salient blocks form the structure of a webpage; the weighted average of the similarities between the paired blocks is known as block level similarity, whereas the ratio between the total number of blocks and the weighted number of matched blocks is known as layout similarity. The histogram of the style feature helps in calculating the overall style similarity, i.e. the normalized correlation coefficient of the histograms of two webpages (Liu et al 2005). The potential phishing pages are compared against the actual pages to assess the visual similarities between them using the metrics of key regions, overall style and page layouts (Liu et al 2006).

The conversion of webpages into low-resolution images (Fu et al 2006) helps in representing the image signatures using color and coordinate features. The signature distances between the images of the webpages are calculated using the Earth Mover's Distance (EMD). Whenever the user is about to submit data to an untrusted website, warnings are issued based on the overall visual appearance, which includes the styles, images and text pieces (Eric et al 2008).
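
The sketch below conveys the signature-distance idea in a deliberately simplified form: it compares 1-D intensity histograms with the Wasserstein (Earth Mover's) distance, whereas Fu et al. use richer color-and-coordinate signatures. The bin count and toy inputs are illustrative assumptions.

```python
# Simplified sketch: compare two rendered pages by the Earth Mover's
# Distance between their pixel-intensity histograms.
import numpy as np
from scipy.stats import wasserstein_distance

def page_signature(image: np.ndarray, bins: int = 32):
    """Histogram signature of a low-resolution page screenshot."""
    hist, edges = np.histogram(image.ravel(), bins=bins, range=(0, 255))
    centers = (edges[:-1] + edges[1:]) / 2
    return centers, hist / hist.sum()

def emd(img_a: np.ndarray, img_b: np.ndarray) -> float:
    ca, wa = page_signature(img_a)
    cb, wb = page_signature(img_b)
    return wasserstein_distance(ca, cb, u_weights=wa, v_weights=wb)

# Two random 100x100 "screenshots"; a small distance suggests similarity.
a = np.random.randint(0, 256, (100, 100))
b = np.random.randint(0, 256, (100, 100))
print(emd(a, b))
```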

EMD involves the Human Computer Interaction, text and graphics levels (Revathy and Guhan 2012). The three steps involved in the preprocessing of web pages are:

• Performing normalization,

• Getting the image of the webpage from the given URL, and

• Representing the image of the webpage as the visual signature of the webpage (including the color and coordinate features).

The text section with its attributes and the visual image contribute to the overall look and feel of the webpage. Comparing the background color, the font size, the width and height of the image, and its position in the page identifies similarities (Radha et al 2011).


The five features of the webpage that are considered for checking visual similarity are the signature extraction, the style of text pieces, the embedded images of the page, the URL keywords and the browser-rendered visual appearance of the page. Since victims are usually tricked by the look and feel of the website, importance is given to visual perception.

A DOM-Tree representation is considered to calculate the layout similarity of two websites. However, there is also the possibility of getting the same layout from websites with two different DOM-Trees (Angelo et al 2007). The visual similarity between two pages can be found using gestalt theory, which considers the webpage to be a single indivisible entity (Teh-Chung et al 2010).

Image and visual similarities are used in some methods to detect phishing (Venkatesan et al 2000). The colors of images provide visual similarity (Chen et al 2009), and the colors of the webpages are also used to find similarities (Huang et al 2010). There is also an effective image-based anti-phishing scheme which uses the discriminative keypoint features of a web page. Here the invariant content descriptor and Contrast Context Histogram are used to identify the similarity between pages. But the common approach is finding similarity using vectors related to the images and the distance between them, which is taken as the degree of visual difference (Mallika et al 2012).

The above techniques work only if the pages are visually similar, without considering the source code. Therefore, these techniques will fail if the attackers deliberately introduce visual differences between the webpages.

2.2.3 URL Verification

URL obfuscation has become a key trick among all the tricks used in phishing activities. Therefore, equipping the user by creating awareness about obfuscated URLs and how to determine the true nature of strange URLs is the need of the hour (Ed Skoudis 2012). URL analysis involves analyzing all the formats of the URL, but mostly only the links related to login information (Debra et al 2009). While the differences and similarities between URLs are identified with the help of the trusted site, the LinkGuard algorithm is generally applied for analyzing the common characteristics and hyperlinks (Juan and Guo 2006). Content matching techniques and DNS queries are also used to identify malicious URLs, apart from using regular expressions and hash maps for isolating the syntactic variations (Pawan et al 2010). Phishwish uses neither training nor a whitelist or blacklist (Debra et al 2008). PageSafe is a tool that relies on user input to find the legitimacy of a URL (Sengar and Vijay 2010).

Registering a similar domain to trick the user into a fraudulent site is becoming common, as is using the @ symbol for redirection. For example, in the case of http://www.paypal.com@123.123.123.123 the user may still feel that they are visiting the site www.paypal.com, but is actually being directed to a site with 123.123.123.123 as the IP address. Therefore, checking the URLs for any special characters gains importance now (Chun-Ying et al 2011).
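
A short sketch of this special-character check follows; browsers treat everything before an '@' in the authority component as user information, so the host that is actually contacted is whatever comes after it.

```python
# Sketch of the '@' redirection check described above: in
# "http://www.paypal.com@123.123.123.123/", the browser ignores the text
# before '@' and connects to 123.123.123.123.
from urllib.parse import urlparse

def suspicious_authority(url: str) -> bool:
    # userinfo that looks like a familiar host name is a classic lure
    return "@" in urlparse(url).netloc

print(suspicious_authority("http://www.paypal.com@123.123.123.123/"))  # True
print(suspicious_authority("https://www.paypal.com/"))                 # False
```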

The following points were considered in the Phish Market Model (Tyler and Tal 2010):

• Only specific URLs requested by the receiving party are shared.

• The providing party is not given information about which URLs were given to the receiving party.

• The number of URLs given to the receiving party is tallied in a secure way.

• URLs belonging to the receiving party are not counted.


In anti-phishing toolbars, the user interface should be properly designed to give warnings and options for customizing the settings by the user. The main components in anti-phishing client-side applications include the main user interface, critical warnings, and help systems (Linfeng and Marko 2007).

One technique guesses a class label for suspect websites by analyzing their textual and visual contents. When certain conditions are met, the website is assumed to be a phishing website. Examples of the different types of URLs are as follows (Haijun et al 2011):

• https://signin.ebay.com

• https://www.paypal.com/c2

• https://ssl.rapidshare.com/premiumzone.html

• http://www.hsbc.co.uk/1/2/HSBCINTEGRATION/

• https://login.yahoo.com

• https://www.mybank.alliance-leicester.co.uk/index.asp

• https://www.optuszoo.com.au/login

• https://steamcommunity.com

Naive users deploy security toolbars or phishing filters as one of the techniques for protection against phishing attacks. But here, misleading information can be given to the victim by forging the results provided to the security toolbars with the help of poisoned DNS cache entries after setting up a rogue wireless Access Point (AP) (Saeed and Suku 2010). Therefore, analyzing the login URLs alone need not give the correct classification.


2.2.4 Content-based Approach

CANTINA, a novel content-based methodology for detecting phishing websites, examines the content of a web page to decide whether it is genuine or fake. This is in contrast to other approaches that only take a shallow view of the surface characteristics of a web page, for example the URL and its domain name. CANTINA makes use of the well-known Term Frequency/Inverse Document Frequency (TF-IDF) algorithm that is very commonly used in information retrieval and, more specifically, the Robust Hyperlinks technique for overcoming broken hyperlinks. Heuristics can detect phishing attacks as soon as they are launched, without the need to wait for blacklists to be updated. However, attackers may be able to craft their attacks to evade heuristic detection. In addition, heuristic approaches frequently produce false positives (wrongly labeling a legitimate site as phishing). Blacklists may have a higher level of accuracy, but usually necessitate human intervention and confirmation, which may consume an immense amount of resources. At a recent Anti-Phishing Working Group meeting, it was stated that phishers are beginning to use one-time URLs, which direct someone to a phishing site the first time the URL is used but direct people to the legitimate site afterwards. This and other novel phishing tactics significantly complicate the procedure of compiling a blacklist and can decrease blacklists' effectiveness (Yue et al 2007).
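
A hedged sketch of CANTINA's lexical-signature idea follows, using scikit-learn's TfidfVectorizer; the toy corpus, the value of k, and the stubbed search-engine step are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of the TF-IDF lexical-signature idea: extract the top-k TF-IDF terms
# of a page, then (in the real system) query them in a search engine and
# check whether the page's own domain appears among the top results.
from sklearn.feature_extraction.text import TfidfVectorizer

def lexical_signature(page_text, corpus, k=5):
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(corpus + [page_text])
    weights = tfidf.toarray()[-1]          # TF-IDF row for the page
    terms = vec.get_feature_names_out()
    top = weights.argsort()[::-1][:k]
    return [terms[i] for i in top if weights[i] > 0]

corpus = ["welcome to online banking", "update your account details now"]
page = "please verify your paypal account password to avoid suspension"
print(lexical_signature(page, corpus))
# If the page's domain is absent from the search results for these terms,
# CANTINA-style systems flag the page as phishing.
```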

The CANTINA method and the anomaly-based phishing page detection method are used to find whether a page is legitimate or phishing (Mingxing et al 2011). The unstructured content is converted into structured content, and the content is converted into a term-document matrix. The matrix entries are the frequencies of terms occurring in a collection of documents; the rows correspond to the documents and the columns correspond to the terms (Kancherla et al 2012).


Heuristics-based schemes employ the characteristics of the content to detect a phishing attempt. The content can be any communication from an unverified party such as an e-mail, web page or SMS. Clearly, some communication types are easier to verify than others; for example, a web page has several characteristics that provide proof of authenticity, such as an SSL certificate, the domain name and so on. The heuristics measured are hyperlink and anchor text mismatch, presence of external links, an IP address in the URL, and the number of dots in the URL (Yue et al 2007; Pamunuwa et al 2007).
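
A minimal sketch of a subset of these heuristics is given below, assuming an HTML string as input; the attribute regexes and the dot-count threshold are illustrative simplifications, not values from the cited works.

```python
# Sketch of surface heuristics: IP address in a link, external links, and
# an excessive number of dots in a link's host name.
import re
from urllib.parse import urlparse

IP_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def heuristics(html: str, page_domain: str) -> dict:
    hosts = [urlparse(h).hostname or ""
             for h in re.findall(r'href="([^"]+)"', html)]
    return {
        "ip_in_url": any(IP_RE.match(h) for h in hosts),
        "external_links": sum(1 for h in hosts if h and page_domain not in h),
        "many_dots": any(h.count(".") > 4 for h in hosts),
    }

html = '<a href="http://203.0.113.7/login">secure login</a>'
print(heuristics(html, "mybank.com"))
# {'ip_in_url': True, 'external_links': 1, 'many_dots': False}
```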

Numerous industrial anti-phishing products implement toolbars in web browsers. However, a few researchers have revealed that security toolbars do not effectively prevent phishing attacks. One model portrayed phishing by visualizing and enumerating a given site's threat, but this method still does not provide an anti-phishing solution (Liu et al 2006). Term Frequency/Inverse Document Frequency is the most important method for judging the significance of words (Gerard and Michael 1986).

A loosely coupled semantic data model that can semantically link resources and derive implied semantic links with respect to a set of relational reasoning rules is used for content matching (Zhuge 2009).

A number of researchers have carried out experimental studies and established that if a user is trained to identify phishing attacks, the probability of falling victim in the future is significantly lowered (Kumaraguru et al 2008). Despite initial concern that a threshold effect may result from the promotion of anti-phishing information, results show that consumers grow more trusting when organizations show more response to phishing threats (Emiley et al 2008).


The hybrid approach consists of an identity-based detection component and a keywords-retrieval detection component, both manipulating the DOM after the webpage has been rendered in Internet Explorer to get around intentional obfuscation. It relies on identity recognition to find the domain of the page's claimed identity, and inspects the authenticity of the webpage by comparing this extracted domain with the page's own domain via queries executed in search engines (Guang and Jason 2009).

The content of the pages is checked with the PageRank algorithm (Page et al 1999), and the pages are retrieved with two search engines: the first is a title-based search and the second is a full-text search with the Google search engine. Google uses a number of factors to rank search results, including standard IR measures, proximity and PageRank.

Another phishing recognition system is based on URL domain identity and webpage image matching. At first, it recognizes similar authorized URLs using a divide-and-rule methodology and then applies a string matching algorithm. For the similar URL and the input URL, the IP addresses are resolved. If the IP addresses do not match each other, the input could be a phishing URL and a phishing alert is generated. Then, a snapshot of the suspected URL's webpage is taken as an image, its keypoints are identified, and their characteristics are extracted using a descriptor. These traits are then matched with the features of the certified webpage. If the match exceeds a threshold value, the webpage is considered phishing (Madhuri and Bhaskar 2011).
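
A hedged sketch of the first stage appears below; the authorized list, the use of difflib's similarity ratio as the string matcher, and the treatment of unresolvable hosts are all illustrative assumptions.

```python
# Sketch of the URL stage: find the most similar authorized host by string
# similarity, then compare resolved IP addresses; a mismatch raises an alert,
# after which the image-matching stage would run.
import socket
from difflib import SequenceMatcher

AUTHORIZED = ["www.paypal.com", "signin.ebay.com", "login.yahoo.com"]

def closest_authorized(host: str) -> str:
    return max(AUTHORIZED, key=lambda a: SequenceMatcher(None, host, a).ratio())

def ip_mismatch(input_host: str) -> bool:
    match = closest_authorized(input_host)
    try:
        return socket.gethostbyname(input_host) != socket.gethostbyname(match)
    except socket.gaierror:
        return True  # unresolvable hosts are treated as suspicious here

print(ip_mismatch("www.paypa1.com"))  # look-alike host -> likely True
```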

To address the problem robustly, it is important to build a state-of-the-art model using a Neuro-Fuzzy scheme with five inputs. Neuro-Fuzzy combines fuzzy logic with a neural network. The five inputs are tables where features are stored: legitimate site rules, a user-behaviour profile, PhishTank, user-specific sites and pop-ups from emails. The advantage of these five inputs is that they are wholly representative of phishing techniques and strategies. Further, training and testing experiments were performed using a 2-fold cross-validation method based on an Adaptive Neuro-Fuzzy Inference System to measure the system's accuracy and robustness. Cross-validation is a testing method, and signifies a group of methods; in this case it is used to address over-fitting problems. An Adaptive Neuro-Fuzzy Inference System is a hybrid intelligent system which has the ability for reasoning and learning (Barraclough et al 2013).

Though various research works have been done based on content using different techniques, more features are needed to correctly classify pages as phishing or legitimate. In CANTINA, users might still fall victim to fraud if they do not understand what the toolbar is trying to communicate; the problem is not with the user interface alone. After converting the document into structured content, the resulting data set values need to be fed into the classification model. Considering these necessities, there is a need for a system that focuses strongly on detecting phishing based on content.

2.2.5 Source code based approach

The Semantic Link Network (SLN) is a self-organized data model (Liu et al 2010) for semantically organizing web resources. In this method the SLN is constructed and reasoned over in three major steps:

• The web pages associated with the suspicious webpage are retrieved from two resources: one is the forward links contained in the suspicious web page, and the other is a powerful search engine, which returns candidate web pages with text content similar to the suspicious web page.

• The Semantic Link Network is constructed from the suspicious web page and its associated web pages.

• Reasoning is conducted to mine the implicit association relations, which are defined as the relations among all web pages, including the suspicious web page and its associated web pages.

The method of (Mona and Omar 2011) performs source code checking in the following ways (a code sketch follows the list):

• It searches all the images in the website; if any images are loaded from another website or have links to another place, this is considered a phishing character, so all images should be in the website folder, like this: <img src="Logo.PNG" border="0" width="1243" height="302">.

• The program checks the login or submit button, whose action should be on the website, like login.php. If the button action links to an IP like 103.838.39.0/login.php, or to an email or script, it is considered a phishing character.

• It also checks for iframe, domain and script tags and popup windows; if one is found, it is considered a phishing character.

• If there is more than one phishing characteristic of the same type, such as more than one popup window, it is also considered a phishing character.
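
The sketch below mirrors these checks using only the standard library; the tag regexes, the scoring scheme and the example HTML are illustrative simplifications of the cited method.

```python
# Sketch of the source-code checks: external images, form actions pointing
# at raw IPs or mailto targets, and iframe/popup usage each add a
# "phishing character" to the page's score.
import re

def phishing_characters(html: str, own_domain: str) -> int:
    score = 0
    # images loaded from another website instead of the site's own folder
    for src in re.findall(r'<img[^>]+src="([^"]+)"', html, re.I):
        if src.startswith("http") and own_domain not in src:
            score += 1
    # form/button actions pointing at an IP address or an email target
    for action in re.findall(r'<form[^>]+action="([^"]+)"', html, re.I):
        if re.search(r"\d{1,3}(\.\d{1,3}){3}", action) or action.startswith("mailto:"):
            score += 1
    # iframes and popup windows
    score += len(re.findall(r"<iframe", html, re.I))
    score += len(re.findall(r"window\.open", html, re.I))
    return score

html = ('<img src="http://evil.example/logo.png">'
        '<form action="http://203.0.113.7/login.php"></form>')
print(phishing_characters(html, "mybank.com"))  # 2
```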


Four methods are discussed in (Sujata et al 2007): page based, domain based, type based and feature based. The page-based method relies on the page ranking algorithm; the domain-based method checks whether the domain is available in the domain table; the type-based method checks the URL; and the feature-based method checks words. For example, 'Login' and 'SignIn' are very frequently found in phishing websites.

A component assessment model (Indranil and Alvin 2008) was built to determine how well individual banks' websites presented materials related to security information in the context of anti-phishing.

A method (Faisal et al 2012) relies on the website structures and features (i.e. bank name, branch name, base URL, address) represented in Resource Description Format (RDF) to decide on a site's legitimacy. RDF allows XHTML authors to mark up human-readable data with machine-readable indicators for browsers and other programs to interpret. Each bank profile ought to be activated and added to the RDF knowledge base once the bank has obtained authorization from the country's Central Bank. An RDF extractor is employed to cull out RDF characteristics from the specified web page of the bank before the user views the website in the browser, to restrict any scripting hooks. The extracted RDF traits are then converted into an RDF ontology and entered into the system for the subsequent decision-making process.

The key elements that should be addressed in a game design framework to avoid phishing attacks have been collected. The game design framework is evaluated using phishing attacks and game-based anti-phishing education, and was formulated to thwart phishing attacks (Nalin and Steve 2013).


From the above research studies, it is clear that finding phishing webpages based only on source code is a tedious process, because the user cannot retrieve the source code of all websites. Forming the network and finding the suspicious pages also takes too much response time.

2.2.6 Behavioral based Approach

One anti-phishing scheme is largely built upon the recognition of identity-pertinent anomalies, scrutinizing how anomalies are manifested in DOM objects and HTTP transactions. The work consists of an identity extractor and a page classifier. The identity of a web site is defined as a set of words, highlighted by a number of objects or features of the web page. If the features extracted from a web site do not match the original identity, the web site is a suspected web site. The Support Vector Machine (SVM) classifier takes the feature vector as input and the phishing label as output. The identity extractor has a lower success rate when processing legitimate pages. This process neither requires online transactions nor requires users to change their navigation behavior. If the search engine returns a URL with the same domain, the web page under checking is genuine with an overwhelmingly high probability (Pan and Ding 2006). Information extraction and retrieval techniques (Guang and Jason 2009) are applied to detect phishing pages. The DOM of a downloaded page is inspected to make out its identity through different attributes and classify the actual domain based on that identity. This hybrid approach consists of an identity-based detection component and a keyword-retrieval detection component. The suspected page's domain result is then compared with the previous result; if the result sets mismatch, the downloaded page is phishing. The basic idea is to first locate entity names in DOM text nodes that are most likely to represent the site brand name, then find domains for those names via searching, and compare the matching domains with the page domain to find identity inconsistency. The whitelists are collected from Google Safe Browsing, Millersmiles, and a white domain service.

The SpoofGuard tool (Neil et al 2004) suspects phishing web pages based on heuristics and computes a spoof value from the matched heuristics. The features include stateless evaluation, stateful evaluation and input data heuristics matching. If the value exceeds a limit, a page is suspected to be phishing. This framework is for client-side protection. The stateless method determines whether a downloaded page is doubtful, while the stateful method evaluates a downloaded page in light of previous user activity. The stateless method consists of a URL check, image check, link check and password check. The stateful method consists of a domain check and the referring page. The last method consists of an outgoing password check, interaction with the image check, and a check of all post data. This process exploits the prevalence of unauthenticated email and weak website authentication.
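
A hedged sketch of SpoofGuard-style scoring follows; the check names, weights and threshold are illustrative assumptions, not SpoofGuard's actual parameters.

```python
# Sketch of heuristic score aggregation: each matched check contributes a
# weight, and the page is flagged when the total spoof value exceeds a limit.
CHECKS = {
    "url_check": 0.3,       # stateless: suspicious URL pattern
    "image_check": 0.2,     # stateless: logos copied from another domain
    "domain_check": 0.3,    # stateful: domain not seen in user history
    "password_check": 0.4,  # input data: password posted to a new domain
}
THRESHOLD = 0.5

def spoof_value(matched: set) -> float:
    return sum(weight for name, weight in CHECKS.items() if name in matched)

score = spoof_value({"url_check", "password_check"})
print(score, "-> phishing" if score > THRESHOLD else "-> ok")  # 0.7 -> phishing
```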

There is also work on rendering a spam classifier ineffective by means of a very precise attack framework, using indiscriminate attacks, focused attacks and an optimal attack function. All of these assume that the training model used for the spam filter is based on a naive Bayes classifier. The defense filters out dictionary attack messages with complete success, and the dynamic threshold defense mitigates the effect of the dictionary attacks. The main idea is that the attack node first sends traffic that causes Autograph to mark it suspicious, and then sends traffic resembling legitimate traffic, resulting in rules that cause denial of service (Nelson et al 2008). The Heuristic-Systematic Model (HSM) is a model of information processing that integrates two information processing modes: heuristic processing and systematic processing. In the context of phishing attacks, false messages are used with the intention of misleading the message recipients. Six heuristics were analyzed with 105 people listed in a school's staff and faculty distribution lists. This approach supplies the theoretical support for future research in which both qualitative and quantitative data will be collected to more meticulously inspect and evaluate the research model and hypotheses (Xin et al 2013).

The behavior model approach is explained using the notion of a Finite State Machine (FSM), one that records the submission of forms with random inputs and the corresponding responses. The tool, named PhishTester, evaluates both phishing and legitimate websites. This approach results in zero false positive and false negative warnings and requires negligible manual effort. It also detects some XSS-based phishing attacks, but it does not handle random input submission to forms that contain CAPTCHAs (Hossain and Mohamed 2012). A structure was also created for the source code based on redirect pages, considering straddling attacks, pharming attacks, the use of onMouseOver to hide links, and server form handlers (Maher et al 2010).

The User Behavior Based Phishing Websites Detection (UBPD) system detects phishing websites based on users' behaviours, not the incoming data that attackers can manipulate freely. Violation of the binding relationships cannot be hidden no matter what techniques phishers choose to use. As the evaluation proves, UBPD is able to consistently detect phishing webpages regardless of how they are implemented, as long as they ask for authentication credentials. In contrast, detection systems based on incoming data may find it difficult to deal with novel and sophisticated spoofing techniques. Since UBPD analyses the identity of websites using both IP addresses and domain names, it can detect pharming attacks, which are undetectable by many existing systems. Being independent of the incoming data also means low maintenance cost: the system does not need updating when attackers vary their techniques, so there are far fewer evasion techniques to deal with (Xun et al 2008).


A hybrid of blacklisting, heuristics and moderation-based phishing prevention (Gaurav and Josef 2011) classifies an incoming web page as phish or genuine at a speed great enough to stay ahead of the phishers. It is implemented as a plug-in P in the browser B. The browser performs preliminary heuristic and blacklist checks: if the URL is in the whitelist, it is loaded; if it is in the blacklist, it is blocked; otherwise the URL is sent to a controller for secondary heuristic checks.

However, there exist unique circumstances where the inputs or expected outputs are not known in advance; testing is challenging for these “non-testable” programs (Weyuker 1982). In order to illustrate consumer behavior, an empirical field study has been carried out to see whether factors that account for successful marketing campaigns may also account for successful social engineering attacks (Michael 2008).

Apart from these existing clustering approaches, various needs will arise with the behavior of the user, requiring new techniques to be developed that function beyond their current scope.

2.2.7 Data Mining Approach in Phishing

Data mining techniques are known to improve the evaluation of phishing attacks (Xi Chen et al 2011). Supervised classification, a commonly adopted stream of data mining, is applied to assess the intensity of various phishing attacks. A hybrid approach is characterized by a combination of key phrase extraction and the supervised classification method. It utilizes both the textual description of the phishing attack and the financial data of the company under study so as to assess the intensity of the phishing attack, in accordance with the risk level and financial-loss-generating potential of the subject. The adversary-aware classifier model based on Support Vector Machines (SVM) is viewed as a signaling game where the beliefs, strategies and probabilities of the message types are all rationalized and built in as prior knowledge as soon as a new e-mail arrives. This permits the classifier to modify the margin error parameter dynamically as the game progresses. More specifically, this is considered an important aspect in the misclassification constraint of the optimization problem of the SVM algorithm (Sebastian and Gaston 2013).

Various data mining techniques such as decision trees, machine learning and rule induction can be functional features of any model. These can deliver appropriate solutions to numerous business queries that are traditionally time-consuming in nature. The most important e-banking phishing website characteristic indicators are accounted for by analyzing massive databases and historical data solely for training purposes (Maher et al 2010). Data mining techniques may be used to filter out phishing emails that contain fraudulent messages. This is done by analyzing the headers of emails, and consequently might limit the spread of malicious emails and stop crimes such as phishing and distributed denial of service attacks (Airoldi and Malin 2004). Researchers have also concluded that the education levels of customers, standardization of technology, and sharing of information are among the most significant policies that determine such attacks (Liao and Luo 2004). To validate a URL embedded in an email, logistic regression (Garera et al 2007) and decision trees (Ludl et al 2007) have been utilized. The Anti-Phishing Detector (APD) system embeds the Association Rule Mining (ARM) technique in an Instant Messenger (IM) in order to detect deceptive phishing (Mohd and Lakshmi 2012).

The imputation method employs the available data in order to approximate any missing values. The simplest method of imputation is mean imputation, in which all missing values of a variable are replaced by the average of all known values of that variable (Little and Rubin 2002). In the k-nearest neighbor (k-nn) method, the missing values are replaced using the nearest neighbors only; these nearest neighbors are chosen from the complete cases that minimize the distance function (Batista and Monard 2003).
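
Both imputation methods are available in scikit-learn; the toy matrix below is an illustrative assumption, used only to show the two imputers side by side.

```python
# Sketch of mean imputation and k-nn imputation on a toy feature matrix
# with missing entries (np.nan).
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [8.0, 9.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_filled)  # each gap replaced by its column mean
print(knn_filled)   # each gap taken from the two nearest complete rows
```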

Some researchers have even implemented a different, hybrid methodology that combines data and text mining. Text mining alone has been used to inspect company news and make inferences about social networks among different companies (Ma et al 2009).

Many classification methods have been developed with the help of several learning algorithms such as boosting, SVM, Bayesian methods and decision trees. Such Machine Learning (ML) algorithms are principally learning schemes that adopt sets of rules (Kaitarai 1999). Yet another case is the Random Forest (RF), an ML algorithm built from an ensemble of several decision tree predictors. In this case, each tree is based on the values of a random vector sampled independently with the same distribution for every tree in the forest. This technique has tremendous precision amongst existing ML algorithms. It is also an effective method to estimate any missing data, and maintains accuracy when a large segment of data is absent (Huan and Lei 2005).
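
A brief sketch of a random forest phishing classifier follows; the feature vectors (dot count, presence of '@', URL length) and labels are fabricated purely for illustration.

```python
# Sketch: train a random forest on toy URL feature vectors
# [dots_in_host, has_at_symbol, url_length]; label 1 = phishing.
from sklearn.ensemble import RandomForestClassifier

X = [[5, 1, 78], [1, 0, 22], [6, 1, 95], [2, 0, 30]]
y = [1, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[4, 1, 60]]))  # classify an unseen URL's features
```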

The Instance Based Learning (IBL) algorithm is a variant of the nearest neighbor algorithm where, in addition to the conventional scheme, the method (i) normalizes the ranges of attributes, (ii) processes instances incrementally, and (iii) has a simple policy for tolerating missing values (Kruegel et al 2005). The Decision Tree (DT) algorithm is a simple algorithm based on a set of rules, which is advantageous owing to the sequential structure of the decision tree branches. Conditions and actions that are significant are linked directly to supplementary conditions and actions, while insignificant conditions are ignored. The boosting method is known to be a well-established scheme for improving the performance of any particular classification algorithm. It constructs a highly accurate classification rule by combining various simple and moderately accurate hypotheses into a single one (Islam and Zhou 2008). Different types of machine learning algorithms give high accuracy in detecting phishing websites (Daisuke et al 2009). Several existing techniques also consider the many features that can be conceived, while recognizing a relevant and representative subset of features to create a precise classifier (Ram et al 2012).

2.2.8 Single Password Mechanism

The One-Time Password (OTP) is a recent and important financial security measure used to defend against session attacks; on the other hand, it is inflexible to implement differentiated OTP creation mechanisms (Yunlim et al 2013). The Single Password Protocol (SPP) allows a user to use one single password (for a single user name) for all accounts, reducing the risk of malicious server attacks. A typical SPP uses two basic techniques: challenge/response and one-time server-specific tickets. The working mechanism is as follows. Consider 'P' to be the single password that a client 'C' remembers. When 'C' registers with a server 'S', it generates a challenge and ticket verification information, which are sent to 'S' and stored in its password file. Later, when 'C' tries to log in to 'S', 'S' prompts 'C' with the stored challenge. 'C' then uses the challenge, the server's name 'S' and the password 'P' to compute a one-time server-specific ticket, which is sent to 'S' together with a new challenge and new ticket verification information. 'S' verifies the received ticket by means of the stored ticket verification data. If the ticket is valid, the verification of 'C' is successful, and 'S' replaces the stored challenge and ticket verification information with the new values received from 'C' (Mohamed et al 2007).


It is essential to note that a phishing attack can only be successful if the attacker has access to the user's account name, the identity of the secondary channel through which the user receives the one-time password, and the password used to access the secondary channel. These requirements set hurdles for the phishing process. Furthermore, as preset passwords are not used, phishers can only obtain users' login names (Chun-Ying et al 2011).

In one-time password protocols, a client may use a different password for every authentication with a server. Two major protocols are Lamport's one-time password protocol (Lamport 1981), implemented in One-Time Passwords In Everything (McDonald et al 1995), and Rubin's one-time password protocol (Rubin 1996). Both one-time password protocols were developed prior to the deployment of SSL and are aimed at preventing eavesdropping attacks, an issue now largely suppressed by the confidentiality provided by the widely deployed SSL.
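
A minimal sketch of Lamport's hash-chain scheme follows; the seed, chain length and use of SHA-256 are illustrative choices.

```python
# Sketch of Lamport's one-time passwords: the server stores h^n(seed); the
# client reveals h^(n-1)(seed), h^(n-2)(seed), ... in turn, and each revealed
# value is verified by hashing it once and comparing with the stored value.
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def chain(seed: bytes, n: int) -> bytes:
    for _ in range(n):
        seed = h(seed)
    return seed

seed, n = b"secret-seed", 1000
server_value = chain(seed, n)      # registered once with the server

otp = chain(seed, n - 1)           # client's first one-time password
assert h(otp) == server_value      # server verifies the OTP
server_value = otp                 # stored value updated; next login uses n-2
print("OTP accepted")
```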

The anti-phishing framework based on Visual Cryptography (Divya and Mintu 2012) conserves the confidential information of users using triple-layered security. The first layer checks whether a website is a genuine/secure website or a phishing website. If the website is a phishing/fake website, the image captcha is generated for the specific user (the user who aims to log in). The image captcha is generated by the stacking of two shares: one held by the user and the other held in the actual database of the website. The second layer cross-validates the image captcha corresponding to that particular user. The image captcha can be read by human users only: a human user accessing the website can read the image captcha and verify whether the site and the user may be permitted. So, using the image captcha technique, no machine-based user can crack the password or other confidential information of the users. Finally, the third layer of security averts intruders' attacks on the user's account. This technique provides further security by restricting the intruder from logging into the account even after gaining access to the particular user's username.
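
The sketch below illustrates the two-share idea with XOR-based splitting; classic visual cryptography stacks printed subpixel patterns rather than XORing arrays, so this is a simplification of the cited scheme.

```python
# Simplified sketch of two-share captcha splitting: either share alone is
# random noise; combining (here, XOR) both shares reveals the captcha.
import numpy as np

rng = np.random.default_rng(0)
captcha = rng.integers(0, 2, (8, 8), dtype=np.uint8)            # binary captcha

share_user = rng.integers(0, 2, captcha.shape, dtype=np.uint8)  # kept by user
share_server = captcha ^ share_user                             # kept by site

assert np.array_equal(share_user ^ share_server, captcha)
print("captcha recovered by stacking the two shares")
```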


A cryptographic scheme called Password-Authenticated Key Exchange (PAKE) allows users to use familiar ID/password-based access while avoiding the leak of any password information to communication peers or eavesdroppers. The protocol is developed as a natural extension to the current HTTP authentication schemes such as Basic and Digest access authentication (RFC 2617). In order to use the PAKE mechanism for such purposes, a modification was made to prevent credential forwarding (man-in-the-middle) attacks (Yutaka 2009).

The stratagem of password hashing produces a different password for each session, which maintains state between client and server for request-response communication while also strengthening validation. Users are thereby freed from developing or updating database values at any time (Nitikesh et al 2013).
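
A hedged sketch of per-session password hashing follows; the use of SHA-256 over a stored verifier plus a fresh server nonce is an illustrative construction, not the cited scheme's exact design.

```python
# Sketch: bind the password hash to a fresh server nonce so the value sent
# each session differs, and replaying a captured proof fails next session.
import hashlib, os

def session_proof(password: str, nonce: bytes) -> bytes:
    verifier = hashlib.sha256(password.encode()).digest()  # stored server-side
    return hashlib.sha256(verifier + nonce).digest()       # differs per session

nonce = os.urandom(16)                       # server issues a new nonce
client_proof = session_proof("s3cret", nonce)
server_check = session_proof("s3cret", nonce)
print(client_proof == server_check)          # True for this session only
```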

Another variant, the graphical password, uses a personal image to construct an image hash, which is provided as the input to a cryptosystem that then returns a password. Such graphical passwords require the user to choose a certain number of points on the image; an embedded device then converts these points into a long alphanumeric password. With one such graphical password, a user can generate several passwords from their respective embedded devices. The image hash algorithm employed by the device is meant to produce random and unique 256-bit message digests and to be responsive to subtle changes in the underlying image (John et al 2011).

A new single password-based anti-phishing protocol that is capable of overcoming such problems, and is further secured against possible attacks, is also in use. In this methodology, for each login, the client machine's browser generates a dynamic identity and a dynamic password, which differ for the same client across sessions of the Secure Socket Layer protocol. For other online accounts, the client's password verifier data is stored on the server, protected with a nonce value, a hash function and the private key of the server (Sandeep et al 2011).

Since it is imperative to accommodate the needs of every user, password protection is essential. Hence, there is a need for a better one-time password system that promises to resolve the security problem and enable safe banking-type transactions.

2.2.9 E-Mail Phishing

E-mail has been the most used Internet service in the recent past, and has been the main resource exploited in phishing (Cleber et al 2011). The Simple Mail Transfer Protocol (SMTP), which is commonly adopted for sending e-mails, allows anyone to forge the sender's address (Herzberg 2009). The methods used to spread phishing through e-mails are quite similar to spam messaging (Kanich et al 2008); phishing by itself can be viewed as a sub-category of spam, and the two are perhaps inter-linked (El-Alfy and Abdel 2011). Some phishing e-mails are restricted according to other specified requirements (Jasmine 2008).

An integrated information-processing model of phishing susceptibility (Arun et al 2011) studied how people process phishing emails. The factors that shape the cognitive evaluation process, and the outcomes of such an evaluation, affect an individual's susceptibility to phishing. Individuals' levels of motivation, personality, beliefs, knowledge, and day-to-day experience may influence their cognitive and information-processing activities. The integrated model was studied using a structural equation modeling technique on a sample of intended victims of an actual phishing email that occurred at a well-established university in the northeast USA. This type of model concentrates on four contextual factors: (i) technological efficacy, (ii) the individual's level of involvement, (iii) email load, and (iv) domain-specific knowledge. The email inbox is undoubtedly a dangerous place, but using pattern recognition tools it may be possible to filter out a major portion of the elements that would damage end users (Sebastin 2013).

The features identified in e-mail headers and bodies (Cleber et al 2011) are listed below; a feature-extraction sketch follows the list:

• Hyperlinks with visible text such as a URL, but pointing to a URL dissimilar from the visible text

• E-mail body coded in HTML format

• Overly long URLs

• The number of domains and subdomains in the URL

• Hyperlinks with any visible text, but pointing directly to an IP-based URL

• Images originating from an IP address

• A sender domain different from a URL domain in the message body

• Images with an external domain different from the URLs in the message body
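
The sketch below extracts a subset of these features from a raw message using only the standard library; the regexes and the 75-character URL-length threshold are illustrative simplifications.

```python
# Sketch of e-mail feature extraction: HTML body, IP-based links, overly
# long URLs, and a sender domain absent from the linked URLs.
import email, re
from email import policy

IP_URL = re.compile(r"https?://\d{1,3}(\.\d{1,3}){3}")

def email_features(raw: str) -> dict:
    msg = email.message_from_string(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("html", "plain"))
    text = body.get_content() if body else ""
    sender_domain = msg.get("From", "").rsplit("@", 1)[-1].strip(">")
    urls = re.findall(r'href="([^"]+)"', text)
    return {
        "html_body": body is not None and body.get_content_type() == "text/html",
        "ip_based_link": any(IP_URL.match(u) for u in urls),
        "long_url": any(len(u) > 75 for u in urls),
        "foreign_sender": any(sender_domain not in u for u in urls),
    }

raw = ('From: alerts@bank.com\nContent-Type: text/html\n\n'
       '<a href="http://203.0.113.7/x">login</a>')
print(email_features(raw))
```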

A robust multi-stage content-driven methodology may be utilized as a filter on email servers and web servers to automatically detect phishing messages and discover the impersonated entities in such messages. The methodology makes use of entity extraction and topic discovery for feature extraction: named entities are extracted using a Conditional Random Field (CRF) and the topics are modeled using Latent Dirichlet Allocation (LDA). A robust classifier is developed using the topic distribution probabilities and the extracted entities as features, with AdaBoost as the classification method. Once content has been categorized as phishing, the methodology employs the CRF to identify the impersonated entity, i.e., the organization that the phishing attack is impersonating (Venkatesh and Harry 2012). Each phishing email is viewed as a text file so as to spot each header element and distinguish it from the body of the message. The substrings within the subject header and the message body that are enclosed by white space contain only English alphabetic characters or apostrophes. The email message is then classified in a sequential fashion (Rafiqul and Jemal 2013).

The Phishing Evolving Clustering Method is another approach wherein the functions are considered with respect to the level of uniformity between two groups of phishing e-mail characteristics. This model proved to be powerful in terms of classifying e-mails into ham and phishing e-mails, its speed, and its use of a one-pass algorithm. The model was developed to work in online mode and showed the capability to learn continuously without using up high memory, as it works on a one-pass algorithm. Hence, data may be accessed from memory and rule creation is done depending on the evolution of the profile if the features of the phishing e-mail have been altered. This technique, however, needs continuous feeding (Ammar et al 2012).

E-mail object similarities, used to reduce the feature size, suggest an innovative framework called the Phishing Dynamic Evolving Neural Fuzzy framework. It is capable of identifying and predicting “zero-day” phishing email in online mode, with a real implementation as an evolving connectionist system. Such a connectionist architecture tries to simplify evolving procedures with knowledge discovery: it can be a neural network or a set of networks that work continuously in time and have the capability to adapt their functionality and structure through continuous interaction within and outside the system. A hybrid (supervised/unsupervised) learning approach may also take the merit of both fuzzy logic and machine learning, while considering the level of uniformity between features of phishing emails (Ammar et al 2013). This process is based on the level of uniformity among four groups of phishing e-mail characteristics, and is sub-divided into stages: the first (pre-processing) stage produces the long vector, the second stage (e-mail object similarities) produces the short vector, and the third stage generates the rules.

The hybrid methodology further takes advantage of applying various processing techniques to multiple data sources. Its uniqueness comes from the emulator method, which is used to categorize web pages addressed by links present in e-mails. The verification of all data types present in e-mails permits the classification to be independent of external information sources such as verification lists or web reputation servers, and thereby leads to a much more effective decision-making process. Besides the learning abilities of the system, it permits straightforward upgrading when a new processing method becomes available or when new features included in future phishing attacks must be detected. At present, this methodology is used to develop webpage classifiers. In this case, the webpage textual content is classified as economic or non-economic by combining the predictions made by various classifiers, one for each of the four permitted textual parts of a webpage. If the webpage is economic and has forms that require confidential information to be entered, the emulator fills in the forms with fictitious data and then submits them online (Dolores et al 2007).

Server-side classifiers and filters basically extract features from an e-mail and then train a classifier to identify phishing e-mails. SpamAssassin (spamassassin.apache.org) is a well-recognized host-level filter. It is a rule-based filter that essentially needs constant changes for the rules to remain effective; attackers try to find out the underlying rules and then bypass such filters by carefully crafting an email. PILFER (Fette et al 2007) is another email classifier that has been trained using features taken from email data. Both these filters showed reasonable misclassification rates. PhishGILLNET (Ramanathan and Wechsler 2012) is another robust content classifier that utilizes a topic-modeling method. It handles unlabeled examples and hence saves the time and labor involved in human annotation.
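
A toy sketch of the rule-based idea behind such host-level filters is shown below; the rules, weights, and threshold are invented for illustration and are not SpamAssassin's actual rule set.

    import re

    # Each matching rule adds its weight; mail scoring above the threshold is flagged.
    RULES = [
        (re.compile(r"verify your account", re.I), 2.5),
        (re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}"), 3.0),  # raw-IP URL
        (re.compile(r"urgent|immediately", re.I), 1.0),
    ]
    THRESHOLD = 3.0

    def score(mail_text):
        return sum(weight for pattern, weight in RULES if pattern.search(mail_text))

    msg = "Please verify your account immediately at http://203.0.113.9/login"
    print(score(msg), score(msg) >= THRESHOLD)   # 6.5 True -> flagged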

There is a need to educate users about different phishing attacks and the patterns of phishing emails. The basic way this can be achieved is by posting help pages on websites and warning users about phishing. Robila and Ragucci (2006) evaluated the effect of user education on differentiating phishing from good e-mails. The study found that students recognized phishing emails correctly after the lecture, and the students also acknowledged the usefulness of the lecture. Comparable studies were also performed at Indiana University (Jagatic et al 2007).

User security awareness against phishing e-mails is an extremely essential matter, as the threat not only applies technical subterfuge to claim victims, but also tries to extract the victim's confidential information by means of social engineering. However, a recent study revealed that users remain defenseless against phishing even after having attended a training program (Dodge et al 2007).

A chief reason phishing attacks succeed is that they are designed to exploit human cognitive biases rather than pitfalls in the technology itself. Phishing offenders frequently impersonate credible figures and broadcast manipulative messages via emails, instant messages or short messages to large groups. While the legitimacy of these messages may not be difficult to disprove with some examination, victims are habitually caught off-guard at first glance. Phishing therefore manipulates human tendencies and information processing, and psychological and behavioral factors in general play a vital role. Research works have endeavored to investigate factors such as the effect of experiential and dispositional cues (Wright and Marett 2010).

Missing values in data sets may be filled in using a technique called missing data imputation. With 75% of software projects reporting overruns, there is a reasonable demand from industry for accurate software project effort prediction. Unfortunately, an obstruction to precise effort estimation is incomplete and small (with respect to the number of cases) software engineering data sets. Consequently, in order to improve effort prediction, missing data should be dealt with carefully. Although a wide range of missing data procedures have been proposed, none of them specifically focuses on missing values in small data sets with both continuous and nominal values. For this reason, an imputation method was designed solely to handle this common problem observed in software engineering data sets. Software data sets are very frequently characterized by their small size; on the other hand, sophisticated imputation methods prefer larger data sets. Thus simple methods are explored to impute missing data into small project data sets. A class mean imputation technique, built on the k-NN imputation method, is introduced in order to assign both nominal and continuous missing values in small data sets. The use of an incremental approach to enhance the variance of the population is worth mentioning (Qinbao and Martin 2007).
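
As a generic illustration of k-NN imputation (not the class mean variant above, which additionally conditions on the class label), the following sketch uses scikit-learn's KNNImputer on a small placeholder data set:

    import numpy as np
    from sklearn.impute import KNNImputer

    # A small project data set with missing continuous values (placeholder numbers).
    X = np.array([[120.0,  4.0, np.nan],
                  [ 95.0, np.nan, 3.0],
                  [110.0,  5.0, 2.5],
                  [np.nan, 4.5, 2.8]])

    imputer = KNNImputer(n_neighbors=2)   # each gap is filled from the 2 nearest cases
    print(imputer.fit_transform(X))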

As another facet, occupational fraud turns out to be a $652 billion problem, of which disgruntled employees account for a major share. Much security research addresses reducing fraud opportunities and improving fraud detection. However, detecting motivational factors such as employee disgruntlement is less researched. The Sarbanes–Oxley Act requires that companies retain email, creating an untapped resource for fraud detection. Here, protocols to spot disgruntled communications have been developed. Messages cluster in accordance with their disgruntled content and thus give confidence in the value of email for such a task.

A highly precise naïve Bayes model predicts whether messages contain disgruntled communications, providing extremely relevant data that may not otherwise be exposed in a fraud audit. The model can be included in a fraud risk analysis system to improve its capability to detect and prevent fraud. The naïve Bayes model was preferred for predicting the membership of each e-mail message in the disgruntled or non-disgruntled category, according to its similarity to the samples of message content revealed in the training set. To apply naïve Bayes prediction, the sparse table is first sampled, with one set of randomly selected rows for training and another set for prediction. Naïve Bayes performance is only faintly enhanced by logarithmic smoothing, which avoids the setback of handling probabilities that are so small as to be nearly indistinguishable from zero; such a model enhancement could nonetheless be retained. Mbx is a common non-proprietary electronic mailbox format in which messages are concatenated in a single file, the beginning of a new message being denoted by a "From" line. This study develops and successfully tests a text mining instantiation that shows great promise in a new and difficult domain where the need for solutions is both significant and urgent (Carolin 2009).
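
A minimal sketch of such a naïve Bayes text model, with the log-space computation that motivates logarithmic smoothing, is given below using scikit-learn; the two-message corpus and labels are placeholders, and MultinomialNB's alpha is Laplace smoothing rather than the paper's exact scheme.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train = ["this company treats us like dirt and management knows it",
             "quarterly report attached as discussed in the meeting"]
    y = [1, 0]                                  # 1 = disgruntled, 0 = neutral

    vec = CountVectorizer()
    clf = MultinomialNB(alpha=1.0)              # additive (Laplace) smoothing
    clf.fit(vec.fit_transform(train), y)

    test = vec.transform(["management never listens to us"])
    # Log-probabilities: sums of logs avoid underflow from near-zero products.
    print(clf.predict(test), clf.predict_log_proba(test))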

This section presents a framework based on a Bayesian approach for content-based detection of phishing web pages. The model considers both the textual and the visual contents in order to compare the protected web page with other suspicious web pages. An image classifier, a text classifier, and an algorithm fusing the results from the two classifiers have been introduced. Such a methodology is essential in order to determine the class of a web page and identify whether it is phishing in character. The naive Bayes rule is employed in the text classifier to compute the probability that a web page is phishing. In the image classifier, the earth mover's distance is employed to quantify visual resemblance, and a Bayesian model is then designed to determine the threshold. In the data fusion algorithm, Bayes theory is used to amalgamate the classification results from the textual and visual content. The text classifier is thus modeled by the naive Bayes rule, while the image classifier, using a probabilistic model derived from basic Bayesian theory, computes the visual similarity between the given web page and the protected web page. However, this technique relies on separate classifiers for text and images, and the complexity increases when the outputs from the image and text classifiers are merged (Haijun et al 2011).
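
The fusion step can be illustrated with a small sketch: assuming the two classifiers output posterior phishing probabilities and are conditionally independent given the class, Bayes' rule combines their likelihood ratios. The numbers below are invented placeholders, not results from the cited work.

    def bayes_fuse(p_text, p_image, prior=0.5):
        """Fuse two posterior phishing probabilities under naive (independence) Bayes."""
        prior_odds = prior / (1.0 - prior)
        lr_text = (p_text / (1.0 - p_text)) / prior_odds      # text classifier's likelihood ratio
        lr_image = (p_image / (1.0 - p_image)) / prior_odds   # image classifier's likelihood ratio
        odds = prior_odds * lr_text * lr_image                # fused posterior odds
        return odds / (1.0 + odds)

    print(bayes_fuse(0.9, 0.7))   # about 0.95: agreeing classifiers strengthen the verdict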

The features taken into account for e-mail phishing (Toolan and Carthy 2009) are mentioned below; a feature-extraction sketch follows the list:

IP address: a binary feature that is set to 1 if an IP address is found in a suspect email.

HTML: a binary feature that is set to 1 if the email is written in HTML.

Script: a binary feature that is set to 1 if the email contains JavaScript code.

Number of URLs: a numeric feature that represents the total number of links found in a suspect email.

Maximum number of periods in a URL: the highest number of periods or dots found in any URL in a suspect email.
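
A minimal sketch of computing these five features from raw email text follows; the regular expressions are simplified assumptions and will not cover every URL or markup form.

    import re

    URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.I)
    IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

    def toolan_carthy_features(email_text):
        lower = email_text.lower()
        urls = URL_RE.findall(email_text)
        return {
            "ip_address": int(bool(IP_RE.search(email_text))),      # binary
            "html": int("<html" in lower),                          # binary
            "script": int("<script" in lower),                      # binary
            "num_urls": len(urls),                                  # numeric
            "max_periods": max((u.count(".") for u in urls), default=0),
        }

    print(toolan_carthy_features(
        '<html><a href="http://198.51.100.7/secure.login.example.com">here</a></html>'))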


Email classification and filtering, more specifically spam vs. ham and phishing vs. spam classification, are based on content features (Juan and Erik 2012). The validity of several new statistical feature extraction techniques can also be tested.

2.2.10 Session Hijacking

Several web-based applications use some form of session management in order to create a user-friendly environment. Sessions are saved on a server and are associated with their respective users by session identifiers (IDs). Generally, these session IDs present an easy target for attackers, who, by gaining access to them, effectively hijack the users' identities (Mitja 2002).

Even if a token is correctly produced and remains unpredictable, attackers may still intercept it. One means of intercepting tokens is identifying them in log files, including browser logs, proxy server logs and web server logs. If the token is passed as a URL variable, an attacker can easily read it in the log. In order to lower the temporal window for such attacks, sessions should be kept as short as possible. Some applications supply no means for a session to expire, which allows attackers to try multiple values by trial and error before a session runs out. When a user logs out, the server can delete the token from the user's browser; however, if the user is able to send a previously used token, the server keeps accepting it (Corrado et al 2010). A testing tool can automatically perform all the steps of such a session fixation attack just as a real attacker would. It can also lighten the load on test operators by requiring only the following data (Yusuke 2010):


The parameter name that contains the session ID (SID)

Special keywords that appear in the response message only after a valid user's login

The attacker's and the victim's login information

With this information, the methodology automates the identification of vulnerabilities through packet capturing, initial inspection and attack simulation.

For a session identifier to be difficult to guess at random, it must appear random, have reasonably good entropy, and be drawn from a sufficiently large set of possible values. If the value is not satisfactorily random, analysis of the values already used can make it trivial to predict a valid one. If the set of possible values is not sufficiently large, attackers can simply attempt each feasible value on a trial basis until they find an acceptable one (Stephen 2011).
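
A minimal sketch of generating such an identifier, assuming Python's standard secrets module is acceptable for the job, is:

    import secrets

    def new_session_id() -> str:
        # 32 random bytes (~256 bits of entropy) from the OS CSPRNG, URL-safe
        # encoded: the value space is far too large for trial-and-error guessing.
        return secrets.token_urlsafe(32)

    print(new_session_id())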

The internet's open architecture and the use of open standards like the Session Initiation Protocol (SIP) leave the provision of services defenseless against known internet attacks, while introducing novel security problems based on these standards that cannot be handled with the security mechanisms currently available. Security problems have been identified and described in the SIP protocol that may lead to denial of service. Such security problems comprise security vulnerabilities in parser implementations, flooding attacks, and attacks that exploit vulnerabilities at the signaling-application level (Dimitris 2006).

The concept accomplishes all these goals by keeping track of network access from key external transit networks to one's own network. This is achieved by means of a lightweight prefix-owner-based active probing technique. Using the prefix-owner's view of access, the detection system iSPY (Zheng 2010) can distinguish between IP prefix hijacking and network failures, based on the observation that hijacking is likely to result in unreachability from topologically more diverse polluted networks.

Dynamic techniques based on runtime monitoring scrutinize the program execution and check whether this execution satisfies the security policies in question. These techniques are known to enforce the class of security policies that are based on a single program execution. Secure information flow control techniques use program analysis to find the flows of information inside a program. A particular security policy enforced by these techniques is non-interference, which states that no secret input to the program can influence publicly observed outputs (Nataliia 2013). Since it is not possible to detect such information flows by observing only one program execution, the definition of non-interference is based on two program executions. Each method recognized only a subset of the vulnerabilities (Andrew et al 2013), which, for the most part, were independent of each other.

2.2.11 Limitations of the Existing System

The literature survey presented here unveils some limitations in the existing approaches. Phishing attacks occur on the client side, and server-side phishing attacks exist alongside them; both play a major role in websites. The views of various researchers have been drawn upon for visual similarity and e-mail phishing. Researchers have not yet explored the contribution of URL verification with its various issues. However, these approaches fall short of addressing the following issues in particular.


The existing tools have limited protection; they place a load on the client browser, and there is no up-to-date maintenance of the URL database.

The common-string URL mechanism is not applicable if the phishing URL is well structured.

The credibility of the webpage content is not checked using reliable parameters, and the content features are not checked with the appropriate classification tools.

Some systems, such as those based on visual similarity and Semantic Link Network connections, have high response times. Phishers act very quickly and stealthily, so the response time for detecting phishing should be very low.

The forms of the various web pages are not validated properly, and not all types of heuristics are considered; in most systems only login forms are validated.

Finding the phishing target is the most important issue for websites, but the existing techniques do not use effective methods for finding it; they do not use inference rules and strategies.

The existing e-mail phishing tools require considerable user action, and their response time is too high. Mails are not marked as spam by default, and most of the tools work only with the Post Office Protocol.

The feature sets in the existing systems are not convincing; the major issue is the preparation of the feature sets to be used in the data mining algorithms.


Moreover, differences in the quality of phishing emails might further heighten individual phishing susceptibility, and the studies did not use a control condition.

Stemming from the lack of a control group, it is difficult to ascertain whether the effects found in a study reflect only phishing email processing or general email use behavior.

It is relatively easy to detect phishing activities with most of the

existing anti-phishing techniques if the attacks only target a

small number of well-known websites.

Web pages can take a long time to load, especially over a slow dial-up connection. Therefore, incremental detection within a certain period of time is needed while a page is still loading.

The constructs and heuristics need to be further operationalized

to enhance their rigor and appropriateness for further

parsimonious investigations.

2.3 PROBLEM STATEMENT

The problem taken up for study in this research is the need for the detection and prevention of phishing, through both client-side and server-side techniques. The client-side approach consists of three stages of filtering: URL verification, parse tree validation, and a behavioral model approach. The server-side approach consists of a one-time password mechanism, watermarking, prevention of phishing through session hijacking, and detection of e-mail phishing. The aim of the study is, first, to collect the URLs from the web and check their formats; second, to validate the contents, the source code, and all the forms; and finally, to check the e-mail contents against legitimate features.


2.4 MOTIVATION OF THIS WORK

The work presented in this thesis introduces a multi-stage filtering technique that enhances the classification of phishing and legitimate websites and finds the phishing target in order to prevent phishing. Client-side and server-side phishing attacks are analyzed, and better false negative and false positive results have been obtained. This work implements a multi-stage filtering approach to overcome the limitations of the existing systems, including processing time and similarity, and compares the model with supervised learning methods. Thus, suitable models based on different techniques are presented for better classification results and prediction of the phishing targets.

The following are some of the solutions to tackle the detection and

prevention of phishing websites and e-mails.

To identify the different formats of website URLs and the hyperlinks in the source code, using a URL validator together with blacklist and whitelist verification.

To effectively represent the phishing target and phishing websites, by constructing and validating a parse tree with the top hyperlinks of the specified domain name.

To detect phishing websites based on heuristic techniques using this novel approach, which yields higher precision, recall and F-score values than the existing clustering techniques.

To speed up the phishing detection processing time in

comparison with the existing detection methods.

To effectively prevent the usage of phishing websites based on server-side phishing prevention techniques.


To identify phishing e-mails based on the multi-tier classification model, and to avoid unnecessary banking transactions.

2.5 THESIS OBJECTIVES

In order to gain a better insight into the problem of classifying phishing and legitimate websites, systematic approaches based on different techniques have been presented. The main challenge in the existing techniques is that the detection methods have not achieved good classification results, and the processing time was high. The present thesis aims to overcome these important challenges using the Detecting and Preventing Phishing Websites (DPPWS) model. Against this background, the present study is aimed at evaluating legitimate and phishing websites and the phishing target on both the client side and the server side.

This research is aimed at the classification part of the problem, using the standard blacklist database called PhishTank for checking the URLs.

The study has the following specific research objectives:

To check the different characteristics of the URL against the updated blacklist database, and to design an XML schema to validate the hyperlinks.

To convert the top 10 results of the Google search engine into a tree structure using XML, and to perform tree traversal to validate phishing; this is confirmed with the text relation between the phishing target page and the suspicious page.


To design a heuristics model to capture user behavior with form validation; the heuristics are selected as features for a machine learning algorithm to classify a website as legitimate or phishing, and to identify the different types of attacks from combinations of heuristics.

To develop a multi-tier classification model for e-mail phishing, and to handle missing values using the K-Clustering algorithm. The output of the e-mail classifiers is used as training data for the supervised learning algorithm to classify emails as phishing or legitimate.

To implement a session hijacking prevention system that fixes the session with a dynamic session ID and stops inconsistent URLs.

To design a watermarking system with the secret code to

increase the credibility of the website during the registration

phase.

To create a one-time password mechanism with a secret code and an acknowledgment, to provide credibility during the login phase.

To achieve the above objectives, the architecture of the Detecting and Preventing Phishing Websites (DPPWS) system is proposed in Chapter 3.