CHAPTER 1
INTRODUCTION
This chapter gives a brief description of the organization in which the project was undertaken. It also specifies the development environment of the system and briefly introduces the various technologies and tools used in its development.
1.1 ORGANIZATION PROFILE
eBay is a global online marketplace where anyone can trade anything. It is a platform
for the sale of goods and services by a diverse community of individuals and businesses. It is
one of the world's premier online properties and the corporate home of a number of successful
Internet brands having a global presence in 39 markets, including the U.S. With more than 222
million registered members from all around the world and nearly 120 million active users,
eBay's core marketplace site hosts millions of retail and wholesale transactions in some 30
countries every day.
Headquartered in San Jose, California, the eBay corporation has two core businesses: a payments business, of which PayPal is the flagship brand, and a marketplaces business, of which the eBay website along with its various international versions is the flagship brand. Founded in
1995 by 28-year-old Internet entrepreneur and the current Chairman Pierre Omidyar, eBay is
headed today by the President and CEO John Donahoe.
At any time, there are approximately 100 million listings on eBay worldwide, and
approximately 6.6 million listings are added per day. eBay users trade in more than 50,000
categories including collectibles, antiques, sports memorabilia, computers, IT and office, art, toys, dolls, stamps, comics, magazines, music, pottery, glass, photography, electronics
and gemstones. It features a variety of international sites, specialty sites, categories and services
that aim to provide users with the necessary tools for efficient online trading in the auction-style
and fixed price formats. The Company has built specialized experiences for certain vertical
formats, such as Daily Deals, Fashion, Motors (vehicles, parts and accessories), and Electronics.
eBay India (formerly Baazee.com) was launched in India in 2000 and has become one of
India's leading online shopping destinations. Although eBay is a global company, eBay.in offers
a trading platform tailored to the unique needs of Indians. eBay India has a community of over 5
million registered users. These users come from over 4,306 cities in India. At any given time,
there are over 2 million live listings on eBay India (www.ebay.in) across 2,000 categories of
products in Electronics, Lifestyle, Collectibles and Media verticals.
Approximately 30,000 sellers sell on eBay India annually. eBay India can be accessed on the mobile web at m.ebay.in and via a suite of mobile apps for iPhone, iPad, Android, Windows and Nokia devices.
The collective impact of eBay on e-commerce is staggering: In the second quarter of
2013, eBay Inc. enabled more than $51 billion of Commerce Volume (ECV). ECV is the total
commerce and payment volume across all three business units consisting of Marketplaces GMV,
PayPal merchant services net total payment volume and GSI Global Ecommerce (GeC)
merchandise sales. eBay.com users worldwide trade $2,352 worth of goods on the site every
second.
1.2 TEAM PROFILE
The YELLOWSTONE project plans to address many of the issues currently present in log data processing and analysis, such as:
Multiple hops for data
Non-guaranteed messaging
Long processing times
Limited processing throughput
Data quality concern
As a part of this, the filtered log data (logged by the user and the system) is to be stored to Hadoop with low latency. This raw data can then be used to perform bot detection as appropriate. The log records contain multiple tags (key-value pairs) that are of interest for behavioral analysis. Users would want to know all log tags for a given time period in a specific environment in order to perform meaningful analysis. A web interface is provided that can be used to submit requests, view tags in tabular form and export tags in raw data or OData format.
OData is being used to expose information and make it available to a host of consumers through an HTTP endpoint. Extracted raw data can be imported into analytics tools like Tableau. With the data exposed as OData as well as CSV (Comma Separated Values) files, it can be connected to Tableau easily and analyzed right away.
The Yellowstone team focuses on overcoming the shortcomings of the current data processing by bringing down the number of nodes and reducing both the noise and the delay before the data is available for analysis. The data collection source is a logger publisher from which certain metrics are stored to a persistent location in Hadoop, as shown in the architecture diagram depicted in Figure 1.1. Moving from storing the data in filers to Hadoop has reduced the cost to a very great extent. Also, with the shift to Hadoop the data retention period has increased to 120 days. With the number of hops through which the data passes being reduced, the failure rate has decreased and the latency has shortened from over a day to around 20 minutes.
Figure 1.1 Architecture Diagram
1.3 SYSTEM CONFIGURATION
System configuration refers to the software and hardware configuration used in the
development of the project.
1.3.1 Hardware Configuration
Processor : Intel® Core(TM) i7-2600
Clock Speed : 3.40GHz
RAM : 8GB
Hard Disk : 500GB
1.3.2 Software Configuration
Platform : Windows 7 Enterprise
Language : Java
Tools : Eclipse Juno, Apache Maven, Tableau, Excel
Technologies : Apache Hadoop, Apache Pig, OData
Web Application Server : Tomcat
1.4 TOOLS AND TECHNOLOGIES USED
This section deals with the tools and technologies used in the development of the project.
1.4.1 Java
Java has been tested, refined, extended, and proven by a dedicated community of more than 6.5 million developers, the largest and most active on the planet. With its versatility, efficiency, and portability, Java has become invaluable to developers. It enables them to write software on one platform and run it on virtually any other platform; create programs that run within a web browser and web services; develop server-side applications for online forums, stores, polls, HTML forms processing, and more; combine applications or services using the Java language to create highly customized applications or services; and write powerful and efficient applications for mobile phones, remote processors, low-cost consumer products, and practically any other device with a digital heartbeat.
1.4.2 Eclipse
Eclipse is an open source community whose projects are focused on building an open development platform comprising extensible frameworks, tools and runtimes for building, deploying and managing software across the lifecycle. It is written primarily in Java. A large
and vibrant ecosystem of major technology vendors, innovative start-ups, universities, research
institutions and individuals extend, complement and support the Eclipse platform.
1.4.3 RIDE
RIDE (Raptor Integrated Development Environment) is eBay's Eclipse-based
development environment. It includes Eclipse IDE, custom eBay and open source eclipse
Plugins, the Geronimo Application Server, and IBM's JDK. It is the required environment for
developers writing Java-based applications for eBay.com/Marketplace. It is written primarily in
Java and can be used to develop applications in Java and, by means of the various plugins, in
other languages as well.
1.4.4 GitHub
GitHub is a web-based hosting service for software development projects that use the Git revision control system. It also provides social networking functionality such as feeds, followers and the social network graph to display how developers work on their versions of a repository. GitHub provides both a GUI and a command line tool (Git Bash) for accessing Git. The flagship functionality of GitHub is forking: copying a repository from one user's account to another. This enables a user who does not have write access to a project to fork it and modify it under a registered account. If the user wishes to share the changes, a notification called a pull request can be sent to the original owner, who can then, with a click of a button, merge the changes found in the fork into the original repository.
1.4.5 Apache Maven
Apache Maven is a build automation tool used primarily for Java projects. Maven
serves a similar purpose to the Apache Ant tool, but it is based on different concepts and
works in a different manner. Maven is essentially a project management and comprehension
tool and as such provides a way to help with managing Builds, Documentation, Reporting,
Dependencies, SCMs, Releases and Distribution.
The Maven project is hosted by the Apache Software Foundation, where it was
formerly part of the Jakarta Project. Maven's primary goal is to allow a developer to
comprehend the complete state of a development effort in the shortest period of time. In order to
attain this goal there are several areas of concern that Maven attempts to deal with:
Making the build process easy
Providing a uniform build system
Providing an easy way to see the health and status of a project
Providing guidelines for best-practices development
Allowing transparent migration to new features
Preventing inconsistent setups and letting developers focus their energy on writing applications
1.4.6 Tableau Software
Tableau is business intelligence software that allows anyone to connect to data in a few
clicks, then visualize and create interactive, sharable dashboards with a few more. It is easy
to use and powerful enough to satisfy even the most complex analytical problems. It is data
analysis software that lets one simply drag and drop to create interactive graphs from any data. Tableau automatically reads the field names from the data and populates the dimensions and measures areas. Measures contain numerical data, like the number of records fetched, while dimensions contain categorical data. The variables can be examined visually in just a few clicks and embedded on a website or a blog easily. The findings can be securely shared with others within seconds.
1.4.7 OData
OData is a standard for providing access to data over the internet. It has been developed
by Microsoft as an open specification. It is a Web protocol for querying and updating data that
provides a way to unlock the data and free it from silos that exist in applications today. OData
does this by applying and building upon pre-existing internet protocols, which means that web
developers can use it in their applications with a much easier learning curve.
OData provides access to just the data you want. It has scalability built right into the
protocol. Using the conventions of OData, highly specific requests can be made to get a single
data item or relationships can be quickly uncovered using the feature of linked-data.
OData works just like the web: each record returned by an OData request contains links to other
related records in exactly the same way as web pages contain hyperlinks to other web
pages. The protocol also allows publishers to tag and label data with domain specific
vocabularies.
OData also supports a very advanced query language that enables filtering and ordering
data based on any criteria. OData also has support for advanced features like
asynchronous queries and requesting just the changes since the last query. These features
dramatically improve the speed of queries against large datasets. OData also provides a uniform
way to represent metadata about the data, allowing computers to know more about the type
system, relationships and structure of the data.
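As an illustration, the sketch below shows how a client could query such a feed in Java using standard OData system query options ($top, $filter and $format). The endpoint URL and the field name used in the filter are hypothetical; they do not refer to the actual service exposed by the project.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class ODataQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical OData endpoint exposing extracted log tags.
        String serviceRoot = "http://example.com/odata/LogTags";

        // Standard OData system query options: $top limits the number of
        // records, $filter restricts them by a field value, and $format
        // requests JSON instead of the default Atom feed.
        String filter = URLEncoder.encode("pagename eq 'search'", "UTF-8");
        String query = serviceRoot + "?$top=100&$filter=" + filter + "&$format=json";

        HttpURLConnection conn = (HttpURLConnection) new URL(query).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON payload; parse or feed to a tool such as Tableau
            }
        } finally {
            conn.disconnect();
        }
    }
}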
1.4.8 Apache Hadoop
Apache Hadoop is an open source software framework that allows for the distributed processing of large data sets (structured and unstructured) across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage, with a very high degree of fault tolerance. The Hadoop framework transparently provides both reliability and data motion to
applications. Rather than relying on high-end hardware, the resiliency of these clusters
comes from the software’s ability to detect and handle failures at the application layer.
Hadoop implements a computational paradigm named MapReduce where the application
is divided into many small fragments of work, each of which can execute or re-execute on any
node in the cluster. In addition, it provides a distributed file system that spans all the nodes in a
Hadoop cluster for data storage, providing very high aggregate bandwidth across the
cluster. It links together the file systems on many local nodes to make them into one big file
system. Hadoop Distributed File System (HDFS) assumes nodes will fail, so it achieves
reliability by replicating data across multiple nodes.
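As a small illustration of the MapReduce paradigm, the sketch below counts requests per user-agent from tab-separated log lines. The column layout is assumed purely for the example and does not reflect the actual log format used in the project.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (user-agent, 1) for every log record.
public class UserAgentCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text agent = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length > 2) {            // assumed layout: user-agent in column 2
            agent.set(fields[2]);
            context.write(agent, ONE);
        }
    }
}

// Reducer: sums the request counts per user-agent.
class UserAgentCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}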
1.4.9 Apache Pig
Apache Pig is a high-level platform developed to simplify querying large data sets in Apache Hadoop using MapReduce programs. Pig is made up of two components: the first is the language itself, which is called Pig Latin, and the second is a runtime environment where Pig Latin programs are executed. Pig Latin is a data flow language which abstracts the Java MapReduce idiom into a form similar to SQL. Pig Latin defines a set of transformations on a data set such as aggregate, join and sort. Pig translates the Pig Latin script into MapReduce so that it can be executed within Hadoop. Pig Latin is sometimes extended using UDFs (User Defined Functions), which the user can write in Java or a scripting language and then call directly from the Pig Latin. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
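A minimal sketch of how such a script might be embedded in Java through Pig's PigServer API is shown below. The input path, schema and bot-keyword filter are assumptions made for the example and do not reflect the project's actual scripts.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class BotAgentPigJob {
    public static void main(String[] args) throws Exception {
        // Run locally for illustration; ExecType.MAPREDUCE would submit to the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Assumed input layout: tab-separated click logs with IP, user-agent and page.
        pig.registerQuery("logs = LOAD 'clicklogs' USING PigStorage('\\t') "
                + "AS (ip:chararray, agent:chararray, page:chararray);");

        // Keep only requests whose user-agent matches common bot keywords,
        // then count requests per user-agent.
        pig.registerQuery("bots = FILTER logs BY LOWER(agent) MATCHES '.*(bot|crawl|spider).*';");
        pig.registerQuery("grouped = GROUP bots BY agent;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group AS agent, COUNT(bots) AS hits;");

        // Pig translates the statements above into one or more MapReduce jobs.
        pig.store("counts", "bot_agent_counts");
    }
}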
1.4.10 Jenkins
Jenkins is an open source continuous integration tool written in Java. Jenkins provides
continuous integration services for software development, primarily in the Java programming
language. It is a server-based system running in a servlet container such as Apache Tomcat. It
supports Windows batch commands and shell scripts, and can execute Apache Ant and Apache
Maven based projects.
CHAPTER 2
OVERVIEW OF BOTS
This chapter details the study done on the log records for behavioral analysis and the presence of bots in the data. Further, it highlights the need to detect bots upfront in the application server.
2.1 BACKGROUND
The business metrics collected by the team must be analyzed in depth to define and observe user behavior. Analysis of various attributes provides insight into the general behavior of a user, which is highly necessary for reasons such as:
Creating a user profile that aids in understanding the types of people that visit the
website.
Predicting whether a customer is likely to purchase through our website
Improving customer satisfaction with the website
Assessing the effectiveness of advertising on a web page
Remaining competitive
Clickstream data is the recording of what a user clicks on while browsing the web. Such data is logged into the system, and certain tags such as the user agent, IP address, user account id, id of the requested page, name of the search, feedback score of the user and country id of the buyer browsing the site are looked into to get a better idea. These clickstream messages are extracted at the source publisher and copied to Hadoop with real-time support. A web interface has been provided where specifics of environment, time range, pools etc. can be provided, as depicted in Figure 2.1.
Figure 2.1 Web Interface
The tags were earlier exposed only in tabular and CSV format, which was difficult to analyze. Hence OData support, as shown in Figure 2.2, has now been added, through which an HTTP endpoint can be provided to data analysis software, helping in better understanding through visualization of the data. Tableau software has been used for data analysis.
Figure 2.2 OData Support
Instead of downloading the log files, this particular endpoint can be given as input to Tableau. The number of records is controlled by the count parameter and is capped at 100,000 records. The URL parameters index and count control the sampling of records: index specifies the record number to start with and count indicates the number of records. For instance, setting count to 100 would extract tags from the first 100 records, setting it to 1,000 would extract tags from the first 1,000 records, and so on. Specifics such as extracting only the top 100 records, or the field by which one wants to filter, can also be specified as required. The manner in which the OData link can be given to Tableau is depicted in Figure 2.3.
Figure 2.3 OData Connection
Tableau automatically reads the field names from the data and populates the dimensions
and measures areas. Measures contain numerical data like number of records fetched, while
dimensions contain categorical data, like page. The variables can be visually examined in just a
few clicks. Examples:
eBay's most frequently visited pages, distinguished by the referrer through which they are reached, are depicted in Figure 2.4
The devices through which the pages are most commonly visited are depicted in Figure 2.5
Figure 2.4 Frequently visited pages
Figure 2.5 Common Devices
2.2 NEED FOR BOT ANALYSIS AND DETECTION
The analysis is done to provide better insight into user behavior; it has to be accurate to produce good results so that improvements can be made in the future. However, it can be obscured by the presence of bots, which mislead the analysis. The major reasons to detect bots upfront are:
Bots create noise and report incorrect activity taking place on the site, hindering web traffic analysis.
There is concern about unauthorized deployment of bots for gathering business intelligence from the site.
Bots consume considerable network bandwidth at the expense of other users.
Bots could be indicative of fraudulent behavior.
2.3 BOT ANALYSIS
This section deals with the basic description of bots, the reasons for their existence in both positive and negative senses, and their relevance to eBay.
2.3.1 Bots
Bots, also known as web robots, WWW robots or Internet bots, are autonomous systems that send requests to web servers across the Internet to request resources. Typically, bots perform tasks that are both simple and structurally repetitive, at a much higher rate than would be possible for a human alone. The emergence of the WWW (World Wide Web) as an information dissemination medium, along with the availability of many web robot authoring tools, has resulted in the rapid proliferation of web robots unleashed onto the Internet today.
These robots are sent out to scour the web for various purposes. Bots can be used with either good or malicious intent. The largest use of bots is in web spidering, in which an automated script fetches, analyses and files information from web servers at many times the speed of a human. Each server can have a file called robots.txt, containing rules for the spidering of that server that the bot is supposed to obey or be removed. In addition to these uses, bots may also be implemented where a response speed faster than that of humans is required (for example, gaming bots and auction-site robots) or, less commonly, in situations where the emulation of human activity is required, for example chat bots. Bots are also being used as organization and content access applications for media delivery. Webot.com is an example of utilizing bots to deliver personal media across the web from multiple sources. In this case the bots track content updates on host computers and deliver live streaming access to a logged-in, browser-based user.
Bots can be used with malicious motives also. A malicious bot is self-propagating
malware designed to infect a host and connect back to a central server or servers that act as a
Command and Control (C&C) center for an entire network of compromised devices, or botnet.
With a botnet, attackers can launch broad-based, remote-control, flood-type attacks against their
target(s). In addition to the worm-like ability to self-propagate, bots can include the ability to log
keystrokes, gather passwords, capture and analyze packets, gather financial information,
launch DoS attacks, relay spam, and open back doors on the infected host. Once infected, these
machines may also be referred to as zombies. Web bot fraud activity currently accounts for a
large number of web accesses. Cybercriminals make money from their botnets in several ways:
Cybercriminals may use the botnets themselves to send spam, phishing, or other scams
to trick consumers into giving up their hard earned money. They may also collect
information from the bot-infected machines and use it to steal identities and run up loan and purchase charges under the user’s name.
Cybercriminals may use their botnets to create denial-of-service (DoS) attacks that
flood a legitimate service or network with a crushing volume of traffic. The volume may
severely slow down the company’s service or network’s ability to respond or it may
entirely overwhelm the company’s service or network and shut them down.
Cybercriminals may also lease their botnets to other criminals who want to send spam, run scams and phishing campaigns, steal identities, and attack legitimate websites and networks.
It only takes minutes for an unprotected, internet connected computer to be infected with
malicious software and turned into a bot, underscoring the critical need for every computer and
smartphone user to have up-to-date security software on all their devices.
The main objectives of bots are as follows:
Collecting statistics about the structure of the web
Retrieving documents for search engines, which rely on them to build their index databases
Performing site maintenance tasks such as checking for broken hyperlinks
Helping business organizations collect email addresses and online resumes
As eBay is one of the largest ecommerce platforms, there are a good number of cases where automated systems (bots) scrape the eBay pages. These range from natural searches (Google, Bing, Yahoo bots) to other use cases where the data is used for profit (or automation). The bots for natural searches bring users back to eBay and are thus considered friendly bots. The motivation for the other (unfriendly) bots, specific to eBay, probably lies in the following:
Getting competitive pricing
Influencing eBay's view/ranking of items, for example by increasing the ranking of their own items or decreasing that of their competitors.
It is observed that perhaps more than 30% of the pages hit on eBay marketplaces are hit by automated systems. It should be noted that, while the bot analysis considers automated systems, some of the unfriendly activity likely takes place with real users doing activities in a way that positively affects their feedback or ranking or negatively affects the same for their competitors.
The bot detection rules are used by different systems; they have become stale over the years and lack a feedback cycle. The main goal of the proposed system is to come up with reliable and improved bot detection logic with a feedback cycle that can be used for detecting bots in the application layer. The application layer is important, as bot detection is useful for dealing with traffic shaving when the infrastructure is not able to meet the user request demand in peak conditions, for experimentation, tracking and click stream analysis.
CHAPTER 3
BOT ANALYSIS AND DETECTION MECHANISMS
This chapter gives a detailed description of the existing systems used for bot detection. It is commonly agreed that a poor implementation can lead to severe network and server overload problems; hence guidelines are needed to ensure that the web server and the bot cooperate with each other in a way that is beneficial to both parties. Further, the need for an improved system is highlighted.
3.1 BOT DETECTION TECHNIQUES
The general bot detection techniques used in the development of the project are elaborated
in the following section.
3.1.1 Robots.txt access check
The Robot Exclusion Standard was proposed to allow web administrators to specify which parts of their websites are off-limits to visiting robots. According to this standard, whenever
a robot visits the website http://ebay.com, it should first examine the file
http://ebay.com/robots.txt. This file contains a list of access restrictions for eBay as specified by
the Web Administrators. An entry in the robots.txt file is as follows:
User-agent: *
Allow: /help/confidence/
Allow: /help/policies/
Disallow: *rt=nc
This suggests that web robots can be easily detected from sessions that access the robots.txt file. Indeed, this is a reasonably good heuristic because eBay does not provide a direct hyperlink to this file from any other HTML pages. As a result, most users are not aware of the existence of this file. However, this criterion alone is not very reliable because compliance with the Robot Exclusion Standard is voluntary, and many robots simply do not follow the proposed standard.
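A minimal sketch of this check in Java is shown below; a session is simply marked as a likely robot if any of its requests fetched /robots.txt. The session representation (a list of requested paths) is assumed for the example, and, as noted above, the signal is only indicative.

import java.util.List;

public class RobotsTxtHeuristic {
    // A session is flagged as a likely robot if any request in it fetched
    // /robots.txt, since regular browsers never request that file directly.
    public static boolean accessedRobotsTxt(List<String> requestedPaths) {
        for (String path : requestedPaths) {
            if (path.equalsIgnoreCase("/robots.txt")) {
                return true;
            }
        }
        return false;
    }
}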
3.1.2 User Agent Check
A cooperative bot must declare its identity to the web server via its user-agent field; such bots are called self-declared bots. For instance, the user agent field of a bot must contain the name of the bot or specific keywords as a part of the HTTP user-agent header, unlike the user agent field of web browsers, which often contains the name Mozilla. In practice, not all bot designers adhere to these guidelines. Some bots and even some browsers would use multiple user agent fields in the same session. For example, many bots include bot or crawler as a part of the user-agent string. This can be used to detect whether a given request is from a bot. A good number of bots are today detected as self-declared bots, with about 65% of bot activity due to these systems. Regular expressions for patterns such as spider, crawl, ktxn etc. are looked for in the user-agent string to detect whether the activity is from a bot or not.
Bot detection became more complicated as bot designers hid their identities by using the same user agent field as standard browsers. In this situation it is impossible to detect the bots based on the user agent alone.
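A minimal Java sketch of this self-declared bot check is given below. The keyword pattern follows the terms mentioned above (bot, crawl, spider, ktxn) and is only illustrative; it is not the production rule set.

import java.util.regex.Pattern;

public class SelfDeclaredBotCheck {
    // Illustrative keyword list; the actual rules are maintained elsewhere.
    private static final Pattern BOT_PATTERN =
            Pattern.compile("(bot|crawl|spider|ktxn)", Pattern.CASE_INSENSITIVE);

    public static boolean isSelfDeclaredBot(String userAgent) {
        return userAgent != null && BOT_PATTERN.matcher(userAgent).find();
    }

    public static void main(String[] args) {
        System.out.println(isSelfDeclaredBot("Mozilla/5.0 (compatible; Googlebot/2.1)")); // true
        System.out.println(isSelfDeclaredBot("Mozilla/5.0 (Windows NT 6.1; rv:24.0)"));   // false
    }
}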
3.1.3 IP Address check
Another way to detect bots is to match the IP address of a web client against those of known bots. The problem with this method is that new bots keep cropping up, the database becomes stale, and the lists provided by certain websites turn out to be obsolete as the network evolves. Another problem noticed with this approach is that the same IP address could be used both by genuine users searching the listings and by bots automatically viewing all the items on a specific page. This implies that this approach can be used only if the bot has been previously identified. A common way to detect new bots is to examine the top visiting IP addresses of the clients and manually verify the origin of each client. Unfortunately, this technique is found to be time consuming and often discovers bots that are already well known.
3.1.4 Count of HEAD requests or HTTP requests with unassigned referrers
The guidelines for bot designers also suggest that ethical bots should
Moderate their rate of information acquisition
Operate only when their server is lightly loaded
Use the HEAD request method whenever possible
The request method of an HTTP request message determines the type of action the web server should perform on the resource requested by the web client. A web server responds to a GET request by sending a message containing some header information along with a message body, which contains the requested file. In contrast, the response to a HEAD request incurs less overhead because it contains only the message header. For this reason bots are encouraged to use the HEAD request method. In principle, one can examine a session with a large number of HEAD requests to discover ethical bots.
In addition, sessions having a large number of requests with unassigned referrer fields are looked into. The referrer field is provided by the HTTP protocol to allow the web client to specify the address of the web page containing the link that the client followed in order to reach the currently requested page. For example, whenever a user requests a particular type of mobile phone by clicking on a hyperlink found in the deals page, the user's browser will generate an HTTP request message with its referrer field assigned to the deals page. Since most bots do not assign any value to their referrer fields, these values appear as “-” in the web server logs.
Both of these heuristics are found to be not entirely reliable, because non-bot sessions can sometimes generate HEAD request messages (for example, when proxy servers attempt to validate their cache contents) or HTTP messages with unassigned referrer values (for example, when a user clicks on a bookmarked page or types a new URL in the address window); hence they alone cannot be relied upon for bot detection.
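For illustration only, the sketch below computes these two per-session ratios in Java. The session representation is an assumption, and, as noted above, high values are at best weak evidence of a bot.

import java.util.List;

public class SessionRequest {
    final String method;    // e.g. "GET" or "HEAD"
    final String referrer;  // "-" when the referrer field is unassigned

    SessionRequest(String method, String referrer) {
        this.method = method;
        this.referrer = referrer;
    }
}

class HeadReferrerHeuristic {
    // Fraction of HEAD requests in a session.
    static double headRatio(List<SessionRequest> session) {
        long heads = session.stream().filter(r -> "HEAD".equalsIgnoreCase(r.method)).count();
        return session.isEmpty() ? 0.0 : (double) heads / session.size();
    }

    // Fraction of requests in a session whose referrer field is unassigned ("-").
    static double unassignedReferrerRatio(List<SessionRequest> session) {
        long blanks = session.stream().filter(r -> "-".equals(r.referrer)).count();
        return session.isEmpty() ? 0.0 : (double) blanks / session.size();
    }
}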
3.2 LOG PROCESSING AND BOT DETECTION
The Analysis team performs offline processing of user click-stream data. In this system, bot data is excluded from much of the processing, as it creates noise and would report incorrect activity taking place on the site. The user click-stream data and its analysis are important as they are the source for capturing the results of experiments via A/B tests. A/B testing is done to serve the user a page of type A or type B based on the analysis of the user's activity. The log tracking today captures the activity of bots, as the click stream data is produced via server side tracking. It is expected that conducting client side tracking will simplify this part of the problem, since many bots likely wouldn't execute any of the client side scripts. To determine the bots in a more accurate way, their behavior is observed.
3.2.1 Behavioral bots
Sometimes, the automated systems (unfriendly bots) tend to hide behind real-looking user-agents. Since these bots cannot be detected by looking for a standard user-agent string, the self-declared bot detection mechanism doesn't work for these cases. In order to detect these bots, the behavior of a set of requests is looked into to determine whether it is associated with an automated system. Automated systems are identified using a set of rules based on certain patterns. The hourly bot rules look at events within a single session, such as the dwell time of activity within the session, a large number of events in a single session, or sessions with a large number of Captcha challenges. The daily bot rules look at events across multiple sessions, such as a large number of single-event sessions per agent, only a few IPs per agent with each having a single-click session, or all sessions of a given IP having a single click.
Bot detection based on the user agent string and on sessions having a large number of events is responsible for detecting most of the bots, with the total bots detected by these rules being more than 95-98%.
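As a sketch of one of the daily rules above, the Java fragment below computes, for each user-agent, the fraction of its sessions that contain only a single event. The input shape and any cut-off applied to the resulting ratio are assumptions for the example.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SingleEventSessionRule {
    // Input: user-agent mapped to the list of per-session event counts.
    // Output: fraction of single-event sessions per user-agent; agents whose
    // sessions are almost all single-event are candidates for the daily rules.
    public static Map<String, Double> singleEventSessionRatio(
            Map<String, List<Integer>> eventsPerSessionByAgent) {
        Map<String, Double> ratio = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : eventsPerSessionByAgent.entrySet()) {
            List<Integer> sessions = e.getValue();
            if (sessions.isEmpty()) {
                continue;
            }
            long single = sessions.stream().filter(c -> c == 1).count();
            ratio.put(e.getKey(), (double) single / sessions.size());
        }
        return ratio;
    }
}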
3.3 TRACKING BOT DETECTION
The tracking system performs bot detection. The detection mechanism, as described above, is mostly based on certain user-agent patterns and IP ranges taken from a list. The bot rules used by the tracking platform are stored in a database and are used to determine A/B tests (for example, when to redirect traffic to other pools, try out variations in an application, etc.). The bot rules were created a while ago and work almost as a static list (even though there is provision to update the DB and thus not require code changes).
3.4 LIMITATION OF CURRENT BOT DETECTION
The analysis done sheds a lot of light on the areas where bots are missed out. The
drawbacks related to the major two techniques that detect most of the bots are elaborated.
3.4.1 User Agent Based Bots
User-agent based bot detection, as seen, is the primary way of detecting bots. However, the user-agent rules (which regex patterns to use) have not evolved and have been in use for many years. For example, the rules check for the msnbot user-agent, which is no longer common, but do not check for bingbot. Constantly evolving these rules in a periodic manner will help deal with such limitations. Hence, evolving the bot-detection mechanism by looking at additional rules that can be formed through bot analysis is looked into.
3.4.2 Behavioral Bots
The log processing uses behavioral bot rules, as user-agent based ones go only so far. In the log processing, many of the bot-detection rules are session oriented, and session attributes such as whether single-click sessions take place are an important aspect of this logic. Single-click sessions are the equivalent of a new guid being used every time, as cookies are not passed by the automated system (or the click-id is not incremented due to the lack of JavaScript).
These conditions are easier to check directly by looking at the unique set of session-ids for a given user agent. Similarly, the presence or absence of JavaScript can be checked directly by whether JavaScript is supported by the user-agent.
3.5 MOTIVATION FOR BOT ANALYSIS AND IMPROVED DETECTION
To overcome the mentioned drawbacks, a more robust technique is needed to identify visits by camouflaged and previously unknown bots. The current bot detection algorithms make use of the user-agent for a good number of observations. The behavioral bot rules look at IP addresses in addition to the user-agent string and guid (for sessionization) to determine whether a request is bot-generated or not. However, these do not look at additional attributes for making this determination. These additional attributes are the percentage of requests with a referrer, the percentage of requests with signed-in users, the percentage of requests with unique guids, the percentage of requests with JS enabled and the percentage of requests with a unique click id. Also, the top and bottom of the funnel can be compared to determine bots. For example, the search pages may see a different set of requesting user-agents than, say, the checkout pages. This difference in user-agents may indicate bots, as these may be hitting pages at the top of the funnel (search pages) but not those at the bottom of the funnel (checkout). These additional attributes can be used to create better and more targeted bot detection rules.
3.5.1 Lack of Feedback Mechanism
The detection of whether a request is coming from a bot is currently done based on static rules. This mechanism can be evolved by looking at bots detected in the recent past and feeding that information back to the system to determine whether a request is from a bot. Also, this feedback mechanism can be placed within the application-server processing logic so that other downstream applications and systems can benefit from this data. Capturing bots in real time has benefits for traffic shaping and can also help with reduced server infrastructure.
The feedback mechanism will also help with detecting bots faster, as the log processing today has to collect data for the past 24 hours. This generates significant delay in the current log processing pipeline, as end-of-day processing takes a long time. Having this feedback cycle will help detect many of the bots at the front-end server itself. The detection of bots at the front-end will also help with traffic shaving by reducing the amount of extra capacity that may need to be built out.
CHAPTER 4
EXPERIMENTS AND RESULTS
This chapter elaborates on the additional analysis done on the log data and the ways in which the bot detection mechanism has been substantially improved as a result.
4.1 BOT ANALYSIS SETUP
The bot analysis is done by obtaining a list of web requests and capturing the user-agent, referrer, IP address and LOG tags emitted. The data is captured via Pig based MapReduce jobs and CAL regular expression searches that capture this data from the CAL logs. The analysis is done for a period of time (several hours), first on a sampled set of data and then on the full data for the period. The sampled data (using regex options) was used to understand patterns, and the analysis was later repeated on the full set. The analysis is done for different applications like search, view item and checkout to perform a traffic comparison between them.
4.2 BOT DETECTION SYSTEM DESIGN
The proposed bot detection design includes the feedback mechanism depicted in Figure 4.1, which is lacking in the existing system. This bot detection mechanism fetches the data from Hadoop HDFS and runs periodic MapReduce jobs on it to determine the bot signature. In the feedback logic, it is important to detect the bot signature so that newer requests can be appropriately tagged. The signature would identify requests that lack a corresponding conversion. The application server can hit this service upfront to detect the bots more efficiently. To determine the bots on a real-time basis, jet stream is proposed for near-real-time analysis and generation of bot signatures.
Figure 4.1 Bot Detection System Design
4.3 CHECKOUT AND SEARCH PAGE COMPARISON
Search tends to be the top of the funnel, where users tend to start. Checkout tends to be the bottom of the funnel, where transactions are completed. As many of the bots are after competitive information, it is expected that bots may be seen on search pages and not on checkout pages. The comparison of user-agents seen by the checkout and search pages shows that, as expected, the majority of bots are detectable by comparing data between search and checkout pages. In fact, this system is able to detect more bots than the current statically maintained bot rules. The graph depicted in Figure 4.2 shows that comparing data between checkout and search pages may detect a 30-40% higher number of bots than is currently possible based on the LOG or Firemarshal based bot rules. This simply means that bot detection by looking at the top and bottom of the funnel can be a useful way of identifying user-agents that are potential bots. In fact, this comparison detects not only the user-agents covered by the current bot rules but also additional user-agents that are not found with the current static set of rules maintained by LOG and Firemarshal.
Figure 4.2 Comparison of different mechanisms
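A minimal sketch of this comparison is shown below: user-agents that generate search traffic above some threshold but never appear on checkout pages are flagged as potential bots. The input maps and the threshold are assumptions for illustration only.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FunnelComparison {
    // Flags user-agents seen frequently on search pages (top of funnel)
    // but never on checkout pages (bottom of funnel).
    // minSearchHits is an arbitrary illustrative threshold.
    public static Set<String> potentialBots(Map<String, Long> searchHitsByAgent,
                                            Map<String, Long> checkoutHitsByAgent,
                                            long minSearchHits) {
        Set<String> suspects = new HashSet<>();
        for (Map.Entry<String, Long> e : searchHitsByAgent.entrySet()) {
            long searchHits = e.getValue();
            long checkoutHits = checkoutHitsByAgent.getOrDefault(e.getKey(), 0L);
            if (searchHits >= minSearchHits && checkoutHits == 0) {
                suspects.add(e.getKey());
            }
        }
        return suspects;
    }
}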
4.4 ANALYSIS WITH ADDITIONAL ATTRIBUTES
This analysis focuses on additional attributes that can be used for bot analysis. The user-agent was primarily used for bot detection so far. The additional attributes are part of the LOG event, and the analysis has been done for the following:
Referrer: the requests that have the referrer header included
Unique IP count: the requests that have unique IP addresses
Unique user count: the requests that have users accessing the page
Unique best user count: the requests that have both user and best user present
Unique guid count: the requests that have unique guids
JS presence: the requests that have a cookie indicating that the agent has JS support
Unique click id: the requests that have a unique click id on the page
Checkout conversion: the fraction of checkout traffic for a given user-agent (%); this compares the search traffic to the checkout traffic
The analysis shows that the bots detected using this mechanism match those detected with the checkout based analysis. In particular, the number of unique users accessing the page for a given user-agent is a good way of detecting whether or not that user-agent is a bot. The strong correlation that is seen is almost uncanny and points to the following:
Bots don't often modify the user-agent
Bots tend to use old user-agent strings
Some user-agent strings come from real as well as bot traffic; additional analysis will be needed to separate them
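The fragment below sketches this unique-user-ratio check in Java; the input maps and the cut-off value are illustrative assumptions, not the values used in the actual analysis.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class UniqueUserRatioRule {
    // Ratio of distinct users to total requests for each user-agent.
    // A very small ratio (many requests but hardly any distinct users)
    // marks the user-agent as a bot candidate; 0.01 is purely illustrative.
    public static Map<String, Boolean> flagByUniqueUserRatio(
            Map<String, Long> requestsByAgent,
            Map<String, Set<String>> usersByAgent) {
        Map<String, Boolean> flagged = new HashMap<>();
        for (Map.Entry<String, Long> e : requestsByAgent.entrySet()) {
            long requests = e.getValue();
            int uniqueUsers = usersByAgent
                    .getOrDefault(e.getKey(), Collections.emptySet()).size();
            double ratio = requests == 0 ? 0.0 : (double) uniqueUsers / requests;
            flagged.put(e.getKey(), ratio < 0.01);
        }
        return flagged;
    }
}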
4.5 FINDING LOGGED IN USERS
This analysis is done to determine the percentage of users that are signed in. As the analysis with additional attributes happens to capture this detail, it is easy to analyze this pattern. The data shows that 75-80% of users are either logged in or previously logged in for search pages (with the “u” or “bu” tag). As search happens to be the top of the funnel, this number is, as expected, higher when looking at checkout and other pages. This is a useful finding, as the percentage was thought to be smaller based on earlier analysis. The 75-80% of signed-in users means that user based sessionization, tracking and experimentation are a real possibility.
4.6 ANALYSIS OVER TIME
In this analysis, the bots are determined based on the number of unique user requests coming in for a particular user-agent in a particular hour. This process is repeated over a period of time to classify the requests as bots or not. Analysis is then done to determine how often the bots change, i.e., if the same list of bots is used after a day or a week, what percentage of bots can be detected using the old list, assuming that in the first hour all the bots are detected.
The following graphs show the results of the analysis done over a period of two weeks for the search page, depicted in Figure 4.3, and the view item page, depicted in Figure 4.4.
Figure 4.3 Search page result
Figure 4.4 View Item page result
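For illustration, the fragment below computes the quantity plotted in these graphs: the percentage of the bots seen in a later period that an older bot list would still catch. The set-based representation is an assumption for the example.

import java.util.HashSet;
import java.util.Set;

public class BotListDecay {
    // Percentage of the bots detected in a later period that would already
    // have been caught by an older bot list (e.g. the list built in hour 1).
    public static double coveragePercent(Set<String> oldBotList, Set<String> currentBots) {
        if (currentBots.isEmpty()) {
            return 100.0;
        }
        Set<String> stillCovered = new HashSet<>(currentBots);
        stillCovered.retainAll(oldBotList);
        return 100.0 * stillCovered.size() / currentBots.size();
    }
}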
4.7 INSPECTION OF DATA
A further drill-down of the data obtained reveals several insights.
4.7.1 Inclusion of IP
User agents having a small ratio of unique users were initially marked as bots. These include most of the bots, but certain agents seem to have both genuine user requests and bot requests coming in. To resolve this, a further drill-down based on IP is done only for such mixed agents. This helps distinguish requests coming from bots from those of actual users when they come through different IPs, and hence improves the accuracy of the findings.
4.7.2 Signed in users
Out of the non-bot users, certain agents are mis-tagged. For such agents, the signed-in users are identified in order to classify them accurately. To improve the classification capability, the set of agents for which JavaScript is not executed is computed, which determines the signed-in users in a better way.
4.8 EVALUATION OF THE CURRENT REAL-TIME BOT DETECTION
The tracking library performs real-time, upfront bot detection at the application servers. It uses a combination of the user agent and the IP address matched against a set of known bot signatures. To evaluate the performance of this detection mechanism, i.e., to determine how much of the actual bot traffic these methods are able to flag, the logs flagged by the business intelligence listeners are used as a reference point. The business intelligence system uses an elaborate detection mechanism based on sessionization and on applying an extensive set of rules to the sessions.
There is a huge difference observed between the existing real-time detection and the business intelligence mechanism used offline to tag the bots. Some investigations were done to analyze the gap so that the bot detection mechanism can be improved.
4.8.1 Targeting pages
Certain requests are marked as bots because they do not provide a user agent field. On looking at the pages through which these requests come in, it is found that they have a null value in the user agent field because of certain AJAX calls that are made internally. The relation between the page ids and the page names is not one to one, and a lot of mapping has to be done to make sense of the pages. These observations reveal the reason why certain agents that have requests coming in from such pages are marked as bots. Hence the difference can be reduced by looking at the requested pages.
4.8.2 Inspecting bot rules
To get a better understanding of this difference, the bot rules used by the offline processing have to be scrutinized. Based on the result, the upfront detection logic can be improved by understanding and applying certain of these rules, sessionizing the requests, and detecting the bots in an improved fashion.
CHAPTER 5
CONCLUSION AND FUTURE DIRECTION
Initial findings show that roughly 31% of the bot traffic is tagged in real time. The reason behind this low percentage is the shortcomings of the existing methods. Upon analysis of the data it could be seen that some of the self-declared bots are missed by the tracking library, possibly because of deprecated bot signatures. This difference alone amounts to 157,662,363 records, i.e., an approximately 14% increase can be achieved just by adding the user agents of these self-declared bots to the bot signature set. Further analysis also revealed that, by considering the techniques analyzed above, a steep increment was observed in the upfront detection of bots. A considerable increase of around 34% was obtained by using the additional techniques that have been analyzed over the course of time. This has proved to be quite useful, as detection of bots upfront in the application server will help in reducing the load that has to be processed by the UBI downstream.
Bot detection is needed in the application server to allow different systems within the application server (for example, Firemarshal, the Tracking Platform, etc.) to use this information. This will help avoid duplication of effort on bot detection and reduce downstream processing for offline processing systems like UBI processing and Private Eyes.
The existing system was not very efficient: it was detecting a meager 31% and was missing many obvious and complex bots that were scraping the eBay pages. The feedback based mechanism will help with keeping up with the changing bot structure. In the feedback logic, it is important to detect the bot signature so that newer requests can be appropriately tagged. The signature would identify requests that lack a corresponding conversion. This can be done by periodic batch jobs (hourly or daily) and on a real-time basis. The in-depth analysis has thrown a lot of light on exactly where the existing detection was missing out and has also provided improvements which further enhanced the bot detection mechanism in the application server, and the goal of 80% bot detection is achieved.
Another future direction of the analytical research would be to sessionize the requests to obtain further insights at the application server and enhance the bot detection mechanism further. The tracking data for bots can also be split and transported separately from the non-bot data. This can be done primarily to reduce the amount of processing needed downstream and to provide higher reliability and priority for the non-bot data. The split of this data can be done downstream of the publisher so that the non-bot data can be sent on a parallel channel from the bot data.