CHAPTER 1
INTRODUCTION
This chapter gives a brief description of the organization in which the project was undertaken. It also specifies the development environment of the system and briefly introduces the various technologies and tools used in its development.
1.1 ORGANIZATION PROFILE
eBay is a global online marketplace where anyone can trade anything. It is a platform
for the sale of goods and services by a diverse community of individuals and businesses. It is
one of the world's premier online properties and the corporate home of a number of successful
Internet brands having a global presence in 39 markets, including the U.S. With more than 222
million registered members from all around the world and nearly 120 million active users,
eBay's core marketplace site hosts millions of retail and wholesale transactions in some 30
countries every day.
Headquartered in San Jose, California, the eBay corporation has two core businesses: a payments business, of which PayPal is the flagship brand, and a marketplaces business, of which the eBay website along with its various international versions is the flagship brand. Founded in
1995 by 28-year-old Internet entrepreneur and the current Chairman Pierre Omidyar, eBay is
headed today by the President and CEO John Donahoe.
At any time, there are approximately 100 million listings on eBay worldwide, and
approximately 6.6 million listings are added per day. eBay users trade in more than 50,000
categories including collectibles, antiques, sports memorabilia, computers, IT and office, art, toys, dolls, stamps, comics, magazines, music, pottery, glass, photography, electronics
and gemstones. It features a variety of international sites, specialty sites, categories and services
that aim to provide users with the necessary tools for efficient online trading in the auction-style
and fixed price formats. The Company has built specialized experiences for certain vertical
formats, such as Daily Deals, Fashion, Motors (vehicles, parts and accessories), and Electronics.
eBay India (formerly Baazee.com) was launched in India in 2000 and has become one of
India's leading online shopping destinations. Although eBay is a global company, eBay.in offers
a trading platform tailored to the unique needs of Indians. eBay India has a community of over 5
million registered users. These users come from over 4,306 cities in India. At any given time,
there are over 2 million live listings on eBay India (www.ebay.in) across 2,000 categories of
products in Electronics, Lifestyle, Collectibles and Media verticals.
Approximately 30,000 sellers sell on eBay India annually. eBay India can be accessed on the mobile web at m.ebay.in and via a suite of mobile apps for iPhone, iPad, Android, Windows and Nokia devices.
The collective impact of eBay on e-commerce is staggering: In the second quarter of
2013, eBay Inc. enabled more than $51 billion of Commerce Volume (ECV). ECV is the total
commerce and payment volume across all three business units consisting of Marketplaces GMV,
PayPal merchant services net total payment volume and GSI Global Ecommerce (GeC)
merchandise sales. eBay.com users worldwide trade $2,352 worth of goods on the site every
second.
1.2 TEAM PROFILE
The YELLOWSTONE project plans to address many of the issues currently present in log data processing and analysis, such as:
Multiple hops for data
Non-guaranteed messaging
Long processing times
Limited processing throughput
Data quality concern
As a part of this, the filtered log data (logged by the user and the system) is to be stored to Hadoop with low latency. This raw data can then be used to perform bot detection as appropriate. The log records contain multiple tags (key-value pairs) that are of interest for behavioral analysis. Users would want to know all log tags for a given time period in a specific environment in order to perform meaningful analysis. A web interface is provided that can be used to submit requests, view tags in tabular form and export tags in raw data or OData format.
OData is being used to expose information and make it available to a host of consumers through an HTTP endpoint. Extracted raw data can be imported into analytics tools like Tableau. With the data exposed as OData as well as CSV (Comma Separated Values) files, it can be connected to Tableau easily and analyzed right away.
The Yellowstone team focuses on overcoming the shortcomings of the current data processing by bringing down the number of nodes and reducing both the noise and the delay before the data is available for analysis. The data collection source is a logger publisher from which certain metrics are stored to a persistent location in Hadoop, as shown in the architecture diagram depicted in Figure 1.1. Moving from storing the data in filers to Hadoop has reduced the cost to a very great extent. Also, with the shift to Hadoop the data retention period has increased to 120 days. With the number of hops through which the data passes being reduced, the failure rate has decreased and the latency has shortened from over a day to around 20 minutes.
Figure 1.1 Architecture Diagram
1.3 SYSTEM CONFIGURATION
System configuration refers to the software and hardware configuration used in the
development of the project.
1.3.1 Hardware Configuration
Processor : Intel® Core(TM) i7-2600
Clock Speed : 3.40GHz
RAM : 8GB
Hard Disk : 500GB
1.3.2 Software Configuration
Platform : Windows 7 Enterprise
Language : Java
Tools : Eclipse Juno, Apache Maven, Tableau, Excel
Technologies : Apache Hadoop, Apache Pig, OData
Web Application Server : Tomcat
1.4 TOOLS AND TECHNOLOGIES USED
This section deals with the tools and technologies used in the development of the project.
1.4.1 Java
Java has been tested, refined, extended, and proven by a dedicated community of more than 6.5 million developers, the largest and most active on the planet. With its versatility, efficiency, and portability, Java has become invaluable to developers. It enables them to write software on one platform and run it on virtually any other platform; create programs that run within a web browser and web services; develop server-side applications for online forums, stores, polls, HTML forms processing, and more; combine applications or services using the Java language to create highly customized applications or services; and write powerful and efficient applications for mobile phones, remote processors, low-cost consumer products, and practically any other device with a digital heartbeat.
1.4.2 Eclipse
Eclipse is an open source community whose projects are focused on building an open development platform comprising extensible frameworks, tools and runtimes for building, deploying and managing software across the lifecycle. It is written primarily in Java. A large
and vibrant ecosystem of major technology vendors, innovative start-ups, universities, research
institutions and individuals extend, complement and support the Eclipse platform.
1.4.3 RIDE
RIDE (Raptor Integrated Development Environment) is eBay's Eclipse-based
development environment. It includes Eclipse IDE, custom eBay and open source eclipse
Plugins, the Geronimo Application Server, and IBM's JDK. It is the required environment for
developers writing Java-based applications for eBay.com/Marketplace. It is written primarily in
Java and can be used to develop applications in Java and, by means of the various plugins, in
other languages as well.
1.4.4 GitHub
GitHub is a web-based hosting service for software development projects that use the Git revision control system. It also provides social networking functionality such as feeds, followers and the social network graph to display how developers work on their versions of a repository. GitHub provides both a GUI and a command line tool (Git Bash) for accessing Git. The flagship functionality of GitHub is forking: copying a repository from one user's account to another. This enables a user who does not have write access to a project to fork it and modify it under a registered account. If the user wishes to share the changes, a notification called a pull request can be sent to the original owner, who can then, with a click of a button, merge the changes found in the fork into the original repository.
1.4.5 Apache Maven
Apache Maven is a build automation tool used primarily for Java projects. Maven
serves a similar purpose to the Apache Ant tool, but it is based on different concepts and
works in a different manner. Maven is essentially a project management and comprehension
tool and as such provides a way to help with managing Builds, Documentation, Reporting,
Dependencies, SCMs, Releases and Distribution.
The Maven project is hosted by the Apache Software Foundation, where it was
formerly part of the Jakarta Project. Maven's primary goal is to allow a developer to
comprehend the complete state of a development effort in the shortest period of time. In order to
attain this goal there are several areas of concern that Maven attempts to deal with:
Making the build process easy
Providing a uniform build system
Providing an easy way to see the health and status of a project
Providing guidelines for best-practices development
Allowing transparent migration to new features
Preventing inconsistent setups and letting developers focus their energy on writing applications
1.4.6 Tableau Software
Tableau is business intelligence software that allows anyone to connect to data in a few
clicks, then visualize and create interactive, sharable dashboards with a few more. It is easy
to use and powerful enough to satisfy even the most complex analytical problems. It is data
analysis software that lets one simply drag and drop to create interactive graphs from any data. Tableau automatically reads the field names from the data and populates the dimensions and measures areas. Measures contain numerical data, like the number of records fetched, while dimensions contain categorical data. The variables can be examined visually in just a few clicks and embedded on a website or a blog easily. The findings can be securely shared with others within seconds.
1.4.7 OData
OData is a standard for providing access to data over the internet. It has been developed
by Microsoft as an open specification. It is a Web protocol for querying and updating data that
provides a way to unlock the data and free it from silos that exist in applications today. OData
does this by applying and building upon pre-existing internet protocols, which means that web
developers can use it in their applications with a much easier learning curve.
OData provides access to just the data you want. It has scalability built right into the
protocol. Using the conventions of OData, highly specific requests can be made to get a single
data item or relationships can be quickly uncovered using the feature of linked-data.
OData works just like the web: each record returned by an OData request contains links to other
related records in exactly the same way as web pages contain hyperlinks to other web
pages. The protocol also allows publishers to tag and label data with domain specific
vocabularies.
OData also supports a very advanced query language that enables filtering and ordering
data based on any criteria. OData also has support for advanced features like
asynchronous queries and requesting just the changes since the last query. These features
dramatically improve the speed of queries against large datasets. OData also provides a uniform
way to represent metadata about the data, allowing computers to know more about the type
system, relationships and structure of the data.
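As an illustration, the sketch below shows how a client could query such a feed in Java using standard OData system query options ($top, $filter and $format). The endpoint URL and the field name used in the filter are hypothetical; they do not refer to the actual service exposed by the project.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class ODataQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical OData endpoint exposing extracted log tags.
        String serviceRoot = "http://example.com/odata/LogTags";

        // Standard OData system query options: $top limits the number of
        // records, $filter restricts them by a field value, and $format
        // requests JSON instead of the default Atom feed.
        String filter = URLEncoder.encode("pagename eq 'search'", "UTF-8");
        String query = serviceRoot + "?$top=100&$filter=" + filter + "&$format=json";

        HttpURLConnection conn = (HttpURLConnection) new URL(query).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON payload; parse or feed to a tool such as Tableau
            }
        } finally {
            conn.disconnect();
        }
    }
}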
1.4.8 Apache Hadoop
Apache Hadoop is an open source software framework that allows for the distributed processing of large data sets (structured and unstructured) across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage, with a very high degree of fault tolerance. The Hadoop framework transparently provides both reliability and data motion to
applications. Rather than relying on high-end hardware, the resiliency of these clusters
comes from the software’s ability to detect and handle failures at the application layer.
Hadoop implements a computational paradigm named MapReduce where the application
is divided into many small fragments of work, each of which can execute or re-execute on any
node in the cluster. In addition, it provides a distributed file system that spans all the nodes in a
Hadoop cluster for data storage, providing very high aggregate bandwidth across the
cluster. It links together the file systems on many local nodes to make them into one big file
system. Hadoop Distributed File System (HDFS) assumes nodes will fail, so it achieves
reliability by replicating data across multiple nodes.
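As a small illustration of the MapReduce paradigm, the sketch below counts requests per user-agent from tab-separated log lines. The column layout is assumed purely for the example and does not reflect the actual log format used in the project.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (user-agent, 1) for every log record.
public class UserAgentCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text agent = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length > 2) {            // assumed layout: user-agent in column 2
            agent.set(fields[2]);
            context.write(agent, ONE);
        }
    }
}

// Reducer: sums the request counts per user-agent.
class UserAgentCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}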
1.4.9 Apache Pig
Apache Pig is a high-level platform developed to simplify querying large data sets in Apache Hadoop using MapReduce programs. Pig is made up of two components: the first is the language itself, which is called Pig Latin, and the second is a runtime environment where Pig Latin programs are executed. Pig Latin is a data flow language which abstracts the Java MapReduce idiom into a form similar to SQL. Pig Latin defines a set of transformations on a data set such as aggregate, join and sort. Pig translates the Pig Latin script into MapReduce so that it can be executed within Hadoop. Pig Latin is sometimes extended using UDFs (User Defined Functions), which the user can write in Java or a scripting language and then call directly from the Pig Latin. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
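A minimal sketch of how such a script might be embedded in Java through Pig's PigServer API is shown below. The input path, schema and bot-keyword filter are assumptions made for the example and do not reflect the project's actual scripts.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class BotAgentPigJob {
    public static void main(String[] args) throws Exception {
        // Run locally for illustration; ExecType.MAPREDUCE would submit to the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Assumed input layout: tab-separated click logs with IP, user-agent and page.
        pig.registerQuery("logs = LOAD 'clicklogs' USING PigStorage('\\t') "
                + "AS (ip:chararray, agent:chararray, page:chararray);");

        // Keep only requests whose user-agent matches common bot keywords,
        // then count requests per user-agent.
        pig.registerQuery("bots = FILTER logs BY LOWER(agent) MATCHES '.*(bot|crawl|spider).*';");
        pig.registerQuery("grouped = GROUP bots BY agent;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group AS agent, COUNT(bots) AS hits;");

        // Pig translates the statements above into one or more MapReduce jobs.
        pig.store("counts", "bot_agent_counts");
    }
}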
1.4.10 Jenkins
Jenkins is an open source continuous integration tool written in Java. Jenkins provides
continuous integration services for software development, primarily in the Java programming
language. It is a server-based system running in a servlet container such as Apache Tomcat. It
supports Windows batch commands and shell scripts, and can execute Apache Ant and Apache
Maven based projects.
CHAPTER 2
OVERVIEW OF BOTS
This chapter details the study done on the log records for behavioral analysis and the presence of bots in the data. Further, it highlights the need to detect bots upfront in the application server.
2.1 BACKGROUND
The business metrics collected by the team must be analyzed in depth to define and observe user behavior. Analysis of various attributes provides insight into the general behavior of a user, which is highly necessary for reasons such as:
Creating a user profile that aids in understanding the types of people that visit the
website.
Predicting whether a customer is likely to purchase through our website
Improving customer satisfaction with the website
Assessing the effectiveness of advertising on a web page
Remaining competitive
Clickstream data is the recording of what a user clicks on while browsing the web. Such data is logged into the system, and certain tags such as the user agent, IP address, user account id, id of the requested page, name of the search, feedback score of the user and country id of the buyer browsing the site are looked into to get a better idea. These clickstream messages are extracted at the source publisher and copied to Hadoop with real-time support. A web interface has been provided where specifics of environment, time range, pools etc. can be provided, as depicted in Figure 2.1.
Figure 2.1 Web Interface
The tags were earlier exposed only in tabular and CSV format, which was difficult to analyze. Hence OData support, as shown in Figure 2.2, has now been added, through which an HTTP endpoint can be provided to data analysis software, helping in better understanding through visualization of the data. Tableau software has been used for data analysis.
Figure 2.2 OData Support
Instead of downloading the log files, this particular endpoint can be given as input to Tableau. The number of records is controlled by the count parameter and is capped at 100,000 records. The URL parameters index and count control the sampling of records: index specifies the record number to start with and count indicates the number of records. For instance, setting count to 100 would extract tags from the first 100 records, setting it to 1,000 would extract tags from the first 1,000 records, and so on. Specifics such as extracting only the top 100 records, or the field by which one wants to filter, can also be specified as required. The manner in which the OData link can be given to Tableau is depicted in Figure 2.3.
Figure 2.3 OData Connection
Tableau automatically reads the field names from the data and populates the dimensions
and measures areas. Measures contain numerical data like number of records fetched, while
dimensions contain categorical data, like page. The variables can be visually examined in just a
few clicks. Examples:
eBay's most frequently visited pages, distinguished by the referrer through which they are reached, are depicted in Figure 2.4
The devices through which the pages are most commonly visited are depicted in Figure 2.5
Figure 2.4 Frequently visited pages
Figure 2.5 Common Devices
2.2 NEED FOR BOT ANALYSIS AND DETECTION
The analysis is done to provide better insight into user behavior; it has to be accurate to produce good results so that improvements can be made in the future. However, it can be obscured by the presence of bots, which mislead the analysis. The major reasons to detect bots upfront are:
Bots create noise and report incorrect activity taking place on the site, hindering web traffic analysis.
There is concern about unauthorized deployment of bots for gathering business intelligence from the site.
Bots consume considerable network bandwidth at the expense of other users.
Bots could be indicative of fraudulent behavior.
2.3 BOT ANALYSIS
This section deals with the basic description of bots, the reasons for their existence in both positive and negative senses, and their relevance to eBay.
2.3.1 Bots
Bots, also known as web robots, WWW robots or Internet bots, are autonomous systems that send requests to web servers across the Internet to request resources. Typically, bots perform tasks that are both simple and structurally repetitive, at a much higher rate than would be possible for a human alone. The emergence of the WWW (World Wide Web) as an information dissemination medium, along with the availability of many web robot authoring tools, has resulted in the rapid proliferation of web robots unleashed onto the Internet today.
These robots are sent out to scour the web for various purposes. Bots can be used with either good or malicious intent. The largest use of bots is in web spidering, in which an automated script fetches, analyses and files information from web servers at many times the speed of a human. Each server can have a file called robots.txt, containing rules for the spidering of that server that the bot is supposed to obey or be removed. In addition to these uses, bots may also be implemented where a response speed faster than that of humans is required (for example, gaming bots and auction-site robots) or, less commonly, in situations where the emulation of human activity is required, for example chat bots. Bots are also being used as organization and content access applications for media delivery. Webot.com is an example of utilizing bots to deliver personal media across the web from multiple sources. In this case the bots track content updates on host computers and deliver live streaming access to a logged-in, browser-based user.
Bots can be used with malicious motives also. A malicious bot is self-propagating
malware designed to infect a host and connect back to a central server or servers that act as a
Command and Control (C&C) center for an entire network of compromised devices, or botnet.
With a botnet, attackers can launch broad-based, remote-control, flood-type attacks against their
target(s). In addition to the worm-like ability to self-propagate, bots can include the ability to log
keystrokes, gather passwords, capture and analyze packets, gather financial information,
launch DoS attacks, relay spam, and open back doors on the infected host. Once infected, these
machines may also be referred to as zombies. Web bot fraud activity currently accounts for a
large number of web accesses. Cybercriminals make money from their botnets in several ways:
Cybercriminals may use the botnets themselves to send spam, phishing, or other scams
to trick consumers into giving up their hard earned money. They may also collect
information from the bot-infected machines and use it to steal identities and run up loan and purchase charges under the user’s name.
Cybercriminals may use their botnets to create denial-of-service (DoS) attacks that
flood a legitimate service or network with a crushing volume of traffic. The volume may
severely slow down the company’s service or network’s ability to respond or it may
entirely overwhelm the company’s service or network and shut them down.
Cybercriminals may also lease their botnets to other criminals who want to send spam, run scams and phishing campaigns, steal identities, and attack legitimate websites and networks.
It only takes minutes for an unprotected, internet connected computer to be infected with
malicious software and turned into a bot, underscoring the critical need for every computer and
smartphone user to have up-to-date security software on all their devices.
The main objectives of bots are as follows:
Collecting statistics about the structure of the web
Retrieving documents for search engines, which rely on them to build their index databases
Performing site maintenance tasks such as checking for broken hyperlinks
Helping business organizations collect email addresses and online resumes
As eBay is one of the largest ecommerce platforms, there are a good number of cases where automated systems (bots) scrape the eBay pages. These range from natural searches (Google, Bing, Yahoo bots) to other use cases where the data is used for profit (or automation). The bots for natural searches bring users back to eBay and are thus considered friendly bots. The motivation for the other (unfriendly) bots, specific to eBay, probably lies in the following:
Getting competitive pricing
Influencing eBay's view/ranking of items, for example by increasing the ranking of their own items or decreasing that of their competitors.
It is observed that perhaps more than 30% of the pages hit on eBay marketplaces are hit by automated systems. It should be noted that, while the bot analysis considers automated systems, some of the unfriendly activity likely takes place with real users doing activities in a way that positively affects their feedback or ranking or negatively affects the same for their competitors.
The bot detection rules are used by different systems; they have become stale over the years and lack a feedback cycle. The main goal of the proposed system is to come up with reliable and improved bot detection logic with a feedback cycle that can be used for detecting bots in the application layer. The application layer is important, as bot detection is useful for dealing with traffic shaving when the infrastructure is not able to meet the user request demand in peak conditions, for experimentation, tracking and click stream analysis.
CHAPTER 3
BOT ANALYSIS AND DETECTION MECHANISMS
This chapter gives a detailed description of the existing systems used for bot detection. It is commonly agreed that a poor implementation can lead to severe network and server overload problems; hence guidelines are needed to ensure that the web server and the bot cooperate with each other in a way that is beneficial to both parties. Further, the need for an improved system is highlighted.
3.1 BOT DETECTION TECHNIQUES
The general bot detection techniques used in the development of the project are elaborated
in the following section.
3.1.1 Robots.txt access check
The Robot Exclusion Standard was proposed to allow web administrators to specify which parts of their websites are off-limits to visiting robots. According to this standard, whenever
a robot visits the website http://ebay.com, it should first examine the file
http://ebay.com/robots.txt. This file contains a list of access restrictions for eBay as specified by
the Web Administrators. An entry in the robots.txt file is as follows:
User-agent: *
Allow: /help/confidence/
Allow: /help/policies/
Disallow: *rt=nc
This suggests that web robots can be easily detected from sessions that access the robots.txt file. Indeed, this is a reasonably good heuristic because eBay does not provide a direct hyperlink to this file from any other HTML pages. As a result, most users are not aware of the existence of this file. However, this criterion alone is not very reliable because compliance with the Robot Exclusion Standard is voluntary, and many robots simply do not follow the proposed standard.
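A minimal sketch of this check in Java is shown below; a session is simply marked as a likely robot if any of its requests fetched /robots.txt. The session representation (a list of requested paths) is assumed for the example, and, as noted above, the signal is only indicative.

import java.util.List;

public class RobotsTxtHeuristic {
    // A session is flagged as a likely robot if any request in it fetched
    // /robots.txt, since regular browsers never request that file directly.
    public static boolean accessedRobotsTxt(List<String> requestedPaths) {
        for (String path : requestedPaths) {
            if (path.equalsIgnoreCase("/robots.txt")) {
                return true;
            }
        }
        return false;
    }
}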
3.1.2 User Agent Check
A cooperative bot must declare its identity to the web server via its user-agent field; such bots are called self-declared bots. For instance, the user agent field of a bot must contain the name of the bot or specific keywords as a part of the HTTP user-agent header, unlike the user agent field of web browsers, which often contains the name Mozilla. In practice, not all bot designers adhere to these guidelines. Some bots and even some browsers would use multiple user agent fields in the same session. For example, many bots include bot or crawler as a part of the user-agent string. This can be used to detect whether a given request is from a bot. A good number of bots are today detected as self-declared bots, with about 65% of bot activity due to these systems. Regular expressions for patterns such as spider, crawl, ktxn etc. are looked for in the user-agent string to detect whether the activity is from a bot or not.
Bot detection became more complicated as bot designers hid their identities by using the same user agent field as standard browsers. In this situation it is impossible to detect the bots based on the user agent alone.
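A minimal Java sketch of this self-declared bot check is given below. The keyword pattern follows the terms mentioned above (bot, crawl, spider, ktxn) and is only illustrative; it is not the production rule set.

import java.util.regex.Pattern;

public class SelfDeclaredBotCheck {
    // Illustrative keyword list; the actual rules are maintained elsewhere.
    private static final Pattern BOT_PATTERN =
            Pattern.compile("(bot|crawl|spider|ktxn)", Pattern.CASE_INSENSITIVE);

    public static boolean isSelfDeclaredBot(String userAgent) {
        return userAgent != null && BOT_PATTERN.matcher(userAgent).find();
    }

    public static void main(String[] args) {
        System.out.println(isSelfDeclaredBot("Mozilla/5.0 (compatible; Googlebot/2.1)")); // true
        System.out.println(isSelfDeclaredBot("Mozilla/5.0 (Windows NT 6.1; rv:24.0)"));   // false
    }
}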
3.1.3 IP Address check
Another way to detect bots is to match the IP address of a web client against those of known bots. The problem with this method is that new bots keep cropping up, the database becomes stale, and the lists provided by certain websites turn out to be obsolete as the network evolves. Another problem noticed with this approach is that the same IP address could be used both by genuine users searching the listings and by bots automatically viewing all the items on a specific page. This implies that this approach can be used only if the bot has been previously identified. A common way to detect new bots is to examine the top visiting IP addresses of the clients and manually verify the origin of each client. Unfortunately, this technique is found to be time consuming and often discovers bots that are already well known.
3.1.4 Count of HEAD requests or HTTP requests with unassigned referrers
The guidelines for bot designers also suggest that ethical bots should
Moderate their rate of information acquisition
Operate only when their server is lightly loaded
Use the HEAD request method whenever possible
The request method of an HTTP request message determines the type of action the web server should perform on the resource requested by the web client. A web server responds to a GET request by sending a message containing some header information along with a message body, which contains the requested file. In contrast, the response to a HEAD request incurs less overhead because it contains only the message header. For this reason bots are encouraged to use the HEAD request method. In principle, one can examine a session with a large number of HEAD requests to discover ethical bots.
In addition, sessions having a large number of requests with unassigned referrer fields are looked into. The referrer field is provided by the HTTP protocol to allow the web client to specify the address of the web page containing the link that the client followed in order to reach the currently requested page. For example, whenever a user requests a particular type of mobile phone by clicking on a hyperlink found in the deals page, the user's browser will generate an HTTP request message with its referrer field assigned to the deals page. Since most bots do not assign any value to their referrer fields, these values appear as “-” in the web server logs.
Both of these heuristics are found to be not entirely reliable, because non-bot sessions can sometimes generate HEAD request messages (for example, when proxy servers attempt to validate their cache contents) or HTTP messages with unassigned referrer values (for example, when a user clicks on a bookmarked page or types a new URL in the address window); hence they alone cannot be relied upon for bot detection.
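For illustration only, the sketch below computes these two per-session ratios in Java. The session representation is an assumption, and, as noted above, high values are at best weak evidence of a bot.

import java.util.List;

public class SessionRequest {
    final String method;    // e.g. "GET" or "HEAD"
    final String referrer;  // "-" when the referrer field is unassigned

    SessionRequest(String method, String referrer) {
        this.method = method;
        this.referrer = referrer;
    }
}

class HeadReferrerHeuristic {
    // Fraction of HEAD requests in a session.
    static double headRatio(List<SessionRequest> session) {
        long heads = session.stream().filter(r -> "HEAD".equalsIgnoreCase(r.method)).count();
        return session.isEmpty() ? 0.0 : (double) heads / session.size();
    }

    // Fraction of requests in a session whose referrer field is unassigned ("-").
    static double unassignedReferrerRatio(List<SessionRequest> session) {
        long blanks = session.stream().filter(r -> "-".equals(r.referrer)).count();
        return session.isEmpty() ? 0.0 : (double) blanks / session.size();
    }
}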
3.2 LOG PROCESSING AND BOT DETECTION
The Analysis team performs offline processing of user click-stream data. In this system, bot data is excluded from much of the processing, as it creates noise and would report incorrect activity taking place on the site. The user click-stream data and its analysis are important as they are the source for capturing the results of experiments via A/B tests. A/B testing is done to serve the user a page of type A or type B based on the analysis of the user's activity. The log tracking today captures the activity of bots, as the click stream data is produced via server side tracking. It is expected that conducting client side tracking will simplify this part of the problem, since many bots likely wouldn't execute any of the client side scripts. To determine the bots in a more accurate way, their behavior is observed.
3.2.1 Behavioral bots
Sometimes, the automated systems (unfriendly bots) tend to hide behind real-looking user-agents. Since these bots cannot be detected by looking for a standard user-agent string, the self-declared bot detection mechanism doesn't work for these cases. In order to detect these bots, the behavior of a set of requests is looked into to determine whether it is associated with an automated system. Automated systems are identified using a set of rules based on certain patterns. The hourly bot rules look at events within a single session, such as the dwell time of activity within the session, a large number of events in a single session, or sessions with a large number of Captcha challenges. The daily bot rules look at events across multiple sessions, such as a large number of single-event sessions per agent, only a few IPs per agent with each having a single-click session, or all sessions of a given IP having a single click.
Bot detection based on the user agent string and on sessions having a large number of events is responsible for detecting most of the bots, with the total bots detected by these rules being more than 95-98%.
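As a sketch of one of the daily rules above, the Java fragment below computes, for each user-agent, the fraction of its sessions that contain only a single event. The input shape and any cut-off applied to the resulting ratio are assumptions for the example.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SingleEventSessionRule {
    // Input: user-agent mapped to the list of per-session event counts.
    // Output: fraction of single-event sessions per user-agent; agents whose
    // sessions are almost all single-event are candidates for the daily rules.
    public static Map<String, Double> singleEventSessionRatio(
            Map<String, List<Integer>> eventsPerSessionByAgent) {
        Map<String, Double> ratio = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : eventsPerSessionByAgent.entrySet()) {
            List<Integer> sessions = e.getValue();
            if (sessions.isEmpty()) {
                continue;
            }
            long single = sessions.stream().filter(c -> c == 1).count();
            ratio.put(e.getKey(), (double) single / sessions.size());
        }
        return ratio;
    }
}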
3.3 TRACKING BOT DETECTION
The tracking system performs bot detection. The detection mechanism, as described above, is mostly based on certain user-agent patterns and IP ranges taken from a list. The bot rules used by the tracking platform are stored in a database and are used to determine A/B tests (for example, when to redirect traffic to other pools, try out variations in an application, etc.). The bot rules were created a while ago and work almost as a static list (even though there is provision to update the DB and thus not require code changes).
3.4 LIMITATION OF CURRENT BOT DETECTION
The analysis done sheds a lot of light on the areas where bots are missed out. The
drawbacks related to the major two techniques that detect most of the bots are elaborated.
3.4.1 User Agent Based Bots
User-agent based bot detection, as seen, is the primary way of detecting bots. However, the user-agent rules (which regex patterns to use) have not evolved and have been in use for many years. For example, the rules check for the msnbot user-agent, which is no longer common, but do not check for bingbot. Constantly evolving these rules in a periodic manner will help deal with such limitations. Hence, evolving the bot-detection mechanism by looking at additional rules that can be formed through bot analysis is looked into.
3.4.2 Behavioral Bots
The log processing uses behavioral bot rules, as user-agent based ones go only so far. In the log processing, many of the bot-detection rules are session oriented, and session attributes such as whether single-click sessions take place are an important aspect of this logic. Single-click sessions are the equivalent of a new guid being used every time, as cookies are not passed by the automated system (or the click-id is not incremented due to the lack of JavaScript).
These conditions are easier to check directly by looking at the unique set of session-ids for a given user agent. Similarly, the presence or absence of JavaScript can be checked directly by whether JavaScript is supported by the user-agent.
3.5 MOTIVATION FOR BOT ANALYSIS AND IMPROVED DETECTION
To overcome the mentioned drawbacks, a more robust technique is needed to identify visits by camouflaged and previously unknown bots. The current bot detection algorithms make use of the user-agent for a good number of observations. The behavioral bot rules look at IP addresses in addition to the user-agent string and guid (for sessionization) to determine whether a request is bot-generated or not. However, these do not look at additional attributes for making this determination. These additional attributes are the percentage of requests with a referrer, the percentage of requests with signed-in users, the percentage of requests with unique guids, the percentage of requests with JS enabled and the percentage of requests with a unique click id. Also, the top and bottom of the funnel can be compared to determine bots. For example, the search pages may see a different set of requesting user-agents than, say, the checkout pages. This difference in user-agents may indicate bots, as these may be hitting pages at the top of the funnel (search pages) but not those at the bottom of the funnel (checkout). These additional attributes can be used to create better and more targeted bot detection rules.
3.5.1 Lack of Feedback Mechanism
The detection of whether a request is coming from a bot is currently done based on static rules. This mechanism can be evolved by looking at bots detected in the recent past and feeding that information back to the system to determine whether a request is from a bot. Also, this feedback mechanism can be placed within the application-server processing logic so that other downstream applications and systems can benefit from this data. Capturing bots in real time has benefits for traffic shaping and can also help with reduced server infrastructure.
The feedback mechanism will also help with detecting bots faster, as the log processing today has to collect data for the past 24 hours. This generates significant delay in the current log processing pipeline, as end-of-day processing takes a long time. Having this feedback cycle will help detect many of the bots at the front-end server itself. The detection of bots at the front-end will also help with traffic shaving by reducing the amount of extra capacity that may need to be built out.
CHAPTER 4
EXPERIMENTS AND RESULTS
This chapter elaborates on the additional analysis done on the log data and the ways in which the bot detection mechanism has been substantially improved as a result.
4.1 BOT ANALYSIS SETUP
The bot analysis is done by obtaining a list of web requests and capturing the user-agent, referrer, IP address and LOG tags emitted. The data is captured via Pig based MapReduce jobs and CAL regular expression searches that capture this data from the CAL logs. The analysis is done for a period of time (several hours), first on a sampled set of data and then on the full data for the period. The sampled data (using regex options) was used to understand patterns, and the analysis was later repeated on the full set. The analysis is done for different applications like search, view item and checkout to perform a traffic comparison between them.
4.2 BOT DETECTION SYSTEM DESIGN
The proposed bot detection design includes the feedback mechanism depicted in Figure 4.1, which is lacking in the existing system. This bot detection mechanism fetches the data from Hadoop HDFS and runs periodic MapReduce jobs on it to determine the bot signature. In the feedback logic, it is important to detect the bot signature so that newer requests can be appropriately tagged. The signature would identify requests that lack a corresponding conversion. The application server can hit this service upfront to detect the bots more efficiently. To determine the bots on a real-time basis, jet stream is proposed for near-real-time analysis and generation of bot signatures.
Figure 4.1 Bot Detection System Design
4.3 CHECKOUT AND SEARCH PAGE COMPARISON
Search tends to be the top of the funnel, where users tend to start. Checkout tends to be the bottom of the funnel, where transactions are completed. As many of the bots are after competitive information, it is expected that bots may be seen on search pages and not on checkout pages. The comparison of user-agents seen by the checkout and search pages shows that, as expected, the majority of bots are detectable by comparing data between search and checkout pages. In fact, this system is able to detect more bots than the current statically maintained bot rules. The graph depicted in Figure 4.2 shows that comparing data between checkout and search pages may detect a 30-40% higher number of bots than is currently possible based on the LOG or Firemarshal based bot rules. This simply means that bot detection by looking at the top and bottom of the funnel can be a useful way of identifying user-agents that are potential bots. In fact, this comparison detects not only the user-agents covered by the current bot rules but also additional user-agents that are not found with the current static set of rules maintained by LOG and Firemarshal.
Figure 4.2 Comparison of different mechanisms
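A minimal sketch of this comparison is shown below: user-agents that generate search traffic above some threshold but never appear on checkout pages are flagged as potential bots. The input maps and the threshold are assumptions for illustration only.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FunnelComparison {
    // Flags user-agents seen frequently on search pages (top of funnel)
    // but never on checkout pages (bottom of funnel).
    // minSearchHits is an arbitrary illustrative threshold.
    public static Set<String> potentialBots(Map<String, Long> searchHitsByAgent,
                                            Map<String, Long> checkoutHitsByAgent,
                                            long minSearchHits) {
        Set<String> suspects = new HashSet<>();
        for (Map.Entry<String, Long> e : searchHitsByAgent.entrySet()) {
            long searchHits = e.getValue();
            long checkoutHits = checkoutHitsByAgent.getOrDefault(e.getKey(), 0L);
            if (searchHits >= minSearchHits && checkoutHits == 0) {
                suspects.add(e.getKey());
            }
        }
        return suspects;
    }
}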
4.4 ANALYSIS WITH ADDITIONAL ATTRIBUTES
This analysis focuses on additional attributes that can be used for bot analysis. The user-agent was primarily used for bot detection so far. The additional attributes are part of the LOG event, and the analysis has been done for the following:
Referrer: the requests that have the referrer header included
Unique IP count: the requests that have unique IP addresses
Unique user count: the requests that have users accessing the page
Unique best user count: the requests that have both user and best user present
Unique guid count: the requests that have unique guids
JS presence: the requests that have a cookie indicating that the agent has JS support
Unique click id: the requests that have a unique click id on the page
Checkout conversion: the fraction of checkout traffic for a given user-agent (%); this compares the search traffic to the checkout traffic
The analysis shows that the bots detected using this mechanism match those detected with the checkout based analysis. In particular, the number of unique users accessing the page for a given user-agent is a good way of detecting whether or not that user-agent is a bot. The strong correlation that is seen is almost uncanny and points to the following:
Bots don't often modify the user-agent
Bots tend to use old user-agent strings
Some user-agent strings come from real as well as bot traffic; additional analysis will be needed to separate them
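The fragment below sketches this unique-user-ratio check in Java; the input maps and the cut-off value are illustrative assumptions, not the values used in the actual analysis.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class UniqueUserRatioRule {
    // Ratio of distinct users to total requests for each user-agent.
    // A very small ratio (many requests but hardly any distinct users)
    // marks the user-agent as a bot candidate; 0.01 is purely illustrative.
    public static Map<String, Boolean> flagByUniqueUserRatio(
            Map<String, Long> requestsByAgent,
            Map<String, Set<String>> usersByAgent) {
        Map<String, Boolean> flagged = new HashMap<>();
        for (Map.Entry<String, Long> e : requestsByAgent.entrySet()) {
            long requests = e.getValue();
            int uniqueUsers = usersByAgent
                    .getOrDefault(e.getKey(), Collections.emptySet()).size();
            double ratio = requests == 0 ? 0.0 : (double) uniqueUsers / requests;
            flagged.put(e.getKey(), ratio < 0.01);
        }
        return flagged;
    }
}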
4.5 FINDING LOGGED IN USERS
This analysis is done to determine the percentage of users that are signed in. As the analysis with additional attributes happens to capture this detail, it is easy to analyze this pattern. The data shows that 75-80% of users are either logged in or previously logged in for search pages (with the “u” or “bu” tag). As search happens to be the top of the funnel, this number is, as expected, higher when looking at checkout and other pages. This is a useful finding, as the percentage was thought to be smaller based on earlier analysis. The 75-80% of signed-in users means that user based sessionization, tracking and experimentation are a real possibility.
4.6 ANALYSIS OVER TIME
In this analysis, the bots are determined based on the number of unique user requests coming in for a particular user-agent in a particular hour. This process is repeated over a period of time to classify the requests as bots or not. Analysis is then done to determine how often the bots change, i.e., if the same list of bots is used after a day or a week, what percentage of bots can be detected using the old list, assuming that in the first hour all the bots are detected.
The following graphs show the results of the analysis done over a period of two weeks for the search page, depicted in Figure 4.3, and the view item page, depicted in Figure 4.4.
Figure 4.3 Search page result
Figure 4.4 View Item page result
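For illustration, the fragment below computes the quantity plotted in these graphs: the percentage of the bots seen in a later period that an older bot list would still catch. The set-based representation is an assumption for the example.

import java.util.HashSet;
import java.util.Set;

public class BotListDecay {
    // Percentage of the bots detected in a later period that would already
    // have been caught by an older bot list (e.g. the list built in hour 1).
    public static double coveragePercent(Set<String> oldBotList, Set<String> currentBots) {
        if (currentBots.isEmpty()) {
            return 100.0;
        }
        Set<String> stillCovered = new HashSet<>(currentBots);
        stillCovered.retainAll(oldBotList);
        return 100.0 * stillCovered.size() / currentBots.size();
    }
}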
4.7 INSPECTION OF DATA
A further drill-down of the data obtained reveals several insights.
4.7.1 Inclusion of IP
User agents having a small ratio of unique users were initially marked as bots. These include most of the bots, but certain agents seem to have both genuine user requests and bot requests coming in. To resolve this, a further drill-down based on IP is done only for such mixed agents. This helps distinguish requests coming from bots from those of actual users when they come through different IPs, and hence improves the accuracy of the findings.
4.7.2 Signed in users
Out of the non-bot users, certain agents are mis-tagged. For such agents, the signed-in users are identified in order to classify them accurately. To improve the classification capability, the set of agents for which JavaScript is not executed is computed, which determines the signed-in users in a better way.
4.8 EVALUATION OF THE CURRENT REAL-TIME BOT DETECTION
The tracking library performs real-time, upfront bot detection at the application servers. It uses a combination of the user agent and the IP address matched against a set of known bot signatures. To evaluate the performance of this detection mechanism, i.e., to determine how much of the actual bot traffic these methods are able to flag, the logs flagged by the business intelligence listeners are used as a reference point. The business intelligence system uses an elaborate detection mechanism based on sessionization and on applying an extensive set of rules to the sessions.
There is a huge difference observed between the existing real-time detection and the business intelligence mechanism used offline to tag the bots. Some investigations were done to analyze the gap so that the bot detection mechanism can be improved.
4.8.1 Targeting pages
Certain requests are marked as bots because they do not provide a user agent field. On looking at the pages through which these requests come in, it is found that they have a null value in the user agent field because of certain AJAX calls that are made internally. The relation between the page ids and the page names is not one to one, and a lot of mapping has to be done to make sense of the pages. These observations reveal the reason why certain agents that have requests coming in from such pages are marked as bots. Hence the difference can be reduced by looking at the requested pages.
4.8.2 Inspecting bot rules
To get a better understanding of this difference, the bot rules used by the offline processing have to be scrutinized. Based on the result, the upfront detection logic can be improved by understanding and applying certain of these rules, sessionizing the requests, and detecting the bots in an improved fashion.
CHAPTER 5
CONCLUSION AND FUTURE DIRECTION
Initial findings show that roughly 31% of the bot traffic is tagged in real time. The reason behind this low percentage is the shortcomings of the existing methods. Upon analysis of the data it could be seen that some of the self-declared bots are missed by the tracking library, possibly because of deprecated bot signatures. This difference alone amounts to 157,662,363 records, i.e., an approximately 14% increase can be achieved just by adding the user agents of these self-declared bots to the bot signature set. Further analysis also revealed that, by considering the techniques analyzed above, a steep increment was observed in the upfront detection of bots. A considerable increase of around 34% was obtained by using the additional techniques that have been analyzed over the course of time. This has proved to be quite useful, as detection of bots upfront in the application server will help in reducing the load that has to be processed by the UBI downstream.
Bot detection is needed in the application server to allow different systems within the application server (for example, Firemarshal, the Tracking Platform, etc.) to use this information. This will help avoid duplication of effort on bot detection and reduce downstream processing for offline processing systems like UBI processing and Private Eyes.
The existing system was not very efficient: it was detecting a meager 31% and was missing many obvious and complex bots that were scraping the eBay pages. The feedback based mechanism will help with keeping up with the changing bot structure. In the feedback logic, it is important to detect the bot signature so that newer requests can be appropriately tagged. The signature would identify requests that lack a corresponding conversion. This can be done by periodic batch jobs (hourly or daily) and on a real-time basis. The in-depth analysis has thrown a lot of light on exactly where the existing detection was missing out and has also provided improvements which further enhanced the bot detection mechanism in the application server, and the goal of 80% bot detection is achieved.
Another future direction of the analytical research would be to sessionize the requests to obtain further insights at the application server and enhance the bot detection mechanism further. The tracking data for bots can also be split and transported separately from the non-bot data. This can be done primarily to reduce the amount of processing needed downstream and to provide higher reliability and priority for the non-bot data. The split of this data can be done downstream of the publisher so that the non-bot data can be sent on a parallel channel from the bot data.