Overview: Humans are unique creatures. Everything we do differs slightly from everyone else, even if many of these differences are so minute that they go unnoticed. Web browsing is no different: the way each person browses the web is unique to that person, both in the websites they visit and in the order in which they visit them. Wouldn't it be nice if this uniqueness were not overlooked but actually used to benefit the user's browsing experience? In this research we compare different representations of browsing histories to find which one best captures this uniqueness. Then, using machine learning algorithms, this research attempts to create a fingerprint from which a user can be identified based on their web-browsing history alone.

Representing the browsing history to the computer: Every user's history is stored in a database file whose format depends on the browser they use. These database files contain a great deal of extra data beyond just the webpages visited. For the purposes of this research, I strip all of this extra data away to create an ordered list of every webpage visited. The next step is to turn this list into a dataset the computer can use.
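
As a concrete sketch of this stripping step, the code below pulls the visited URLs, in chronological order, out of a Firefox places.sqlite history file. It is only a sketch under two assumptions: that the xerial sqlite-jdbc driver is on the classpath, and that the table and column names follow Firefox's moz_places/moz_historyvisits schema; other browsers would need their own queries.

    import java.sql.*;
    import java.util.*;

    public class HistoryExtractor {
        // Returns every visited URL from a Firefox history database,
        // ordered by visit time (oldest first).
        public static List<String> orderedVisits(String dbPath) throws SQLException {
            String sql = "SELECT p.url FROM moz_historyvisits v "
                       + "JOIN moz_places p ON p.id = v.place_id "
                       + "ORDER BY v.visit_date";
            List<String> urls = new ArrayList<>();
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:" + dbPath);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    urls.add(rs.getString(1));
                }
            }
            return urls;
        }
    }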

The simplest dataset is a table in which each column represents a different "attribute" of the data. In this case each attribute represents a webpage drawn from the set of all webpages. The last column contains the user names; this is the column that is left empty when we don't know who a browsing history belongs to. Each row contains the browsing history of one user; these rows are called instances.
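
A hypothetical sketch of such a table in ARFF, the dataset format used by the Weka library mentioned later. The sites and users here are made up, and the "?" in the last row marks a history whose owner is unknown:

    @relation browsing-history

    @attribute 'www.union.edu' {no, yes}
    @attribute 'cs.union.edu' {no, yes}
    @attribute 'www.example.com' {no, yes}
    @attribute user {Bob, Alice}

    @data
    yes, yes, no, Bob
    no, yes, yes, Alice
    yes, no, no, ?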

Finding out what makes each history unique:

In order to identify someone from their browsing history, there must be something that makes each browsing history unique. After careful analysis and a bit of common sense, I have deduced that there are three main features of each browsing history that make it unique. These features are:

1. The websites that have been visited
2. The number of times each website has been revisited
3. The order in which the websites have been visited

Manipulating the dataset to represent the history's uniqueness: Representing which webpages have been visited, and how many times each was visited, is straightforward. In a simple dataset like the one sketched above, every website the user visited is already represented. Representing the number of times each website was visited is also simple: instead of being a binary yes or no, each attribute value can be a number recording how many times the site that attribute represents was visited.

This raises another question about what it means to revisit a website, because the history stores every "webpage" visited. To address this, my research counts both webpage and website visits and creates datasets for both. For example:

Website: http://www.union.edu

Webpage: http://cs.union.edu/Poster/posterguidelines.html
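
One way to support both granularities, sketched below, is to collapse each webpage URL to its host when building the website-level datasets. The class and method names here are illustrative, not part of the research code:

    import java.net.URI;

    public class SiteCollapser {
        // Collapses a full webpage URL to its website (host), e.g.
        // "http://www.union.edu/news/index.html" -> "www.union.edu".
        public static String toWebsite(String url) {
            try {
                String host = URI.create(url).getHost();
                return host != null ? host : url;
            } catch (IllegalArgumentException e) {
                return url; // keep malformed history entries unchanged
            }
        }
    }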

Manipulating the dataset to represent the history's order: Building one dataset that contains multiple users requires every user to have the same attributes, which makes it impossible to preserve each user's visit order directly. To solve this problem I employ a technique from natural language processing (NLP) called n-grams. In NLP, n-grams group words together and help predict parts of speech.

The "n" in n-gram stands for the number of grams grouped together. A gram can be any item that exists in an ordered list; in my research, a gram is a site visit. In a tri-gram dataset, for example, the first instance might record that Bob visited Site2, then Site3, then Site5. The n-gram technique also has a second parameter: skips. A skip is the number of grams skipped before recording the next gram. A dataset for a 2-skip tri-gram would look exactly the same as the plain tri-gram dataset, except that the sites in a gram were not adjacent in the original history. For example, in the first instance, Bob would have visited Site2, then two more sites, then Site3, then two more sites, then Site5.
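
A minimal sketch of this extraction, assuming a gram is simply the visited site as a string: from the ordered visit list it emits every gram of n sites, skipping a fixed number of sites between consecutive picks, so skipNGrams(visits, 3, 0) yields plain tri-grams and skipNGrams(visits, 3, 2) yields 2-skip tri-grams.

    import java.util.*;

    public class SkipGrams {
        // Emits every (skip, n)-gram from an ordered list of visited sites.
        // step is the distance between chosen sites; span is how much of the
        // history a single gram covers.
        public static List<List<String>> skipNGrams(List<String> visits, int n, int skip) {
            int step = skip + 1;
            int span = (n - 1) * step + 1;
            List<List<String>> grams = new ArrayList<>();
            for (int start = 0; start + span <= visits.size(); start++) {
                List<String> gram = new ArrayList<>(n);
                for (int k = 0; k < n; k++) {
                    gram.add(visits.get(start + k * step));
                }
                grams.add(gram);
            }
            return grams;
        }
    }

For the visit list [Site2, SiteA, SiteB, Site3, SiteC, SiteD, Site5], skipNGrams(visits, 3, 2) produces the single gram [Site2, Site3, Site5], matching the example above.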

The Experiment:
1. Collect browsing histories from volunteers.
2. Strip the extra data out of the collected browsing histories.
3. Create datasets, which includes:
   1. A separate dataset for every combination of n-grams and skips, from a 0-skip bi-gram to a 50-skip 50-gram
   2. One of every previous dataset for both website and webpage specificity
   3. Splitting every dataset into two sets (a sketch of this split appears after this list):
      1. An 80% training set
      2. A 20% testing set
4. For every dataset, train a classifier and test it with its corresponding test set.
5. Evaluate the results to find which representation of the data yields the highest percentage of correct predictions.
6. Report on the findings.
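
A sketch of the split in step 3 using the Weka library; the file name and randomization seed are illustrative:

    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SplitData {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("trigram_skip2.arff"); // hypothetical dataset file
            data.setClassIndex(data.numAttributes() - 1);   // the user column is the class
            data.randomize(new Random(42));                 // shuffle before splitting
            int trainSize = (int) Math.round(data.numInstances() * 0.8);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);
        }
    }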

Using the created datasets: After researching different techniques, I found that learning classifiers were best suited to this identification task, for four main reasons:
1. They use simple datasets that are easily manipulated.
2. Classification is a similar task to identification.
3. A great many classifier algorithms have already been developed and are readily available through the Weka library.
4. Tools are readily available to evaluate the correctness of a classifier's results on a dataset.
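
As a sketch of steps 4 and 5, the snippet below trains one Weka classifier on a training set and reports the percentage of test instances whose user it predicts correctly. J48 is just one of the many classifiers Weka provides, chosen here for illustration:

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class TrainAndTest {
        // train and test are the 80%/20% Instances produced by the split above.
        public static double percentCorrect(Instances train, Instances test) throws Exception {
            J48 tree = new J48();                 // any Weka classifier could be swapped in
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            return eval.pctCorrect();             // % of test users identified correctly
        }
    }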