E-Fencing Detection: Mining Online Classified Ad Websites for Stolen Property
by
Christopher Carver
A Thesis Submitted in Partial Fulfillment
of the Requirements for the Degree of
Master of Science
in
Computer Science (MSc)
The Faculty of Business and Information Technology
University of Ontario Institute of Technology
September, 2014
©Christopher Carver, 2014
Abstract
With the emergence of e-fencing, there is a need to automate both the detection of ads selling stolen property and the reporting process for victims. This thesis presents a framework that dynamically retrieves and classifies online ads using artificial intelligence (AI) to minimize the amount of domain knowledge required. The framework evaluates these ads against known characteristics of theft and extracts new characteristics from suspicious ads. In conjunction with a reporting system, it allows users to report incidents of theft and matches those reports to previously classified ads. The framework was designed to be domain portable, allowing rapid adaptation to other domains. Experiments showed promising results: correctly classifying single- and multiple-trend datasets, revealing anomalies in price histograms, and extracting potential patterns that explain price variance. Experiments on other scam-prone domains produced unique results that contradict some fundamental assumptions about the behavior of thieves.
Keywords: e-fencing; stolen property; classification problem; genetic algorithms; rule
extraction
Acknowledgments
I would like to express my deepest gratitude to my supervisor, Dr. Xiaodong Lin, for his useful comments, patience, guidance, and engagement throughout this master's thesis. I learned many valuable lessons from him; it was an honor to work with him.
Table of Contents
Abstract ................................................................................................................................ ii
Acknowledgments............................................................................................................... iii
List of Figures ..................................................................................................................... vi
List of Abbreviations ........................................................................................................ viii
Chapter 1 Introduction ................................................................................................... 1
1.1 Background and motivation ................................................................................. 1
1.2 Objectives and methodologies ............................................................................. 4
1.3 Contributions ........................................................................................................ 6
1.4 Thesis organization .............................................................................................. 8
Chapter 2 Related Work ................................................................................................ 9
Chapter 3 E-Fencing Detection Framework ................................................................ 14
3.1 Ad Retrieval ....................................................................................................... 18
3.2 Ad Classification ................................................................................................ 19
3.2.1 AI Model ..................................................................................................... 19
3.2.2 Chromosome Representation ...................................................................... 21
3.2.3 Fitness Overview......................................................................................... 22
3.2.4 Fitness Function .......................................................................................... 26
3.2.5 Population Transition .................................................................................. 29
3.2.6 Termination Condition ................................................................................ 30
3.2.7 Data Clustering ........................................................................................... 31
3.3 Rules and Rule Database .................................................................................... 32
3.3.1 Application of Primary Attributes .............................................................. 34
3.3.2 Application of Secondary Attributes .......................................................... 37
3.3.3 Rule Extraction ........................................................................................... 40
3.4 Reporting Database ............................................................................................ 41
3.5 Experiments........................................................................................................ 44
3.5.1 Experimental Environment ......................................................................... 44
3.5.2 Experiment – Online Post Classification Accuracy .................................... 45
3.5.3 Experiment – Price Extraction and Averaging ........................................... 49
3.5.4 Experiment – Rule Extraction on Suspicious Clusters ............................... 52
3.6 Discussion .......................................................................................................... 57
3.6.1 Ad retrieval ................................................................................................. 57
3.6.2 Price Extraction........................................................................................... 57
3.6.3 Rule Extraction ........................................................................................... 59
Chapter 4 Reporting System ........................................................................................ 60
4.1 Design................................................................................................................. 60
4.2 Domain Knowledge Structure ............................................................................ 65
4.3 Searching process & Indexing Classified Posts ................................................. 67
4.4 Updating Domain Information ........................................................................... 70
4.5 Periodic re-checking user reported cases ........................................................... 73
4.6 Experiments........................................................................................................ 75
4.6.1 Experiment – Active Search using the Reporting Database ....................... 75
4.7 Discussion .......................................................................................................... 76
Chapter 5 Framework Portability and Applications in Other Domains ...................... 77
5.1 Metropass Scams ................................................................................................ 79
5.1.1 Ad retrieval ................................................................................................. 79
5.1.2 Primary and Secondary Attributes .............................................................. 79
5.1.3 Experiment .................................................................................................. 80
5.2 Rental Property Scams ....................................................................................... 81
5.2.1 Ad Retrieval ................................................................................................ 81
5.2.2 Classification............................................................................................... 82
5.2.3 Clustered Data............................................................................................. 82
5.2.4 Primary and Secondary Attributes .............................................................. 83
5.2.5 Experiment .................................................................................................. 83
5.3 Discussion .......................................................................................................... 85
Chapter 6 Conclusion and Future Work ...................................................................... 86
6.1 Conclusion.......................................................................................................... 86
6.2 Future Work ....................................................................................................... 87
References ......................................................................................................................... 89
List of Figures
Figure 2-1 - An example of a reported case of theft from [9];.......................................... 11
Figure 2-2 - An example of a reported case of theft from [10];........................................ 11
Figure 3-1 – Proposed framework overview; ................................................................... 16
Figure 3-2 – Visual representation of classification pattern sizes and their intersections. 24
Figure 3-3 - Frequency analysis of words in training post set; ......................................... 25
Figure 3-4 – Harmonic mean of F2 and F3. ....................................................................... 28
Figure 3-5 – Harmonic mean of F1 and F2. ....................................................................... 28
Figure 3-6 - Graph of passed vs. failed chromosome translations based on Equation (6)............................................................................................................................................ 29
Figure 3-7 – The average population fitness compared to the best member's fitness over 5 iterations. .................................................................................................... 30
Figure 3-8 – An example of how data clustering creates a tree structure. ........................ 31
Figure 3-9 – Price histogram of Blackberry Bold 9900 posts with an inverse Gaussian distribution of suspicion.................................................................................................... 36
Figure 3-10 - Diagram of process of the framework undergoes while determining suspicious posts................................................................................................................. 44
Figure 3-11 - Non-uniform distribution of training set posts; .......................................... 45
Figure 3-12 - Highest fitness classification patterns for three iterations; ......................... 46
Figure 3-13 - Uniform distribution of training set posts;.................................................. 47
Figure 3-14 – Example of post containing aliasing information for “Blackberry” and “BB” based on extracted classification patterns. .............................................................. 49
Figure 3-15 – Price histogram of Blackberry Bold 9900;................................................. 49
Figure 3-16 – More granular price histogram of Blackberry Bold 9900; ......................... 50
Figure 3-17 – Blackberry Bold 9900 price histogram resembles that of the function |sin(x)/x|. ........................................................................................................................... 51
Figure 3-18 – A post from a suspicious cluster that has a low price and indicates that only “cash pick up” is acceptable. ............................................................................................. 52
Figure 3-19 – Post of suspicious cluster that uses a stock photo and requires that the transaction occurs in person. ............................................................................................. 52
Figure 3-20 – Price histogram of IPhone 4s; .................................................................... 53
Figure 3-21 – An example of possible semantic analysis;................................................ 58
Figure 4-1 – Fully annotated relations between nodes of the domain information tree. .. 61
Figure 4-2 – Limited annotated relations between nodes of the domain information tree............................................................................................................................................ 61
Figure 4-3 – Displaying user input process as it traverses down the domain information tree;.................................................................................................................................... 63
Figure 4-4 – An example of the table structure and potential table input for report information attributes. ....................................................................................................... 65
Figure 4-5 – This result displays the input requirements based on which node is selected. ........................................................................................................................................... 66
Figure 4-6 – The domain knowledge tree that was used for Figure 4-5. .......................... 66
Figure 4-7 - A graph of Equation (7) displaying the confidence as the number of transitions increases. ......................................................................................................... 68
Figure 4-8 – An example of how users would select the “Other” option should their desired category not exist within the domain knowledge tree. ......................................... 72
Figure 4-9 – An example of how attribute prediction would function; ............................ 72
Figure 4-10 - Results of the search process; ..................................................................... 75
Figure 5-1- Metro pass histogram using 5% intervals. ..................................................... 80
Figure 5-2- Metropass histogram using 2.5% intervals and extended lower bound. ........ 80
Figure 5-3 - Percentage of image duplications on multiple posts;.................................... 84
List of Abbreviations
AI Artificial Intelligence
BWNT Brand New With Tags
BT Broad Term
DE Differential Evolution
NER Named Entity Recognition
NLP Natural Language Processing
NT Narrow Term
PoS Parts of Speech
RT Related Term
TF-IDF Term Frequency – Inverse Document Frequency
Chapter 1
Introduction
1.1 Background and motivation
With the recent shift towards Internet-based media, new venues for selling
goods have emerged. Businesses often maintain a web presence, and many additionally
offer ways to shop online through their sites. Consumers can sell their new or used
goods through a variety of dedicated sites such as eBay, Kijiji, and Craigslist, while also
using social media platforms such as Facebook or Twitter to advertise these items. While
in traditional media business-to-consumer and business-to-business sales accounted
for a majority of the market, with the advance of the Internet consumer-to-consumer sales
have begun to increase. Due to the rise in the perceived legitimacy of consumer-to-consumer
transactions, in addition to the maturity of the Internet, a majority of these
transactions now occur online. Unfortunately, illegal activities have also migrated to this new
medium, which provides more anonymity than traditional methods of selling stolen goods such
as flea markets or pawn shops. These traditional outlets performed, knowingly or otherwise,
the functional role of a fence, which is described as buying stolen property for the
purpose of later resale [1].
As online shopping has gained immense popularity in recent years, criminals have
also started disposing of stolen goods on the Internet using sites such as cash4gold.com,
eBay, and Craigslist. This forms a new type of fencing called E-Fencing [2], which is the
act of fencing on the Internet. Given the benefits of this new medium, it is not surprising
that many stolen goods are sold on these sites. In today's Internet marketing era there
are various reasons why criminals choose to sell their stolen goods online rather than
at flea markets or pawn shops. First, these websites can be a great place to reach large
audiences, and as a result, stolen goods are sold quickly; additionally, the large volume of
similar ads slightly obfuscates the seller due to the large variation in the amount of detail
and asking prices. Since the criminals must balance the risk of holding onto the item for a
prolonged period of time against the risk of having their price cause suspicion, the overall
risk decreases as the audience and number of similar ads increases, since there is more
potential for a sale with reduced suspicion provided the price conforms to the market's
price distribution. Second, and most importantly, it is not very hard to mask one's identity
online and remain anonymous. For example, a criminal could use a third-party IP address
via a proxy and surf the web anonymously; the criminal could also use pseudonyms
online to conceal his or her real identity. With the advent of software packages like Tor [3],
users can dynamically route traffic through a series of proxy servers, making it
extremely difficult to trace. This issue is further complicated by sites obfuscating or
aliasing users within their databases, such that the same user on different sites may show
no correlation.
However, in the past, people could sometimes recover their stolen goods by
searching local flea markets or pawn shops for their property, then reporting to the police
where their property was located. The police could then track down the criminal by the
fingerprints or identification left when the goods were brought to the shop. In a similar
fashion, there are accounts of people manually searching Craigslist in an attempt to find their
stolen goods, but this is very tedious. In these cases, people manually search
Craigslist or other consumer-to-consumer selling sites, inputting the details of
their stolen item and analyzing the search results; they compare each post's date and time
to the time of the theft and look for any unique characteristics in the images that
would identify their specific item, often relying on gut instinct. This
last aspect is both quite important and very difficult to quantify, since we do not
know what characteristics make an item appear suspicious to the subconscious. There
have been a number of moderately successful cases of users manually recovering their
stolen property, such as Jake Gillum [4], who, after discovering his bike had been stolen,
began searching online sites to see if it would show up. Many such documented
cases involved aesthetic identifiers that differentiated the item from its standard model, such as
user modifications or identifying marks of wear and tear. There are other methods of
tracking down lost or stolen property; many modern electronics such as cellphones have
tracking down lost or stolen property; many modern electronics such as cellphones have
the ability to install GPS tracking applications. However, this often must be done
proactively, and it does not give police a contact method to further investigate beyond
the physical location. There have been accounts of people contacting the police stating
that they know where their stolen phone is located, but this may not give the police
sufficient grounds to confront the person and search them. As a result, cell
phones are quite lucrative items to steal, due to their wide availability, high value, and the large
volume in which they are sold.
Sadly, there are many accounts of people who successfully find their stolen item but
encounter considerable resistance from the police department, especially in larger cities. This
resistance is understandable given the large number of reports the police receive regarding
stolen property. Often the victim will stage a meeting with the perpetrator, only to
have the police arrive hours later. For example, Jake Gillum, who tracked
down his stolen bike on Craigslist, had to wait 40 minutes for the police to arrive
while he stalled the seller. This illustrates the complexity of attempting to recover stolen
property: because of the large volume of reported cases, many police departments are simply
overwhelmed and victims are frustrated. This can result in victims feeling they are
not being helped and attempting to steal back their property against the recommendation
of the police. An example of this is Kenneth Schmidgall [5], who tracked down
his stolen iPhone but was badly beaten when attempting to confront the criminal.
Sites such as stolen911.com [6] have attempted to alleviate some of these issues
by assisting with the searching process. However, this system is restricted to searching
Craigslist and does not cross-reference other posting sites. Additionally, it does not appear
that any search processing is done to return more accurate listings related to the
user's reported case; it simply returns all listings that match the user's search criteria. This
leaves the victim and police to search and compare the results against the reported
stolen item. It should be noted that their search engine leverages the Google Custom
Search Engine to target the Craigslist site. Although Google's inference and search
logic is very powerful, it is not domain specific, nor does it leverage domain
expertise to return more accurate results. This is simply due to the structure and scope of
Google's search interface, which does not generate a domain-specific hierarchy that would
help classification or result reduction. Many issues arise because users leverage
search engines, such as posts advertising many models in order to be returned in multiple
searches.
Considering these issues, there is a need for a system that automatically
interfaces with the police and the victim, managing the victim's expectations and informing
them of the status of the investigation, while also reducing the load on police by automating the
searching process. This would reduce the likelihood of the victim
personally attempting to recover their property while increasing the police's ability to
respond to active leads in time-sensitive theft cases, where the thief is expected to
attempt to sell the item as quickly as possible.
1.2 Objectives and methodologies
The first objective of this thesis is to propose a framework that automates the
searching process victims encounter when trying to recover their stolen property. To
achieve this and simplify the problem, we broke it into two portions: a classification
portion and a comparison portion. The classification portion is responsible for gathering
and classifying online posts, while the comparison portion is responsible for gathering
users' reported cases and matching them to the classified posts. While both of these are
comprised of sub-components, discussed in greater detail under later objectives, the
classification portion has three main components handling ad retrieval and domain
classification, suspiciousness classification, and rule extraction. The comparison portion
has two main components handling user input of reported stolen items and matching
reported cases of stolen items to classified posts.
The second objective of this thesis is to design an automated approach for
classifying items being sold online. To achieve this, we leveraged an
artificial intelligence approach to identify patterns within the data clusters and arrange
these clusters in a hierarchical manner. A complete analysis of applicable AI models is
covered in a later section, but we chose to use differential evolution encoded with
keywords to extract classification patterns. Once these patterns were extracted, their
position in the hierarchy was most often determined by their length, because the extraction
process chooses narrower keywords as the classification pattern extends. Thus, with
each iteration of the classification pattern, we transition to a more descriptive
subset.
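To make the evolutionary extraction of keyword patterns concrete, the idea can be sketched as follows. This is a simplified, genetic-style illustration, not the actual differential evolution encoding or fitness function detailed in Sections 3.2.2 to 3.2.4; the sample posts, vocabulary, and the coverage-minus-false-positive fitness are invented for the example.

```python
import random

def matches(pattern, post):
    """A post matches a pattern if it contains every keyword in the pattern."""
    return all(kw in post for kw in pattern)

def fitness(pattern, positives, negatives):
    """Reward patterns that cover the target posts but not the control posts."""
    tp = sum(matches(pattern, p) for p in positives)
    fp = sum(matches(pattern, p) for p in negatives)
    return tp / len(positives) - fp / len(negatives)

def evolve(vocab, positives, negatives, pop_size=20, generations=50, seed=0):
    """Evolve keyword sets toward patterns that separate positives from negatives."""
    rng = random.Random(seed)
    # Each chromosome is a small set of keywords drawn from the vocabulary.
    pop = [frozenset(rng.sample(vocab, 2)) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda c: fitness(c, positives, negatives),
                        reverse=True)
        survivors = scored[: pop_size // 2]   # elitism: keep the fitter half
        children = []
        for parent in survivors:
            child = set(parent)
            if rng.random() < 0.5:            # mutate: swap one keyword
                child.discard(rng.choice(sorted(child)))
                child.add(rng.choice(vocab))
            children.append(frozenset(child))
        pop = survivors + children
    return max(pop, key=lambda c: fitness(c, positives, negatives))
```

Run against a handful of Blackberry posts versus unrelated control posts, the search tends to converge on patterns such as {"blackberry", "bold"}, which cover every target post and none of the controls.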
The third objective of this thesis is to design an automated approach for identifying
potentially stolen items. Although very similar to the second objective, this one is inherently
different because we are attempting to quantify suspiciousness. Drawing on some of
the techniques from law enforcement, we attempt to identify primary and secondary
attributes of suspiciousness, the most predominant being selling price. Using the price
attribute, we cluster the posts into varying price ranges and look for anomalies
within the price distribution.
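The price-based anomaly search can be illustrated with a short sketch. The bucket width and the linear below-median suspicion score are illustrative stand-ins; the framework itself shapes suspicion with an inverse Gaussian curve over the price histogram (see Figure 3-9).

```python
from collections import Counter
import statistics

def price_histogram(prices, bucket=50):
    """Bucket asking prices so anomalies in the distribution become visible."""
    return Counter((p // bucket) * bucket for p in prices)

def suspicion(price, prices):
    """Toy suspicion score: how far below the market median a price sits,
    scaled to [0, 1]. A linear stand-in for the thesis's inverse Gaussian model."""
    median = statistics.median(prices)
    if price >= median:
        return 0.0
    return (median - price) / median
```

For a cluster of posts priced near $500, a $150 listing lands in an isolated low bucket and receives a high suspicion score, which is exactly the kind of histogram anomaly the classifier flags for further analysis.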
The fourth objective of this thesis is to extract new patterns from the identified
suspicious clusters, allowing new attributes to be discovered. These patterns
either increase or decrease the suspiciousness of a post depending on how they are
applied; for example, if a pattern explains that the price is lower than expected due
to a defect or damage, then suspicion can be decreased; conversely, if a pattern
indicates that the seller only accepts cash or wants to arrange a meeting away from their
residence, then suspicion can be increased. Although a very primitive approach was taken
to demonstrate this capability, more complex approaches are discussed in a later section
and could be added in future work. We extracted new patterns using the same
classification system that was initially created, except analyzing specific suspicious
clusters, or portions of suspicious clusters, while using the remainder of the dataset as a
control. This allowed patterns unique to that price range or cluster to emerge.
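A minimal version of this cluster-versus-control comparison can be sketched with token frequencies. The lift threshold, add-one smoothing, and sample posts below are illustrative assumptions; the actual extraction reuses the classification system described above.

```python
from collections import Counter

def extract_rules(suspicious_posts, control_posts, min_lift=2.0):
    """Surface tokens markedly over-represented in the suspicious cluster
    relative to the control set. Add-one smoothing on the control counts
    avoids division by zero; returned tokens are candidate new attributes."""
    susp = Counter(w for p in suspicious_posts for w in p.lower().split())
    ctrl = Counter(w for p in control_posts for w in p.lower().split())
    n_s, n_c = sum(susp.values()), sum(ctrl.values())
    rules = {}
    for word, count in susp.items():
        lift = (count / n_s) / ((ctrl[word] + 1) / (n_c + 1))
        if lift >= min_lift:
            rules[word] = lift
    return rules
```

On a suspicious low-price cluster, terms such as "cash" surface as candidate suspicion-increasing attributes, while terms common to both sets (e.g. "pickup") are filtered out.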
The fifth objective of this thesis is to propose a framework that acts as an
interface between victims of theft and the police. This addresses the large number
of reported theft cases and the limited resources of the police, allowing some of the
load to be offloaded to an automated system. To achieve this, we had to determine
what information would most likely be requested in a police report, along with how much
practical technical information we could expect the victim to know while reducing the
chance of errors. Ultimately, this was achieved by leveraging large amounts of domain
knowledge and strict user input, such that the user can navigate down to the appropriate
domain with minimal complexity. At that point, they enter a minimal amount of
information, determined by the domain knowledge, under strict input
control parameters.
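The guided-input idea can be sketched as a walk down a small domain-knowledge tree. The categories and attribute names below are invented for illustration; the real tree structure is described in Chapter 4.

```python
def build_tree():
    """A tiny domain-knowledge tree; categories and attributes are
    hypothetical examples, not the thesis's actual schema."""
    return {
        "Electronics": {
            "Phones": {"attributes": ["brand", "model", "colour", "IMEI"]},
            "Laptops": {"attributes": ["brand", "model", "serial number"]},
        },
        "Bicycles": {"attributes": ["brand", "frame size", "colour"]},
    }

def required_fields(tree, path):
    """Walk the user's selections down the tree and return either the next
    categories to choose from or the strict input fields at a leaf node."""
    node = tree
    for choice in path:
        node = node[choice]
    return node.get("attributes", list(node))
```

At each step the user picks from a fixed list, so by the time they reach a leaf the form knows exactly which fields to request, keeping input both minimal and strictly controlled.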
The sixth objective of this thesis is to automatically match reported cases of theft
with suspicious posts found online. To achieve this, we had to analyze the data
used during the interactions between victims and the police, the interactions between
sellers and consumers of goods in an online medium, and the crossover of data
between these two interactions. This analysis allowed us to understand how to transition
from user input that is strictly controlled and heavily influenced by domain knowledge to
user input that is free-form text with varying degrees of domain-knowledge influence,
so that reported cases can be compared to the classified
posts. This was achieved by using other primary attributes such as date, time, and
location, along with classification patterns and domain knowledge, to find the best match
to a reported case.
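One way to combine the theft metadata with keyword evidence is a simple weighted score, sketched below. The field names, weighting, and cutoffs are illustrative assumptions rather than the framework's actual matching logic.

```python
from datetime import datetime

def match_score(report, post, max_days=30, max_km=50):
    """Combine theft metadata with keyword overlap into a [0, 1] score.
    Field names (stolen_on, posted_on, distance_km, keywords, text) are
    hypothetical, chosen only for this sketch."""
    days = (post["posted_on"] - report["stolen_on"]).days
    if days < 0 or days > max_days:   # posted before the theft, or too late
        return 0.0
    time_score = 1 - days / max_days  # sooner after the theft scores higher
    dist_score = max(0.0, 1 - post["distance_km"] / max_km)
    words = post["text"].lower().split()
    keywords = report["keywords"]
    kw_score = sum(k in words for k in keywords) / len(keywords)
    return (time_score + dist_score + kw_score) / 3
```

A post listed a few days after the theft, near the reported location, and matching the item's keywords scores close to 1, while a post predating the theft is discarded outright.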
1.3 Contributions
This research focuses on developing a framework that automates the search
process users or police would go through in an attempt to manually recover a stolen item.
Specifically, the main contributions of this research are threefold:
- A robust, domain-portable framework is introduced, designed to be modular and
possess extremely high domain portability, such that it can be applied to new domains
with very little modification. Modules were designed to function with very little
initial domain knowledge, while adapting over time as the domain knowledge
increases. Additionally, modules were provided with methods for harvesting new
domain knowledge or triggering domain knowledge updates
based on the data being encountered, such that the framework can handle and
identify information from new domains from both a system classification and a user
input standpoint.
- The introduction of the classification system allows posts from various sites to be
accurately merged into consistent domain classifications while extracting new domain
classifications. This system, in addition to documenting the primary and secondary
decision attributes, provides a basic implementation for these attributes. These
attributes are used when attempting to quantify the suspicious nature of a post,
which previously was unquantifiable. This not only provides an initial metric and
training set for other approaches, in the form of a publicly available dataset of online
posts with domain classifications and legitimacy quantifications, but also serves as a
dataset that is further analyzed by the rule extraction module. Rule extraction serves
to either strengthen the decision about a post's suspiciousness or to explain
why it may have been misclassified as suspicious, which increases the accuracy of the
system over time.
- The introduction of the reporting system allows users to submit reported cases of
stolen items with information similar to that provided in a police report. This
interface is expected to bridge the gap between law enforcement report databases and
online media. The method by which this is accomplished identifies the limits of the
user's ability to input information from both a logical and a technical standpoint, and
thus prompts them only for the required information in a linear fashion. This
approach allows us to extrapolate semantic information about new domains without
halting the reporting process. This research is also the first that attempts to
connect the elements of a reported stolen item to actual online posts, that is to say,
utilizing the theft metadata of date, time, and location to reduce the search results.
1.4 Thesis organization
This thesis consists of six chapters. Chapter 1 presented an introduction to the
work; the subsequent chapters are organized as follows:
Chapter 2: In this chapter, a number of related works are surveyed in the domains
of criminal psychology, real-world applications that attempt to address this
problem, and various methods of natural language processing for online text.
Chapter 3: This chapter describes the proposed framework for automated stolen
property recovery in detail, explaining how the various components of the
framework interact and how information flows through the system to achieve the
desired results. In addition, it describes the classification system in detail,
along with an analysis of the AI selection and training process.
Chapter 4: In this chapter, the reporting system is described in detail, specifically
how the domain information is leveraged during the user input process, how the
user input is restricted and controlled, and how the comparison between the user's
reported data and the classified posts is handled.
Chapter 5: This chapter presents other domains to which the framework can be
extended and the adaptations that had to be made to achieve domain portability.
The two domains explored are rental property scams and metro pass
scams, each of which presents unique domain information requirements while
also demonstrating key attributes that were previously undervalued in other
domains, displaying the variance in attribute weights amongst domains.
Chapter 6: This chapter briefly summarizes the key outcomes of the proposed
framework, and presents some suggestions for future research directions of this
work.
Chapter 2
Related Work
There have been a number of accounts of people successfully recovering their
stolen items manually, and groups such as Portland, Oregon's Burglary
Taskforce Unit can assist in this process. However, this process is far from easy and
requires a lot of time to periodically search a number of sites for stolen property.
Academically, very little work has been done to address this issue from a
technical standpoint; however, a number of academic works have approached it from a
criminology standpoint. As we discuss these works in more detail, we will leverage
some of their methodologies and logic in order to develop the proposed framework; as
such, they form the foundation for the automation of stolen property recovery.
Many of the academic works in this area, such as [7], attempt to document the
methodologies of criminals during the sale of stolen or fraudulent items in an online
medium. This information is extremely valuable, since any technical implementation will
leverage the characteristics or attributes identified in these works. A majority of
the cases we examine fall into the category of "short firm" frauds, where the timeline of the
transaction is relatively short. This should be the case for a majority of small-time
criminals due to the nature of the medium and the fact that the items are being sold to the
public; however, it should be noted that career criminals may use specialized online
sites for "long firm" frauds, where the goal is not the initial sale of the item but
subsequent fraudulent sales. We would expect to see this in large sales targeting
businesses, but this is beyond the scope of this thesis. Tradwell's work documents the two
major advantages of selling items online, or reasons why criminal activities have shifted
to sites such as eBay: items can be sold over greater distances, and
language barriers are overcome by commonly accepted acronyms or abbreviations; an
example of this would be "Brand New with Tags" being represented as BNWT. It also
should be noted that many of the interviewees saw this medium as risk free; although
many sites claim to investigate fraudulent listings, the interviewees had seen little
evidence of such action, and it is even noted that sites provide settings whereby taxation
revenues are avoided. Tradwell also
states that the criminals he interviewed had the sole objective of a healthy profit margin
and were indifferent to the legality of the venture. This fact, although obvious, does
indicate that price and profit are the most crucial elements in reducing the sale of stolen
goods online; that is to say, if we can reduce the profit margin or increase the risk such
that it is no longer viable to sell these items, this framework will have a positive impact.
Despite this research, there have been very few technical academic attempts to
combat this issue, and a majority of the time the sites themselves attempt to implement
some form of fraud detection. However, these approaches differ from site to site and there
is no empirical measurement of the success of these techniques.
Portland, Oregon's Burglary Taskforce Unit [8] documents ways that
individuals may attempt to recover their stolen property online, addressing the fact that
stolen items may often turn up on sites such as Craigslist within hours of being stolen;
thus, people attempting to manually recover their stolen property should
check very soon after the theft and consistently afterwards. Although their
recommendations are very good and will form the basis for our approach and assumptions,
this process is very tedious for the victim due to the number of sites available to sell the
item and the large volume of posts for a given item.
Sites such as stolen911.com [6] attempt to address this issue directly, allowing
users to report stolen property and search Craigslist through their site. However, from the
user's perspective it is difficult to say what form of integration and processing occurs
with Craigslist posts. Their site appears to be only a reporting mechanism for
stolen property, where the user can offer a reward should the item be recovered, allowing
major search engines such as Google to index the reported cases, which may trigger a
site's fraud detection mechanism. They also provide users with a Google Custom Search
interface for the Craigslist site, streamlining the searching process for users by restricting
it to just Craigslist and searching all Craigslist area listings. Although this is a very good
initial attempt to address the issue, the amount of information in each reported case is
inconsistent; users enter free-form text, which results in information appearing
in the incorrect field or in varying formats. An example would be the following
reports for a BlackBerry, both of which contain sufficient information; however, the structure of
the information differs between them, and the lack of domain information results in
errors. In the first example, seen in Figure 2-1, the time is included in two
fields and there are minor technical discrepancies between them. In the second example,
seen in Figure 2-2, the model color is included as part of the model.
This displays the need for a deep understanding of the domain information if we
want to automate the comparison of reported cases to online posts as well as very strict
control over the user input. This also presents a classification problem when trying to
determine which domain information should be used when classifying the post.
When looking at the format of text in online posts we quickly see that many
traditional methods of natural language processing (NLP) and named entity recognition
(NER) approaches fall short due to the imperfect nature of user generated content online.
Figure 2-1 - An example of a reported case of
theft from [9]; although it is an extremely
detailed report, information regarding the time
of the theft appears in multiple fields with
inconsistent technical information.

Figure 2-2 - An example of a reported case of
theft from [10]; containing much less detail than
its counterpart in Figure 2-1, the model
information includes the color of the phone, which
should be a separate field.
These systems are often heavily reliant on sentence structure to resolve parts of
speech (PoS) [11][12], which in turn are used for feature selection during the NLP
process [13]. Handling online text generally means dealing with many spelling and
grammar mistakes, a frequent lack of sentence structure, and many abbreviations; the lack
of sentence structure is especially true in the case of online posts. This prevents effective
PoS disambiguation due to the relatively small sentence size and domain-specific
contexts. Due to these issues, traditionally trained NLP techniques are not applicable and
would require retraining with either very simple or extremely complex rules; this would
often result in very high false positive or misclassification rates, and would require a
dedicated training set for each target domain. This unfortunately makes the approach
more of an expert system trained only in the target domain, with excessive overhead
required to train the system. Similarly, the use of the NER approaches discussed in [14]
to detect models or other keywords, which would serve as identifiable characteristics of a
post, also fails due to similar issues with grammar and sentence structure. Although this
approach is slightly more robust due to its reduced training requirements, the number of
possible abbreviations means a large number of training posts is required in order to
identify specific abbreviations; additionally, fundamental assumptions such as
capitalization and proper punctuation, which are traditionally used as boundary markers
to identify entities, are not valid for online text.
Because existing approaches cannot be directly applied to solve this issue, we
must leverage the successful elements of the manual recovery process and of the other
works discussed in order to fully automate this process. We propose a framework that
addresses these issues by automatically pulling ads from high-traffic sites such as
Craigslist, Kijiji, and other popular classified ad sites and processing them. This involves
determining what the ads are attempting to sell, gathering characteristics about the item,
and extracting useful information about trends in the markets. Posts that do not follow the
market trends or expected rules are matched against reported stolen items; for example,
if there is a larger number of posts than expected at an obscure price relative to the
normal market distribution, we would compare these posts against reported cases. This
approach differs from the other systems attempting to solve this issue because it
converges the search process for many popular classified posting sites into a single
system, while attempting to classify their posts as to the actual item being sold. The
framework dynamically clusters posts with the same classification to extract
characteristics about them; this allows the system to adapt to new items introduced to the
market, and even to new domains, with very little initial domain knowledge. Optimally,
no domain knowledge would need to be manually entered into the system over its
lifespan. This clustering also allows for dynamic rule generation to identify emerging
suspicious trends in the same manner it would identify new models. A framework
automating this process has not really been approached in the past, and because of this,
training data for identifying suspicious posts is not available. Future work in this area
would be able to leverage our system's classifications of suspicious and non-suspicious
data sets as both a training set and a benchmark for their work.
Chapter 3
E-Fencing Detection Framework
In this chapter we present the proposed framework, which is comprised of the
following major components: ad retrieval, ad classification, rule database/rule
extraction, and the reporting system. The framework as a whole can be divided into two
parts, the first being the passive identification of suspicious posts and the second being
the active searching for matches to reported stolen items. In this chapter we focus
primarily on the passive identification components, specifically ad retrieval, ad
classification, and some aspects of the rule database, for which we propose a novel
approach that analyzes classified ad websites for stolen goods. First we describe
the framework as a whole, and then describe each component of the system in detail.
The proposed framework begins by regularly crawling websites such as Kijiji,
Craigslist, etc. and fetching pages or posts. This phase can be referred to as the ad
retrieval phase and occurs periodically to maintain an up-to-date database of posts from
these websites. We then introduce a category-based information retrieval system to
extract characteristics of the item in the posting, along with all the posting metadata such
as time, price, seller, location, etc. Although websites often pre-classify their postings
into a set of manually predefined categories, this information may not always be accurate,
due to users erroneously placing items in the incorrect category. Users may even
intentionally misclassify their item to increase the number of views their post gets.
Because of the potential for incorrect classifications, all posts are processed through the
classification system to obtain the best classification within the system.
Once the posts have been roughly categorized into clusters, further analysis of the
categories can be done. This involves attempting to extract domain knowledge of the
given topic, such as the average selling price and standard deviation per model, the
models available, etc. This analysis results in two kinds of domain-specific knowledge:
identifiable characteristics and "suspicious" characteristics. An example of an identifiable
characteristic would be something that further breaks down the category of BlackBerry
into the various colors of the models, which would decrease the number of posts that
would have to be searched if someone was looking for a specific color of BlackBerry.
This knowledge is used in place of, or to further enhance, the initial domain knowledge
provided to the system. Similarly, an example of a "suspicious" characteristic would be
the extracted trend of the current market price of a model; posts selling for far less
than that price would be suspicious. This process also alleviates the need to constantly update
the system with the newest domain knowledge or market price, as the system recalibrates
the various classifications and market prices automatically. It should be noted that an
initial sanity check may have to be conducted to prevent a large number of false
ads from skewing the market trend to a lower price.
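As an illustration of the kind of "suspicious" price characteristic described above, the following sketch flags posts priced far below a cluster's market trend. This is not the thesis's implementation; the two-standard-deviation cutoff and the toy price list are illustrative assumptions.

```python
from statistics import mean, stdev

def flag_price_anomalies(prices, threshold=2.0):
    """Flag prices far below the cluster's market trend.

    Illustrative sketch: the framework derives the trend from classified
    clusters; here the "market" is simply the input list, and the
    two-sigma cutoff is an assumed choice.
    """
    mu, sigma = mean(prices), stdev(prices)
    return [p for p in prices if p < mu - threshold * sigma]

# One post priced suspiciously below the others
market = [300, 310, 295, 305, 290, 315, 300, 80]
print(flag_price_anomalies(market))  # → [80]
```

Because the mean and standard deviation are themselves pulled down by fraudulent listings, this kind of check benefits from the initial sanity check mentioned above.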
Once a category has had its key features extracted, these can be stored and
referenced in a rule database so that all posts do not need to be reanalyzed. Once a
category's features have been defined, posts that fall outside the normal distribution or
violate a rule are flagged as suspicious; the more rules violated, the higher the confidence
in the post's suspiciousness. An example would be someone selling a phone for far less
than the market average, or using a stock photo or a photo from another post that we have
already processed.
At this point the active identification process begins, utilizing the previously
derived information to match reported cases. Users submit a report for their stolen
item through the reporting system, which allows us to control the input method for the
characteristics of the item, while also allowing the system to leverage any domain
knowledge to prompt for key characteristics. This removes the need to apply any natural
language processing techniques to the user's input, since it is fixed-field input. Once the
user has submitted the information about their stolen item, the system attempts to
match it to the classified posts and return any potential matches to the authorities. The
search prioritization leverages the fact that we have identified suspicious posts, which
should therefore be searched first, but the system will continue searching non-suspicious
posts should there be insufficient confidence in previous matches. This addresses the
possibility that there may be better matches in non-suspicious posts, or that the system is
simply not calibrated to detect something that law enforcement agencies can. This
two-stage search process reduces the initial search domain by searching suspected stolen
goods first, followed by other posts. Should the system rules be accurate enough, the
searching process could be stopped after the first stage of checking suspicious posts. A
diagram of how these components interact can be seen below in Figure 3-1.
[Figure 3-1 diagram: ads from Kijiji, Craigslist, etc. flow through ad retrieval into the classifier; the clustered data and rule database separate posts into suspicious and normal sets, which the reporting system checks in order.]
Figure 3-1 – Proposed framework overview; the system first retrieves ads from various sites,
then classifies and clusters these posts, from which rules can be applied or extracted. Once
the data is classified and separated into suspicious and non-suspicious designations, it can be
searched based on user reports submitted to the reporting system. This process loops, searching
all newly classified posts, until a match is found.
To illustrate how the proposed system works, let us use the example of a cell
phone as a case study. The system retrieves all posts from classified ad sites, which
the classifier then automatically attempts to classify based on keywords; therefore,
posts containing the cellphone manufacturer RIM (Research In Motion) will be
clustered together. It should be noted that many other clusters would be formed before
this, but for the purposes of this case study we will only focus on those that form after
this initial clustering. From this point sub-clusters would form around cellphone
characteristics, such as model, color, condition, or specifications. We would expect to see
clusters for characteristics such as "Blackberry", "Bold", "32 GB", etc.
Once all classifications have been done, these clusters are then analyzed against
the rule database, checking for suspicious word strings, abnormally low prices, or other
characteristics that would lead to an item being deemed suspicious. From this point,
suspicious posts are labeled and receive a higher priority in the searching process that
follows.
At this point any reported stolen items submitted to the reporting system, where
the active component of the searching starts, are compared with the database of
posts to find the best match. Users are asked to give as much information as possible
regarding the stolen item, the time, and the location during the reporting process. These
inputs are rigidly defined to increase the accuracy of the system, either by leveraging
external domain knowledge or by later extracting the domain knowledge from the
classification clusters. Input to the system could be pulled either from a direct
interface or from various law enforcement databases for stolen property such as
Trace, America's largest law enforcement database for reported stolen property. For this
case study we will assume someone reports that their Blackberry Bold 9900 was stolen
on April 10th, 2014 around 10pm from their residence; the item's condition is moderately
used with a notable scratch on the screen. The system would check for posts that had
been classified as suspicious and as "Blackberry Bold 9900" and that were posted after the
time of theft and in close proximity to the location of theft. Matching posts would be further
refined by checking for keywords that may indicate the condition of the item or
the identifying characteristic of the scratch mark. However, it should be noted
that this secondary process would not remove candidates from the matching process but
simply increase the confidence of those that match additional characteristics, and
inversely decrease the confidence of those that display conflicting characteristics.
Matches found by the system would be returned to the police for further investigation.
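The confidence adjustment just described, where matched characteristics raise a candidate's score and conflicting ones lower it without removing the candidate, might be sketched as follows. The trait/conflict vocabulary, the weights, and the substring matching are all illustrative assumptions, not the thesis's actual scoring.

```python
def match_confidence(report_traits, post_text, base=0.5, step=0.1):
    """Raise confidence for each reported characteristic found in the post,
    lower it for each conflicting characteristic; candidates are never
    removed outright. Weights and matching are illustrative assumptions."""
    conf = base
    text = post_text.lower()
    for trait, conflicts in report_traits.items():
        if trait in text:
            conf += step
        elif any(c in text for c in conflicts):
            conf -= step
    return max(0.0, min(1.0, conf))

# Reported traits mapped to wordings that would conflict with them
report = {"scratch": ["mint", "like new"], "black": ["white"]}
post = "Blackberry Bold 9900, black, small scratch on screen"
print(match_confidence(report, post))
```

Here the post mentions both the scratch and the color, so its confidence rises above the base; a post advertising "mint" condition would instead be demoted but kept as a candidate.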
Now that we have an understanding of how the proposed system works, we will
go into detail about each of the system's components in subsections, to understand how
they function and the design considerations that can alter how they function. We will begin
by looking at how the ad retrieval process was designed and how it functions.
3.1 Ad Retrieval
The automation of ad retrieval is key to allowing a huge number of ads to be
checked, while also maintaining up-to-date average market prices for items. This involves
parsing entire sites for all postings, in the case of sites dedicated to the target domain, or
parsing specific portions of sites, in the case of sites with multiple domains such as
Craigslist. Currently all sites to be parsed must be manually defined; however,
future improvements may allow for automated site discovery.
For dedicated sites, a simple regex matching the HTML anchor tags "<a href>"
and "</a>" allows all publicly accessible links within the site to be parsed. The links can
be restricted by using relative or absolute references to the current
site domain, removing the possibility of parsing outside the site domain. It should be
noted that parsing the entire site can result in many pages that do not contain relevant
information, such as the main page, indexing/search pages, terms of service, etc.
However, this does not pose a problem due to how the classification process occurs,
which is described in detail in section 3.2; for the purposes of this section, note that
these pages will not be classified, due to their lack of characteristics, and thus will not
impact the classification or rule extraction phases.
For sites that have multiple domains, such as Craigslist and Kijiji, a similar
approach can be used to restrict the links to a subdomain of the site, or the structure of
the page and the portions to be parsed can be specified manually. In some cases,
characteristics of post pages must be manually defined, such as pages that have
no further links, to prevent cyclical parsing. An example of this would be a page that
contains a user's ad but also contains a link to the website's home page, which would result
in cyclic parsing of the site if included. This can alternatively be solved by keeping an
array of parsed URLs for a given site during the retrieval process and checking the array
to see if a page has already been retrieved.
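The retrieval steps above, anchor-tag extraction, restriction to the site's own domain, and an already-parsed check to avoid cyclic parsing, can be sketched as follows; the regex and helper names are illustrative assumptions rather than the system's actual code.

```python
import re
from urllib.parse import urljoin, urlparse

HREF_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)

def extract_site_links(html, base_url, visited):
    """Collect anchor targets from a page, keep only same-domain links,
    and skip URLs already parsed (preventing cyclic crawling)."""
    domain = urlparse(base_url).netloc
    links = []
    for href in HREF_RE.findall(html):
        url = urljoin(base_url, href)        # resolve relative references
        if urlparse(url).netloc != domain:   # stay within the site domain
            continue
        if url in visited:                   # already-parsed check
            continue
        visited.add(url)
        links.append(url)
    return links

page = '<a href="/post/123">ad</a> <a href="http://other.com/x">external</a>'
seen = set()
print(extract_site_links(page, "http://example.org/", seen))
# → ['http://example.org/post/123']; a second call on the same page returns []
```

The shared `visited` set plays the role of the parsed-URL array described above: a link such as the site's home page is collected at most once, so following it again cannot restart the crawl.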
Posts more than a few months old are filtered prior to classification to ensure that
very old posts are not parsed during the classification and price extraction processes,
preventing them from skewing the classification patterns or market prices. Sites such as
Craigslist automatically restrict retrieval to posts less than six months old; for sites that
do not limit the age of posts, they are filtered manually during this process.
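The age filter might look like the following sketch; the six-month cutoff mirrors Craigslist's behavior described above, while the post data layout is an assumption.

```python
from datetime import datetime, timedelta

def filter_recent(posts, max_age_days=180, now=None):
    """Drop posts older than roughly six months so stale ads do not skew
    classification patterns or market prices. Post layout is assumed."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return [p for p in posts if p["posted"] >= cutoff]

reference = datetime(2014, 9, 1)
posts = [
    {"id": 1, "posted": datetime(2014, 8, 20)},   # recent: kept
    {"id": 2, "posted": datetime(2013, 12, 1)},   # ~9 months old: filtered
]
print([p["id"] for p in filter_recent(posts, now=reference)])  # → [1]
```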
Once these ads have been parsed, they are saved to either the local machine or a
database for reference, should the ad be removed from the site. The name of each post is a
combination of the post ID/URL with respect to the site and a hash of the post content.
This serves two purposes: first, allowing the system to check whether the page has already
been parsed; and second, allowing the system to determine whether the page has been
modified since its last download. Now that we understand how the ad retrieval process
works, we will look at how the classification system works.
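Before moving on, the post-naming scheme just described can be sketched as follows; the exact name format and the choice of SHA-256 are illustrative assumptions.

```python
import hashlib

def post_key(site, post_id, content):
    """Build a storage name from the post ID and a hash of the post
    content, so the system can tell both whether a page was already
    parsed and whether it changed since the last download."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"{site}_{post_id}_{digest}"

k1 = post_key("kijiji", "12345", "Blackberry Bold 9900, $150")
k2 = post_key("kijiji", "12345", "Blackberry Bold 9900, $120")  # edited price
print(k1 != k2)  # same post ID, different content hash: the page was modified
```

Matching the ID prefix detects a re-parse of the same post; a differing hash suffix for the same ID signals that the post was edited since the last download.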
3.2 Ad Classification
Given the unknown number of possible classifications, the system needs to
identify classification patterns automatically with as little initial domain knowledge as
possible. Limiting the amount of initial domain knowledge required keeps the system
flexible enough to recognize sub-trends while also requiring less human intervention to
update the domain knowledge. When looking at how this could be achieved, it was clear
that an AI approach would be needed, but many factors impacted how it should be
implemented. The following sub-sections look at each of these factors and how they
were addressed.
3.2.1 AI Model
When first looking at how such a system could be trained, there is the option of a
supervised system, which would require labeled training data with explicit examples for
each classification, or an unsupervised system, which requires unlabeled training
data. Given that we do not know the number of classifications that actually exist, that a
large amount of domain knowledge is required to properly classify online posts, and the
volume of unlabeled online posts on the internet compared to the cost of labeling them,
an unsupervised approach seems much more practical for this application. Although a
supervised system would be preferred for its faster training time, the cost of labeling
training data and the rate at which domain knowledge changes make a supervised
approach impractical.
When looking at how traditional clustering algorithms work, it is clear that
they largely operate only on data that can be numerically represented. In our case this
was not possible while maintaining the relations between the keywords. Since our
classifications were going to be keyword combinations, as we are dealing with natural
language as an input, translating the data into a numerical representation would not be
possible until a classification had been done: a keyword at position n would only be
related to its surrounding data n±y, where y is the range of the surrounding data, if the
classifications were known when translating the data into numerical form. Another way to
look at this is that maintaining the linguistic characteristics we want to identify while
translating the data into another form is very difficult, either requiring knowledge of these
linguistic characteristics in the form of domain knowledge, or requiring supervised
training data so the system could derive them automatically. Due to this limitation we
had to look more specifically at which AI model would allow us to
solve the classification problem without translating the data. We chose to take an
evolutionary computation approach, since our problem resembles the traveling salesman
problem: attempting to cover as many nodes as possible with a classification while not
overlapping other classifications.
Evolutionary computation algorithms traditionally function as global
optimizers, leveraging the Darwinian processes of cross-over and mutation to alter
existing encoded data, or chromosomes. The most basic approach is the genetic algorithm,
or evolutionary algorithm, which has a population of size p whose chromosomes are
encoded with random values. Cross-over is then performed on the population based on
each member's relative fitness for the given problem. This results in members
that are closer to an optimum having a higher chance of their encoded data appearing in
future populations. Mutation has a low chance of being performed on each resulting
member in each iteration; this allows members to break out of local optima and reach the
global optimum.
Considering these limitations, an unsupervised AI approach leveraging differential
evolution to generate the classification patterns was chosen. Using differential evolution
over the traditional genetic algorithm provides the benefit of restricting child
chromosomes to a fitness value at least as high as that of their parents. This provides a
more predictable convergence to optima while keeping high-fitness chromosomes in the
population. Modifications to the transition process, covered in section 3.2.5, relax the
transitions from exclusively higher fitness to conditionally lower fitness when another
metric is higher.
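For readers unfamiliar with differential evolution, the following minimal numeric sketch shows the greedy selection rule the text relies on: a trial (child) vector replaces its parent only when its fitness is at least as high. Note this is the textbook DE/rand/1/bin form on a toy numeric objective; the thesis's chromosomes are keyword arrays, as described in section 3.2.2, not numeric vectors.

```python
import random

def differential_evolution(fitness, dim, pop_size=20, F=0.8, CR=0.9, iters=200):
    """Textbook numeric DE/rand/1/bin, shown only to illustrate the greedy
    selection step: a trial vector replaces its parent only if its fitness
    is at least as high, so high-fitness members are never lost."""
    pop = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(iters):
        for i, parent in enumerate(pop):
            # Mutation: combine three other random population members
            a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
            # Crossover: mix the mutant with the parent, gene by gene
            trial = [a[d] + F * (b[d] - c[d]) if random.random() < CR else parent[d]
                     for d in range(dim)]
            if fitness(trial) >= fitness(parent):  # greedy selection
                pop[i] = trial
    return max(pop, key=fitness)

# Toy objective with its maximum at (1, 2)
best = differential_evolution(lambda x: -((x[0] - 1) ** 2 + (x[1] - 2) ** 2), dim=2)
print([round(v, 3) for v in best])  # converges close to [1.0, 2.0]
```

The greedy `fitness(trial) >= fitness(parent)` test is the property the thesis builds on; sections 3.2.3 and onward relax exactly this rule for keyword chromosomes.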
3.2.2 Chromosome Representation
One of the most important design considerations for any AI system is how the
data is represented, in our case how the chromosome should be encoded, such that it
can be maximized appropriately. When looking at how the chromosomes could be
represented logically, it made sense to represent them as an array of keywords,
especially given the difficulties presented earlier regarding the translation of natural
language data. We chose to converge to a keyword optimum before incrementing the
chromosome array length; this allows optima to be found for each array length rather than
encoding the array length in the chromosome. The reason for this is that we do not know
the optimal array length, and encoding it within the chromosome would result in shorter
array lengths being better, or "less wrong", since there would be fewer keywords to impact
the fitness of the chromosome; because of this issue, having the chromosome length
increase each iteration results in more accurate and consistent convergence. This process
has the byproduct of producing a keyword hierarchy for the classification. The
classification process takes a "one vs. all" approach, attempting to find a single pattern
that classifies the largest number of posts within the current training set. Each iteration
increases the chromosome length until the fitness of the pattern decreases from the
previous iteration's optimum. This ensures that the classification pattern is complete and
that all words capable of belonging to that pattern are reasonably exhausted.
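The length-increment loop described above might be sketched as follows; `optimize_at_length` stands in for the differential evolution run at a fixed chromosome length and is a hypothetical helper, as is the toy fitness table.

```python
def grow_pattern(optimize_at_length, max_len=10):
    """Grow the keyword chromosome one slot at a time: converge to the best
    pattern at each array length, stopping once adding a keyword lowers the
    optimum fitness. optimize_at_length(n) is assumed to return
    (pattern, fitness) for the best n-keyword pattern."""
    best_pattern, best_fit = optimize_at_length(1)
    for n in range(2, max_len + 1):
        pattern, fit = optimize_at_length(n)
        if fit < best_fit:  # fitness dropped: the previous length was complete
            break
        best_pattern, best_fit = pattern, fit
    return best_pattern, best_fit

# Toy stand-in whose fitness peaks at three keywords
def toy(n):
    fits = {1: 5.0, 2: 7.0, 3: 9.0, 4: 6.0}
    return [f"kw{i}" for i in range(n)], fits.get(n, 0.0)

print(grow_pattern(toy))  # → (['kw0', 'kw1', 'kw2'], 9.0)
```

Each accepted length corresponds to one level of the keyword hierarchy the text mentions, with the loop terminating when a fourth keyword can only lower the optimum fitness.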
Similar to the issues discussed in [15], where text summarization tends
towards words like "the" unless comparative weighting such as TF-IDF is used, we
discovered the issue of transitional or "noisy" words being prime candidates for
classification patterns. To address this issue we introduce the concepts of a training set
and a global set. The training set is the group of data from which we want to derive
classification patterns, while the global set serves as a uniformly distributed set of various
non-target domains to dampen this noise. This allows the word frequency in the training
set to be weighed against the global set to determine whether a word is truly unique to the
target domain. This was done for two reasons: first, to remove common words that would
appear in all posts; and second, because it allows a recursive structure to further refine
classification sets. An example of this would be the subset of posts that matched a
specific keyword pattern, say "Blackberry Bold 9900", being input as the training set
while all other posts containing the keyword "Blackberry Bold" are input as the global
set. This allows more specific subcategories to be recognized.
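The weighing of training-set word coverage against the global set might look like the following sketch; the add-one smoothing, whitespace tokenization, and ratio threshold are illustrative assumptions, not the thesis's actual weighting.

```python
def domain_keywords(training_posts, global_posts, min_ratio=2.5):
    """Score each training-set word by its coverage there relative to its
    coverage in the global set; words common everywhere ("for", "sale")
    are damped, words unique to the target domain stand out."""
    def coverage(word, posts):
        return sum(1 for p in posts if word in p.lower().split())
    vocab = {w for p in training_posts for w in p.lower().split()}
    scores = {}
    for w in vocab:
        train = coverage(w, training_posts) / len(training_posts)
        glob = (coverage(w, global_posts) + 1) / (len(global_posts) + 1)  # smoothed
        if train / glob >= min_ratio:
            scores[w] = train / glob
    return scores

training = ["blackberry bold 9900 for sale", "mint blackberry bold", "blackberry 9900 cheap"]
global_set = ["iphone 5 for sale", "couch for sale cheap", "used bike for sale"]
print(sorted(domain_keywords(training, global_set)))  # → ['9900', 'blackberry', 'bold']
```

Words such as "for" and "sale" appear in both sets and are damped below the threshold, while the domain-specific terms survive; feeding the surviving subset back in as a new training set gives the recursive refinement described above.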
3.2.3 Fitness Overview
The next largest design consideration is how the fitness of the chromosome
should be measured. The fitness calculation is absolutely crucial to the accuracy of the
system given that it controls how optimization occurs. In order to understand how fitness
will be calculated we must first discuss the concept of coverage and how we want the
fitness to behave within the system.
Coverage is defined as follows: the number of posts that contain a given word at
least once; for multi-word patterns, the intersection of these sets is taken.
Coverage was chosen over raw word frequency to avoid the amplified noise from
transition words. This concept of coverage is used heavily throughout the fitness
calculation.
The training and global sets help identify the linguistic patterns that are unique to
the target domain without any prior knowledge of that domain. Unfortunately, there
are often many linguistic patterns in the training set, and, given the nature of maximization
algorithms, they will naturally converge to a single optimum. Preventing local optima
from being discarded in favor of other classification optima is quite difficult, but is
required to discover the multiple classification patterns that exist in the training set.
Specifically, in genetic algorithms or differential evolution the algorithm will attempt to
maximize a chromosome's fitness value, so there must be a way of preventing lower-fitness
patterns from being removed from the population if they are potentially valuable.
Similar to how differential evolution improves on the genetic algorithm by only allowing
chromosome transitions when the fitness is higher, we introduce other restrictions to
achieve this. We limit the transitions that can occur between chromosome values to those
meeting the criteria below:

The fitness value of the new chromosome must be at least 90% of the current
chromosome's fitness. This allowed variation permits transitions to
non-increasing-fitness supersets and then later into more specific subsets.
Although fitness does account for the coverage size of a set, having this variation
allows a more consistent transition to the global optimum for a given
classification pattern.
The new chromosome must be a superset of the current set described by the
chromosome, or the ratio of training post coverage lost must be outweighed by
the ratio of fitness gained. This limits the number of transitions between
similar-fitness optima.
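A sketch of how the two acceptance criteria above might be checked; `cur_posts` and `new_posts` are the training-post sets each pattern covers, and the exact trade-off formula for the second criterion is an illustrative assumption.

```python
def accept_transition(cur_fit, new_fit, cur_posts, new_posts, slack=0.9):
    """Apply the two transition restrictions: new fitness must be at least
    90% of the current fitness, and a non-superset move must buy enough
    fitness to outweigh the coverage it gives up (formula assumed)."""
    if new_fit < slack * cur_fit:       # restriction 1: 90% fitness floor
        return False
    if new_posts >= cur_posts:          # restriction 2: supersets always pass
        return True
    lost = len(cur_posts - new_posts) / max(len(cur_posts), 1)
    gain = (new_fit - cur_fit) / max(cur_fit, 1e-9)
    return gain > lost                  # coverage loss must be outweighed

bold = {1, 2, 3}              # posts covered by "bold"
blackberry = {1, 2, 3, 4, 5}  # posts covered by "Blackberry" (a superset)
iphone = {7, 8}               # disjoint "IPhone" posts
print(accept_transition(10.0, 9.5, bold, blackberry))     # superset → True
print(accept_transition(10.0, 10.5, iphone, blackberry))  # disjoint, small gain → False
```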
These restrictions prevent lower-fitness patterns from converging to a single
global optimum even when there is a small fitness gain. An alternate approach, restricting
transitions based on a fitness-gain threshold, displayed poor results, slowing down
convergence while only protecting the n-best optima within the threshold's window of the
global optimum. Although the two restrictions above do not inherently prevent all
chromosomes from converging to a single optimum, they do assist in resolving multiple
clusters that have little to no intersection when multiple patterns exist. This property of not
explicitly restricting transitions outside of the current subset is required to allow
chromosomes initially seeded with noisy patterns to transition into subsets of actual
patterns. An example of how these restrictions would be applied can be seen in Figure
3-2: a transition from a chromosome with the word "bold" to the word "Blackberry"
would be allowed, since most of the training posts that contain the word "bold" also
contain the word "Blackberry". A transition from a chromosome with the word
"IPhone" to the word "Blackberry", on the other hand, would not be allowed unless the
difference in fitness values was extremely large. In this case the "IPhone" subset would
likely have insufficient representation in the training set and be interpreted as noise until a
classification pattern removes "Blackberry" and "Bold" from the training set.
Figure 3-2 – Visual representation of classification pattern sizes and their intersections.
Chromosomes with the word “bold” would be allowed to transition to the word “Blackberry” due to
their high degree of overlap.
In the event that a pattern has insufficient representation in the training set,
it will likely be interpreted as noise; an example of this can be seen in Figure 3-3.
This is resolved in later cycles of the classification algorithm when there is either
more data for the given set or the posts that belong to an already-discovered
classification exit the training set. Now that we have a basic understanding of how
chromosome fitness will be evaluated, we will go into detail about how it is calculated.
Figure 3-3 - Frequency analysis of words in the training post set; the true optima “Blackberry” and
“9900” are visible, while “Bold” and sub-trends such as “iPhone” are present but obscured by noise.
3.2.4 Fitness Function
Coverage can be more formally shown as the following equation:
Coverage = \sum_{i=1}^{PostSetSize} \bigcap_{j=1}^{ChromSize} \left( PostSet_i \cap Chrom_j \right) \qquad (1)
The fitness calculation is broken into 3 components listed below:
F_1 = \frac{TrainingCoverage_{Chrom}}{TrainingCoverage_{Chrom-1}}, \text{ where } 0 \le F_1 \le 1 \qquad (2)
The first factor, calculated with Equation (2), checks whether the new word adds value
relative to the previous n − 1 chromosome words over the training set. F_1 is bounded
within this range because the fraction can never exceed 1, and by the second iteration
the previous chromosome will have converged to a non-zero fitness value; this requires
that F_1 have a non-zero value during the first iteration, which in turn gives the
denominator a non-zero value in the second iteration. During the first iteration, when
there is no previous chromosome to compare with, initial experiments used the entire
training-set size, but this proved problematic as the chromosome size increased, since
the previous chromosome would match far fewer posts than that. This effectively assumed
that the previous pattern would match everything, which could only be true if a single
pattern existed in the training data. F_1 was later set to 1 for the first iteration.
This change was made for one main reason, which cascades into many other calculations:
it makes F_1 fitness values comparable between iterations, which in turn prevents
poor-fitness patterns in the next iteration from having higher relative fitness than
legitimate patterns in the previous iteration. This not only prevents legitimate patterns
from being removed after the first iteration, but also prevents patterns from being
artificially extended past their true length due to incomparable fitness values between
the first and second iterations. Although this has a slight inverse effect on the first
iteration, giving its chromosomes slightly higher fitness values, this is offset by the
fact that as the chromosome size increases F_2 is assumed to alleviate the difference in
fitness for highly overlapping sets, in which case the value of F_1 will remain
reasonably close to 1.
F_2 = \frac{ChromSize}{k \cdot AvgPostSize}, \text{ where } 0 \le F_2 \le 1 \qquad (3)
The second factor, calculated with Equation (3), weighs the length of the
chromosome; this offsets the fact that as the chromosome length increases, the subset
of matching training posts decreases even for legitimate patterns. k is a constant that
varies with the training set; it normalizes the impact that increased chromosome length
has on the fitness, so that chromosome length positively affects the fitness while the
pattern is not overextending past the true trend length. Manipulating k affects the
relative weight of F_1 and F_3: increasing k depreciates the value of chromosome length
and thus increases the relative weight of F_1 and F_3, while decreasing k increases the
value of chromosome length and thus decreases their relative weight. For our experiments
we left k at 1, which should also be the starting point for new training domains.
Although F_2 is normally bounded between 0 and 1, there is a case where it is unbounded:
if the average post size is very small and the classification pattern only exists within
the larger posts, then F_2 can exceed 1. In the event that the chromosome length ever
surpasses the average post length relative to k, the iteration should terminate early
due to the instability introduced by values greater than 1. In practice, however,
termination should occur naturally before this point, because all matching posts shorter
than the new chromosome length will no longer match, producing a significant decrease
in F_1 that the increase in F_3 should not be able to offset.
F_3 = \frac{TrainingCoverage_{Chrom} - GlobalCoverage_{Chrom} \cdot \frac{TrainingSetSize}{GlobalSetSize - TrainingSetSize}}{TrainingSetSize} \qquad (4)
The last factor, calculated with Equation (4), compares the number of training
posts matched against the number of global posts matched, normalized by the respective
size of each set; in the equation only the global coverage is explicitly normalized,
because this set is assumed to be much larger than the training set. F_3 is bounded
between −1 and 1; the lower bound occurs when the pattern is more predominant in the
global set than in the training set.
These factors are then combined using a harmonic mean, Equation (5), to
produce a single metric that balances them. This allows for variation in parts of the
fitness while heavily penalizing the fitness if any single factor has a low value.
Harmonic means are used heavily throughout NLP classification, and we extend their
application to the calculation of our fitness. The harmonic mean between two values can
be seen in Figure 3-4; whereas averaging the values would produce a flat plane between
0 and 1, the harmonic mean curves the plane so that the greatest result occurs when the
two values are very similar to each other.
Fitness = \frac{3 \cdot F_1 \cdot F_2 \cdot F_3}{F_1 + F_2 + |F_3|}, \text{ where } 0 \le Fitness \le 1 \qquad (5)
The absolute value of F_3 must be taken in the denominator, since its negative lower
bound would otherwise produce an asymptotic plane where F_1 + F_2 = −F_3. This allows
the harmonic mean calculation to incorporate F_3 without substantially changing the
shape of the function; Figure 3-4 shows the normal shape of a harmonic mean between two
elements spanning the range 0-1, while Figure 3-5 shows the effect of allowing F_3 to
take negative values while using its absolute value in the denominator.
Figure 3-4 – Harmonic mean of F1 and F2. Figure 3-5 – Harmonic mean of F2 and F3.
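As an illustration, the fitness computation above could be sketched as follows. This is our own minimal reading of Equations (1)-(5), assuming posts are represented as sets of words and that coverage counts the posts containing every chromosome word; the helper names and the set representation are ours, not the thesis's implementation.

```python
def coverage(posts, chrom):
    # A post matches when it contains every chromosome word (cf. Equation 1).
    return sum(1 for post in posts if all(word in post for word in chrom))

def fitness(chrom, training, global_posts, k=1.0):
    n_train, n_global = len(training), len(global_posts)
    cov_train = coverage(training, chrom)
    cov_global = coverage(global_posts, chrom)

    # F1 (Eq. 2): worth of the newest word relative to the previous n-1
    # words; defined as 1 on the first iteration so values stay comparable.
    if len(chrom) == 1:
        f1 = 1.0
    else:
        prev = coverage(training, chrom[:-1])
        f1 = cov_train / prev if prev else 0.0

    # F2 (Eq. 3): chromosome length normalized by the average post size.
    avg_post_size = sum(len(p) for p in training) / n_train
    f2 = len(chrom) / (k * avg_post_size)

    # F3 (Eq. 4): training coverage against global coverage, with the
    # global side scaled down to the training-set size.
    scaled_global = cov_global * n_train / (n_global - n_train)
    f3 = (cov_train - scaled_global) / n_train

    # Eq. 5: harmonic mean; |F3| keeps the denominator well defined.
    denom = f1 + f2 + abs(f3)
    return (3.0 * f1 * f2 * f3) / denom if denom else 0.0
```

A negative result here simply reflects a negative F_3, i.e. a pattern more predominant in the global set than in the training set.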
3.2.5 Population Transition
As previously mentioned, restrictions had to be put in place to prevent lower-fitness
patterns from converging to a single higher-fitness pattern. These conditions were
introduced in Section 3.2.3: first, the test chromosome must have at least 90% of the
member's fitness; second, the test chromosome must either be a superset of the current
chromosome, or the ratio of coverage lost must be outweighed by the ratio of fitness
gained.
The second condition is defined as follows:
\frac{Fitness_{Chrom_{test}}}{Fitness_{Chrom_{member}}} \ge \frac{Coverage_{Chrom_{member}}}{Coverage_{Chrom_{test}} \cap Coverage_{Chrom_{member}}} \qquad (6)
It should be noted that we only need to consider cases where coverage is lost: if
coverage is gained, i.e. the test chromosome is a superset, then the right-hand side
equals 1 and we are only looking at increasing or comparable fitness values. Conversely,
the right-hand side increases in value as the sets separate or narrow, in which case the
ratio of fitness gained must be compared against that of the coverage lost. A graph of
this condition can be seen in Figure 3-6, where the transitions were monitored for a
generation to see which transitions were being allowed, along with the range of
transitions occurring.
Figure 3-6 - Graph of passed vs. failed chromosome translations based on Equation (6). Red data
indicates failed translations; blue data indicates passed translations.
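A minimal sketch of this acceptance test, combining the 90% rule from Section 3.2.3 with the condition of Equation (6). The function and parameter names are ours; coverage values are assumed to be post counts, with `cov_overlap` the size of the intersection in Equation (6).

```python
def allow_transition(fit_test, fit_member, cov_member, cov_overlap,
                     is_superset):
    """Return True if a member chromosome may transition to a test one.

    cov_member is the member's coverage; cov_overlap is the number of
    posts covered by both chromosomes (the intersection in Equation 6).
    """
    # Condition 1: keep at least 90% of the member's fitness.
    if fit_test < 0.9 * fit_member:
        return False
    # Condition 2a: a superset never loses coverage, so it is allowed.
    if is_superset:
        return True
    # Condition 2b: fitness gained must outweigh coverage lost (Eq. 6).
    if cov_overlap == 0 or fit_member == 0:
        return False
    return fit_test / fit_member >= cov_member / cov_overlap
```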
3.2.6 Termination Condition
Due to the nature of the algorithm we cannot prove that it ever reaches the global
optimum; as such, a termination condition must be used to approximate the amount of
time required for the algorithm to potentially converge to the global optimum. Our goal
during the optimization is to find the best-fitness chromosome classification for the
training set while additionally finding, where possible, any other classification
patterns that may exist in the training set.
While most applications use either a fitness threshold or a number of generations
before the system is assumed to have converged, Figure 3-7 shows that the population
would often discover its best member relatively early on, with the other members of the
population then slowly converging to this pattern. Given this, we define the termination
condition to be when the average population fitness is greater than or equal to 75% of
the current best fitness of the population for 5 consecutive iterations. This percentage
allows us to be relatively certain that the best pattern has been found, while the
requirement of consecutive iterations ensures that the algorithm does not terminate
prematurely when the system is seeded with random data, allowing the population to
stabilize into various clusters before stopping; from that point on, lower-fitness
chromosomes are assumed to simply be converging to the current best chromosome. This
value would likely need to be tuned based on the training set's distance between
clusters, the population's size, and a variety of other factors for different domains.

Figure 3-7 – The average population fitness compared to the best member's fitness over 5 iterations.
Termination should occur after the 3rd iteration, since the fitness returned was lower than in the
previous iteration.
During our experiments we noticed that, depending on the value of the termination
percentage, the algorithm could fail to converge on some datasets or converge too early.
As a result we experimented with alternative termination conditions that had the same
end goal. These included fitting a polynomial curve to the population's average fitness
values and calculating the tangent at the current generation; if the tangent approached
zero for n consecutive generations, the population was assumed to have stabilized at its
optima. This approach displayed similar results while being more computationally
expensive. Results of varying the termination condition can be found in Section 3.5.5.
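The primary termination check described above could be sketched as follows; the representation of the fitness history as per-generation (average, best) pairs is our own assumption.

```python
def should_terminate(history, ratio=0.75, patience=5):
    """history: one (avg_fitness, best_fitness) pair per generation.

    Terminate once the average population fitness has been at least
    `ratio` of the best member's fitness for `patience` consecutive
    generations, indicating the population has stabilized.
    """
    if len(history) < patience:
        return False
    return all(avg >= ratio * best for avg, best in history[-patience:])
```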
3.2.7 Data Clustering
Once this classification has been done, we assume that we have collected most of
the classification patterns that exist within the training set, and we continue to
prepare the data for processing in the next stage. This involves organizing raw posts
into clusters so that the data can be evaluated against the rule database and processed
for rule extraction. Organizing these patterns in a tree structure allows for
hierarchical matching while also allowing partial classification patterns to be utilized
in the future. An example of this can be seen in Figure 3-8.
Figure 3-8 – An example of how data clustering creates a tree structure. Patterns such as “Blackberry
– Bold – 9900”, “Blackberry – Bold - 9800”, “Blackberry – Q”, and “Blackberry – Z” were merged to
form a tree structure due to their common parent “Blackberry”. This allows future patterns to be
appended to this same parent while also allowing for specific classifications or superset classifications
to be analyzed.
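The merging illustrated in Figure 3-8 could be sketched with a simple nested-dictionary tree; the representation is our own, not the thesis's data structure.

```python
def build_tree(patterns):
    """Merge ordered keyword patterns into a nested-dict tree, so that
    patterns sharing a prefix (e.g. "Blackberry") hang off one parent."""
    root = {}
    for pattern in patterns:
        node = root
        for word in pattern:
            node = node.setdefault(word, {})
    return root

tree = build_tree([
    ["Blackberry", "Bold", "9900"],
    ["Blackberry", "Bold", "9800"],
    ["Blackberry", "Q"],
    ["Blackberry", "Z"],
])
# All four patterns merge under the common parent "Blackberry",
# with "Bold" branching into the 9900 and 9800 leaves.
```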
This structure allows branches of the tree to be analyzed for specific patterns,
which may be unique to particular classifications. For example, analysis of a superset
such as “Bold” may indicate that these specific models have a unique characteristic not
displayed by other models, in which case a rule can be applied to just this portion of
the hierarchy. It may be that these models were only released in a specific area or are
exclusively locked to a specific network, and thus sales outside of this area or on
different networks may be suspicious.
3.3 Rules and Rule Database
In this section we will define and derive the rules that are initially provided to the
rule database, discuss how the rules can be implemented, and how the rule extraction
process functions.
The following assumptions are based on the tactics used by Portland, Oregon's
Burglary Taskforce Unit to identify reported stolen property online, and as such it is
assumed that the thief or low-level fence is operating accordingly. These tactics may
not apply to career criminals; however, the framework itself is flexible and can be
extended to include other characteristics confirmed to be indications of stolen
property. For simplicity we will continue to discuss our work under the assumption that
these characteristics are correct. The assumptions are as follows:
- The date and time of the posting will be relatively close to the time the item was stolen.
- The seller is in close proximity to where the item was stolen.
- The item is being sold for less than the average market price.
These three characteristics will hold for most stolen-property posts for the following
reasons: the thief will attempt to sell the item soon after stealing it, as we assume
the theft was opportunistic and not premeditated; the thief will steal property in
reasonably close proximity to where they reside due to its ease of accessibility; and,
to dispose of the item quickly, they will sell it for less than its current market
value. These identifiers are heavily utilized during the active searching process when
comparing a post to a reported stolen item. The date and time of the posting relative to
the time the item was stolen, and the seller's proximity to where the item was stolen,
are both relative attributes and rely on reported information; that is to say, we cannot
quantify or extract suspiciousness patterns based solely on the time or location of the
post selling the item.
We refer to these three identifiers as primary attributes due to the strength of their
assumptions.
Additional elements may be used to identify suspicious items but are generally
harder to automate. These characteristics are as follows:
- The post uses a stock photo, or a photo from another post, for the item being sold.
- The seller negotiates or displays the desire to sell the item away from their residence.
- The post contains a poor description of the item, and the seller does not appear to have much knowledge of it.
- The post indicates that the seller is overeager to sell the item.
- The post contains telephone or contact information for the seller that is spelt out or obfuscated rather than given in plain digits.
These identifiers, including a lower-than-average market price, are quite passive
characteristics which do not reference a reported stolen item. Because they are
independent of reported cases, we refer to them as secondary attributes; they cannot be
used to correlate a post to a reported case, but they can be used to narrow down the set
of posts that must be searched in the searching process. The reasoning behind these
attributes is discussed in more detail in Section 3.3.2.
3.3.1 Application of Primary Attributes
As previously mentioned, we must make a set of assumptions on how the thief will
operate based on the documentation that is available; validation of the accuracy of
these assumptions is left for future work. In this section we discuss how the three
primary attributes can be used to indicate whether a post is suspicious.
3.3.1.1 Date and Time
It is reasonable to assume that in most cases the theft will be opportunistic and not
premeditated; the thief may research the average selling price beforehand, but it is
unlikely they will create a post for the item before the theft occurs. Because of this
we can assume that the thief's post will be created after the time of the theft, likely
with very little delay. Although pre-selling the item online before it is stolen reduces
the amount of time the thief must hold onto the item, it carries additional risk, since
they must acquire the item in a much shorter timeframe, as the buyer assumes the seller
has the item on hand. Pre-selling an item online would likely only occur for very
identifiable and risky items, in which case it would eliminate the time the thief holds
onto the item. Due to the complexity and narrow application of this concept, however,
we ignore this case.
This attribute measures the difference between the time of theft indicated by a
reported case and the time the post was created. A variance of 12 hours should be
incorporated to accommodate inaccuracies in the reported time of theft; although the
error in the reported time should be much less than this, it serves as an upper bound
for the margin of error. This feature is weighted by an exponential decay, with a
half-life relative to the average time of sale. The average time of sale can either be
provided as initial domain knowledge or extracted from previous posts as the amount of
time a post was active. Although the latter method does not directly indicate that the
item was sold, it indicates either that the item sold or that the seller gave up on
using this medium to sell it; in either case, the same characteristics can be expected
when the thief attempts to sell the item.
Weighting this value with an exponential decay assumes that the risk of holding onto a
stolen item increases over time, and thus that the thief will attempt to sell the item
as soon as possible. This makes it very unlikely to see the item being sold after 1-2
times the average selling time, at which point it can be assumed the item has already
been sold. That is to say, even if a perfect match is found, if the post has been up for
an extended period of time then it is unlikely the item is still in the possession of
the seller, and as such it is unlikely to be recoverable at that point. Although this is
an unfortunate occurrence, the system should prioritize confident and realistic leads
that can result in item recovery.
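The decay weighting described above could be sketched as follows; the half-life parameterization and the treatment of the 12-hour slack are our own reading of the text.

```python
from datetime import datetime, timedelta

def time_suspicion(post_time, theft_time, avg_sale_hours, slack_hours=12):
    """Exponential-decay weight on the gap between the reported theft
    time and the post's creation time; the half-life is the average
    time of sale, and 12 hours of slack absorbs reporting error."""
    delta = (post_time - theft_time).total_seconds() / 3600.0
    if delta < -slack_hours:
        return 0.0  # posted well before the reported theft
    delta = max(delta, 0.0)
    return 0.5 ** (delta / avg_sale_hours)
```

With an average sale time of 48 hours, a post created two days after the theft receives half the weight of one created immediately.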
3.3.1.2 Proximity
It is assumed that the seller will not transport the item a far distance and,
conversely, will not travel a far distance to steal an item, such that the thief and the
victim exist within relatively close proximity. However, Portland's Burglary Taskforce
Unit stated that thieves would often attempt to sell the item away from their residence
or in adjacent cities; because of this we expect the location to be within the
surrounding area.

This attribute measures the difference in location, using only city granularity, as
more detailed locations could lead to inaccurate trends. The feature would be weighted
by either a linear or exponential decay, having high certainty within the same city and
lower certainty as we expand beyond it; alternatively, a Gaussian distribution may be
applied over the area surrounding the seller's location, since it is likely the item was
stolen from one of these locations, while beyond the immediately surrounding cities this
probability quickly decreases.
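The Gaussian variant could be sketched as follows; the spread value is a hypothetical choice, not taken from the thesis.

```python
import math

def proximity_suspicion(distance_km, sigma_km=25.0):
    """Gaussian weight over the distance between the post's city and the
    reported theft location; highest within the same city, decaying
    quickly beyond the immediately surrounding cities."""
    return math.exp(-(distance_km ** 2) / (2.0 * sigma_km ** 2))
```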
3.3.1.3 Price
It is assumed that the risk of holding onto an item for a prolonged period of time
is not worth the minor increase in price that could be gained, and thus it is preferable
to dispose of the item as quickly as possible. To capture this, the price attribute
measures the difference between the post's price and the average market price of the
item. The average market price and standard deviation can be extracted from clusters of
posts with the same classification. This provides an estimated window for the current
market price of the item, and posts priced outside of this range are suspicious. Only
suspicious prices that fall below the average need to be examined, since suspiciously
high prices would contradict our initial assumption; although this case may be
interesting, we leave it for future work, since it would not represent a large portion
of stolen items.
The suspiciousness of a post's price can be defined as 1 - (1/k)^{|\Delta p|}, where
\Delta p = |P_{Market} - P_{Post}| / P_{Market} is the variance in price from the market
average. This allows all posts to carry some suspiciousness, but only relative to their
difference from the market average; an example can be seen in Figure 3-9. k represents a
domain-specific constant designed to conform the suspiciousness of the posts to the
standard deviation from the market average.
Figure 3-9 – Price histogram of Blackberry Bold 9900 posts with their respective suspicion.
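A sketch of this score might look as follows, assuming the 1 - (1/k)^{|Δp|} form with a hypothetical k = 2 and scoring only below-market prices, per the discussion above.

```python
def price_suspicion(post_price, market_price, k=2.0):
    """Suspicion of a below-market price: 1 - (1/k)**|dp|, with dp the
    relative deviation from the market average. Prices at or above
    market score 0, per the assumption that thieves undersell."""
    if post_price >= market_price:
        return 0.0
    dp = abs(market_price - post_price) / market_price
    return 1.0 - (1.0 / k) ** dp
```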
As the market price changes over time, an acceptable timeframe in which posts
can contribute to the average must also be defined, so that the average is not inflated
by previously higher prices. Sites such as Craigslist restrict retrieval of posts more
than 6 months old, which is an acceptable timeframe for calculating a stable average
market price.

The standard market price, often referred to as the market average, can be
calculated in a variety of ways. Although simply averaging the prices of all posts under
a specific classification yields a value similar to what we would expect, it is often
substantially lower. As a result more advanced methods must be used; the simple average
does, however, give us an idea of where to begin searching for the true market price. As
we will discuss in Section 3.5.3, creating a histogram of the post prices allows us to
identify specific curvatures that indicate the true market price.
3.3.2 Application of Secondary Attributes
These attributes are often much harder to quantify and apply; however, they
provide a method of pre-filtering the posts before they are actively searched. They also
provide ways to infer suspicion, similar to a subconscious “gut instinct”, and as such
should be explored. Although these attributes are much weaker indicators than their
primary counterparts, and a single secondary attribute does not provide the same degree
of suspicion as a primary one, many secondary attributes in conjunction can strengthen
the suspiciousness of a post.
3.3.2.1 Stock Photos
This attribute attempts to account for the image that may be provided of an item.
One of the most predominant ways people attempt to identify their stolen item is through
uniquely identifying marks or characteristics; as such, we assume that the seller will
attempt to conceal these by either not displaying a photo of the item or displaying a
stock image instead.
To check if the image is unique, we take a hash of the seller's image (if provided)
and compare it against hashes of popular Google image results for that classification
and of other posts' images. This tells us whether the seller has used a common Google
image to represent their item or is reusing another post's image; either is a strong
indicator that, at the very least, the seller is being dishonest about the state of the
item. This concept could be extended to comparing features within the photos in an
attempt to find unique characteristics of the stolen item; however, that goes beyond the
scope of this system.
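The hash comparison could be sketched as follows. We use SHA-256 over the raw bytes for illustration, which only catches exact duplicates; the thesis does not specify the hash, and a perceptual hash would be needed to catch resized or recompressed copies.

```python
import hashlib

def image_digest(image_bytes):
    # Exact-duplicate check via a cryptographic hash of the raw bytes.
    return hashlib.sha256(image_bytes).hexdigest()

def is_reused_photo(post_image, known_digests):
    """known_digests: hashes of popular image-search results for this
    classification plus images already seen in other posts."""
    return image_digest(post_image) in known_digests
```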
3.3.2.2 Poor Description
This attribute attempts to quantify the seller's knowledge of the item they are
selling. We assume that the thief will not have much knowledge of the stolen item, which
will be true for most opportunistic thefts. In some domains, such as common electronics,
this may not hold; however, the attribute may be more useful in other domains.

To determine whether the seller has some knowledge of the item, we check whether the
seller is using a unique description of it. Similar to the photo hashing, we can compare
the seller's description to commonly marketed descriptions. However, because small
portions of the description, such as the contact information, may be altered, we cannot
use hashing; instead we must compare descriptions to find how much of the text overlaps
with other posts. For example, if the seller simply copies a technical specification
into their post, we would consider this suspicious. This may not always be accurate,
specifically in the case of electronic goods where sellers often copy descriptions;
duplicate descriptions in other domains, such as rental property listings, would be very
suspicious.
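The overlap comparison could be sketched with word n-grams; the choice of n and the asymmetric overlap ratio are our own illustrative choices.

```python
def description_overlap(desc_a, desc_b, n=5):
    """Fraction of desc_a's word n-grams that also occur in desc_b; a
    high value suggests a copied (e.g. manufacturer spec) description."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    grams_a, grams_b = ngrams(desc_a), ngrams(desc_b)
    if not grams_a:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a)
```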
3.3.2.3 Eager to Sell
This attribute attempts to quantify the seller's eagerness to sell an item. All posts
display some amount of eagerness, since everyone making a post has the desire to sell
their item. In comparison to the other secondary attributes, however, this attribute is
a strong identifier for potentially stolen items; since it is one of the main
characteristics we expect from a thief or fence, it should be present in all
stolen-property postings. Unfortunately this attribute is quite difficult to quantify,
as it cannot be directly measured like the other attributes; however, the seller saying
they are willing to reduce their price, or explaining why their price is lower than
others, can be used as an indication.
This attribute may be identified from short common phrases such as “want to sell
fast”, “want to sell as soon as possible”, “need cash quick”, etc. Even these few
examples show the diversity of implied eagerness to sell, which could be even more
subtly implied in the tone of the post. Using short 3-5 gram common strings we can
attempt to identify this attribute; furthermore, these strings could be compiled
automatically from suspicious posts. This concept will be explored further in the rule
extraction phase, as it is one of the most promising attributes that could be applied in
the classification of suspicious posts.
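The phrase matching could be sketched as follows; the phrase list extends the examples in the text with one hypothetical entry, and in the framework such lists could be compiled automatically during rule extraction.

```python
EAGER_PHRASES = (
    "want to sell fast",
    "want to sell as soon as possible",
    "need cash quick",
    "price is negotiable",  # hypothetical addition, not from the text
)

def eagerness_score(description, phrases=EAGER_PHRASES):
    """Count the eagerness phrases present in a post's description."""
    text = description.lower()
    return sum(1 for phrase in phrases if phrase in text)
```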
3.3.2.4 Contact Information
As mentioned by the Oregon’s Burglary Taskforce Unit many sellers may attempt
to obfuscate their contact information by spelling it out or adding format characters to
prevent search engines from effectively clustering their posts. This poses a problem since
this attribute becomes quite valuable when used to cross-reference the seller’s telephone
or contact information with other posts. Checking for inconsistencies such as different
seller names or telephone numbers for posts within the same timeframe is highly
suspicious. Although it is understandable that people move or change phone numbers
over time overlapping information should not be common, an example of why this is the
case would be the fact that telephone companies must hold onto telephone numbers for 6
months before it can be reissued; as such this would be an appropriate window of time to
check for conflicting information. Additionally this can be used to identify the seller and
infer suspiciousness between posts by the same seller, as it is unlikely that a thief will
only be selling a single stolen item. This concept also increases the suspiciousness of
their posts if multiple posts by the same seller have a small degree of suspiciousness.
This can be achieved by indexing the seller's full name (if provided) to their
contact information along with a date range of use; many overlapping ranges with
different contact information are highly suspicious. Conversely, indexing the seller's
contact information to their full or partial name along with a date range of use
indicates suspicion when the same number results in multiple names for a given date
range. In the first case the full name must be used, as indexing on only the first name
would collide with many different sellers with different contact information; in the
second case a partial name can be used, since we expect that a number will only map to
one or two people at a time. Given that local number registries must hold onto a number
for 6 months after it is released, there should be very few instances where the same
number maps to multiple legitimate sellers.
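The number-to-name conflict check could be sketched as follows; the tuple representation and integer day indices are our own simplification of the indexing described above.

```python
from collections import defaultdict

def conflicting_numbers(posts, window_days=180):
    """posts: (seller_name, phone_number, posting_day) tuples; flags
    numbers used under different full names within the 6-month number
    reissue window, which the text treats as highly suspicious."""
    by_phone = defaultdict(list)
    for name, phone, day in posts:
        by_phone[phone].append((name, day))
    flagged = set()
    for phone, uses in by_phone.items():
        for i, (name_a, day_a) in enumerate(uses):
            for name_b, day_b in uses[i + 1:]:
                if name_a != name_b and abs(day_a - day_b) <= window_days:
                    flagged.add(phone)
    return flagged
```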
Now that we have an understanding of the rules that can be used, the rule process
as a whole can be generalized as an iterative probabilistic model matching each of the
characteristics defined by the user input, giving us a list of matching postings ordered
from most probable to least probable and, in the case of passive detection, ranking
posts from most to least suspicious. Although we have attempted to incorporate many
rules into the rule set, it is far from complete; however, the reporting system can
serve as a hybrid of reinforcement learning and training set, based on confirmations by
the police as to whether a post matched that of the reported case. It should be noted
that we chose a database approach for the rules so that portions of the code do not need
to be updated as the rules change; we also offload the rule calculations to the rule
database. This allows for the external storage of rules and threshold values, making the
system somewhat more scalable should multiple instances of the framework be deployed.
3.3.3 Rule Extraction
The rule extraction process occurs after the posts have been classified into
clusters, processing specific clusters in an attempt to identify new patterns for each
cluster. This is intended to extract new patterns from suspicious clusters that can be
used to identify other aspects of suspicion; however, it can also be applied to
legitimate clusters to extract further metadata about them. One approach to determining
the seller's eagerness to sell was searching for n-gram phrases common amongst all
suspicious posts, excluding the keyword patterns that were used to cluster the posts in
the first place. A more relaxed approach of simply feeding the suspicious clusters back
into the classification system could also produce interesting results; removing the
n-gram requirement would allow catchy keywords or other aspects to be discovered.
3.4 Reporting Database
This system will be covered in greater detail in Chapter 4 however we will present
an overview of the system functions and requirements. The reporting database serves the
function of allowing the public or police authorities to submit the stolen reports into a
linguistically fixed database. This removes the issue of having to determine the context
and various other aspects of processing natural language before they can be entered into
the database. Users would be asked to select or fill in the information regarding their
stolen property, such as stolen item, make, model, color, etc. The input structure would
be rigidly controlled, leveraging domain knowledge of the target domain to prompt for
specific pieces of information such that there is no variance in the location of the
information and ensuring that no details are missed. The system’s input structure would be
imported from domain expertise for each domain but could later be extracted or updated
directly from the classification system’s tree structure.
This concept is similar to Shiri et al.’s work on a thesaurus-enhanced search
interface in [16], which leverages the domain information structure to iteratively present
information to the user such that it logically progresses to the user’s desired target. However
their use of thesauri to assist in the searching process is difficult for this application, since
domain-specific thesauri would be required but are very difficult to find. As such we
would either be required to manually create them based on domain-specific knowledge or
leverage machine learning to dynamically create and maintain them.
However since the system is using controlled fixed-field input from the reporting
system to compare against the natural language contained in the posts, there must be
some flexibility to deal with the different ways people can describe things. For example,
people may use abbreviations or inference based on surrounding context, such as “BB
Bold” or “selling IPhone”, without explicitly stating the manufacturer. This concept
extends to future linguistic abbreviations that are not yet known or have not yet
developed. To address this we can leverage the semantic
relations between nodes in the domain knowledge tree to infer missing pieces of
information. For example, a post carrying the classification “Bold 9900” displays a high
degree of overlap with the classification “Blackberry Bold 9900”, and as such we can
infer that the brand is “Blackberry”. From this inference we can also learn new linguistic
abbreviations as the data is processed and alias searches to include those keywords as
well. For example, if “BB” were detected as an abbreviation for “Blackberry”, then a
user searching for “Blackberry Bold 9900” would trigger searches for “Blackberry Bold
9900”, “BB Bold 9900”, “Bold 9900”, etc.
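A minimal sketch of how such alias-expanded searches might be generated; the alias table and the `expand_query` helper are hypothetical illustrations, not part of the thesis implementation:

```python
def expand_query(query_terms, aliases):
    """Generate alternative search queries by substituting learned aliases
    and by dropping the leading brand term."""
    variants = {tuple(query_terms)}
    for i, term in enumerate(query_terms):
        for alias in aliases.get(term.lower(), []):
            variants.add(tuple(query_terms[:i] + [alias] + query_terms[i + 1:]))
    # Also allow searching without the brand term entirely.
    variants.add(tuple(query_terms[1:]))
    return sorted(" ".join(v) for v in variants)

aliases = {"blackberry": ["BB"]}  # hypothetical learned alias table
print(expand_query(["Blackberry", "Bold", "9900"], aliases))
```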
This aliasing or semantic information can be extracted during the classification
process. Should two classification patterns display large amounts of overlap, say 80-90%,
then we can assume the two classifications are related. This holds true for a majority of
cases since upper portions of the classification tree will overlap for child nodes of the
same parent node, and thus their relation is the parental node. However we cannot
guarantee that the differing portions can be aliased or allowed to be traversed during the
comparison process until other attributes are examined. Given the following classification
pattern examples “Apple – IPhone – 5 – 64GB” and “Apple – IPhone – 5s – 64GB”,
although these two classification patterns share a large degree of overlap in their
corresponding keywords, we cannot alias “5” and “5s” because these are different
models. This can be confirmed by comparing the average market price and standard
deviation for each. However if we introduce the pattern “Apple – IPhone5 - 64GB”,
although this shares fewer similarities with the previous patterns, comparing its market
price and standard deviation with those of the posts selling the IPhone 5 shows
that the two classifications have a large overlap in price.
overlap within the parent nodes would be a strong indication that the two classifications
can be safely aliased. This allows the framework to handle various formatting methods
that users may use to indicate their item in posts, including potential spelling mistakes or
abbreviations. Additional measures could be introduced such as regular expressions
applied to each classification to determine if the differences are just formatting characters
such as spaces or dashes.
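The overlap-and-price test described above might look roughly like the following sketch; the 50% keyword-overlap threshold, the one-standard-deviation price test, and the `can_alias` helper are all assumptions for illustration:

```python
def can_alias(pattern_a, pattern_b, stats_a, stats_b, overlap_min=0.5):
    """Decide whether two classification patterns may be aliased.
    stats_* are (mean_price, std_dev) tuples for each cluster."""
    a = {p.lower() for p in pattern_a}
    b = {p.lower() for p in pattern_b}
    overlap = len(a & b) / max(len(a | b), 1)
    if overlap < overlap_min:
        return False
    # Require the price distributions to overlap within one std deviation.
    (mean_a, sd_a), (mean_b, sd_b) = stats_a, stats_b
    return abs(mean_a - mean_b) <= max(sd_a, sd_b)

# "IPhone 5" vs "IPhone 5s": high keyword overlap but distinct prices.
print(can_alias(["Apple", "IPhone", "5", "64GB"],
                ["Apple", "IPhone", "5s", "64GB"],
                (350, 40), (500, 40)))  # -> False
```

The price comparison is what prevents "5" and "5s" from being aliased despite their keyword overlap.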
3.5 Experiments
In this section we discuss a series of experiments performed to
measure and validate various components of the system. These experiments vary
from quantifiable tests, such as accurately classifying posts, to unquantifiable tests, such as
classifications used in rule extraction. We also describe the performance analysis of the
system and how various parameters can be tuned based on the desired results for a given
domain.
3.5.1 Experimental Environment
In our experimental environment a workstation running the proposed framework,
as shown in Figure 3-10, is used to retrieve ads from various online classifieds websites.
The proposed system is developed in C++ with Microsoft Visual Studio 2010 and run on a
workstation with an AMD FX-8350 3.0 GHz processor and 16 GB of RAM, utilizing the
Boost [17] library for regular expressions and lexical casting, and the GSL [18] library for
polynomial curve fitting. The primary source of ads was the Craigslist “mobile”
section [19]; an estimated 200,000 posts were analyzed between February and May
2013, collected from the Craigslist Canada and US sites. Initial experiments were
limited to a single site due to having to strip off irrelevant HTML code that could
potentially cause problems during the classification process.

Figure 3-10 - Diagram of the process the framework undergoes while determining suspicious posts.
3.5.2 Experiment – Online Post Classification Accuracy
This experiment tests the accuracy of classification pattern extraction, which
serves to later classify the posts; thus the accuracy of the extraction directly relates to the
accuracy of later classifications.
3.5.2.1 Single classification
In this experiment the intent was to extract the first keyword from a given set of
posts that would best describe the set of posts (i.e. the keywords that occurred in a
majority of the posts). We refer to this test as non-uniform because its distribution over
cellphone models is neither balanced nor natural; the training data only contains a single
model of cellphone. Looking at roughly 300 cellphone posts on Craigslist, we can see
from Figure 3-11 below that the keyword that occurred the most was “IPhone”, since most
of the differential evolution population converged to this point. Other spikes are
introduced by “noisy” transition words such as “With”, “Or”, and “To”, which are
substantially reduced by the inclusion of the global set but still indicate a significant
problem. The depth of the diagram indicates the generation with respect to differential
evolution, and from this we can see that, as expected, many major trends were not
discovered until much later in the convergence.

Figure 3-11 - Non-uniform distribution of training set posts; showing an optimum occurring for
keyword “IPhone”, with noise around the words “with”, “or”, and “to”.
Continuing the experiment with a different dataset containing Blackberry models,
we found that as the keyword requirements expanded the accuracy of the classification
patterns increased. As we can see below in Figure 3-12, after the 3rd iteration the highest
fitness keyword pattern was “Blackberry – 9900 – Bold” which is enough to identify the
posts. It should be noted that although this dataset should have only contained a single
classification pattern, the results returned by Craigslist contained multiple models.
Because of this we can see, based on the fitness values, that “Blackberry – Bold – 9900” is a
very strong trend while “Galaxy – S3” is a much weaker trend. Although “Galaxy – S3”
is a legitimate pattern, due to there being stronger classification patterns in the third
iteration, as indicated by the fitness values, this weaker trend is effectively ignored at this
time. In later iterations, when posts classified by “Blackberry – Bold – 9900” are not included
in the dataset, this trend would be more visible. A post can contain multiple models for a
variety of reasons; the most prominent is simply to increase the number of search results
that return the post. Slightly more legitimate reasons are when a post attempts to sell
multiple models or references a new model as the reason for selling the current cellphone.
Figure 3-12 - Highest fitness classification patterns for three iterations; in the first iterations we can
see a lot of noise but in subsequent iterations this noise is substantially reduced as the legitimate
“Blackberry” trend extends.
It should also be noted that keyword order currently does impact the fitness of the
classification; although this effect is often very minor, it is the result of the fitness function
weighting the fitness of the current iteration against the previous or base iteration. An
example of this is the fact that during the second iteration the pattern “Blackberry –
9900” had a very high fitness while “Bold - Blackberry” had a much lower fitness,
resulting in “Blackberry – 9900 – Bold” having a higher fitness than “Blackberry – Bold
– 9900”. Although this is irrelevant from a classification perspective, it prevents very low
fitness words from being appended to true keywords and producing comparable fitness values.
3.5.2.2 Multiple classifications
We also attempted to classify datasets that had multiple classification patterns
within them, in this case a uniform mixture of posts from “Blackberry” and “IPhone”. In
Figure 3-13 we can see that both patterns are present. Similar to the previous experiment
we can see that there still exists some noise but there are two clear major trends,
“Blackberry” and “IPhone”. However this result is different from the previous
experiment, as the fitness of the “IPhone” trend is comparable to “Blackberry”; although
the “IPhone” trend is weaker, it would not be interpreted as noise due to having
sufficient representation in the dataset.

Figure 3-13 - Uniform distribution of training set posts; showing optima occurring for both
“Blackberry” and “IPhone” keywords.
Posts that exist in multiple classifications may have equal membership to each,
and thus lower the confidence of the membership and subsequent “suspicious” criteria.
This likely explains why the strength of “Blackberry” and “IPhone” trends were not
equal, since some IPhone posts likely also contained the “Blackberry” keyword.
If we look more specifically at some of the posts within the dataset we see some
that contain multiple classification patterns, such as Figure 3-14, which contains the
patterns “Blackberry – Bold – 9900” and “BB – Bold – 9900”. From the system’s
perspective these are different classifications and would be treated as if the post contained
two completely different models. However if this information is actually an
alias, as it is in this example, we would expect to see large overlapping portions within the
classifications. Looking at “Blackberry – Bold – 9900” and “BB – Bold – 9900”, due to
the large overlap in the later parts of the keyword chain, “Blackberry” and “BB” can be
aliased. This aliasing information can be applied either directly on the posts via
pre-processing or afterward by mapping the two terms in an aliasing database used by the
reporting system.
3.5.3 Experiment – Price Extraction and Market Price
In this series of experiments we attempted to extract the price from clusters of
posts; prices were extracted using a regular expression searching for a monetary symbol
such as “$” followed by a series of numbers. Using the same dataset we manipulated the
gradient of the price histograms in attempts to extract information about the structure of
the data. In Figure 3-15 below we can see a price histogram for the classification
“Blackberry Bold 9900” which contains a few interesting characteristics: the average
falls around the value of $210, which very steeply falls off around $150. We can also see
that there are more posts with slightly lower than average prices than slightly above
average prices; this is expected due to the fact that undercutting other posts’ prices will
happen in this type of market. There also exists a peak at the lower end of the prices,
which is likely due to accessories or services being advertised for the product.
Figure 3-14 – Example of a post containing aliasing information for “Blackberry” and “BB” based on
extracted classification patterns.

Figure 3-15 – Price histogram of Blackberry Bold 9900; indicating the average price is around $210,
with more posts having a slightly lower price than that.
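The regular-expression price extraction can be sketched as follows. This is an illustrative Python version; the thesis uses Boost regular expressions in C++, and the exact pattern is an assumption:

```python
import re

# Assumed pattern: a "$" symbol directly preceding the amount.
PRICE_RE = re.compile(r"\$\s*(\d+(?:\.\d{2})?)")

def extract_prices(text):
    """Return every dollar amount found in a post, in order of appearance."""
    return [float(m) for m in PRICE_RE.findall(text)]

post = "Blackberry Bold 9900 for sale - $210 obo, paid $600 new"
print(extract_prices(post))  # -> [210.0, 600.0]
```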
Looking at the data again with a smaller gradient in Figure 3-16, we can see that
a similar trend exists: as outlined in black, there are services being advertised at the
low price range and there is a slightly slanted curve around the average from price
undercutting. However the sections highlighted in red indicate areas that are highly
suspicious, as they exceed the expected trends. The first is a very high peak around the $100
mark, which could indicate either higher end services or very low selling prices for
the item. This also occurs around the $150 price point.
In either case we see four major types of prices. The $1-70 range is expected to be
garbage data that has not been filtered properly into sub-classifications, advertising
peripherals or services such as batteries or repairs. The $70-140 range contains the
suspiciously lower priced posts. The $175-240 range is assumed to contain legitimate posts,
with the possibility of searching the lower bound with slightly relaxed suspicion. Finally,
the $275-400+ range is above the market average and does not need to be searched, since
it is unlikely that a stolen phone would be sold above market value.
Figure 3-16 – More granular price histogram of Blackberry Bold 9900; identifying suspicious
peaks at the $100 and $150 ranges.
Finally, looking at Figure 3-17 below with a slightly coarser gradient, we
can see that the average price is surrounded by very sharp declines which gradually
increase between $175 and $230, followed by another sharp decline. This
characteristic is interesting, as the edges of the average range are met with sharp
edges or bounds occurring at the $175 and $250 price points. Although these could simply
be disjoints in the data, they could also be clear indicators of the bounds of the average, and
from this, what data falls below it.
Excluding the upper and lower 20% of prices resulted in the price histogram in the
figure above, with an average price of $182.50. Although this average is slightly lower
than the true average of the phone, it is still close enough to the $200-210 peak to identify
that section of the graph. As previously stated, this was expected due to including the
“suspicious” prices in the average calculation. Although these experiments only
manipulated the gradient of the same data, each produced a unique result useful
in the market average price extraction process.
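The trimmed averaging used here can be sketched directly; the example prices are hypothetical:

```python
def trimmed_mean(prices, trim=0.2):
    """Average after discarding the upper and lower `trim` fraction of
    prices, reducing the pull of accessory posts and outliers."""
    ordered = sorted(prices)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

prices = [10, 100, 180, 190, 200, 210, 220, 240, 480, 1000]
print(round(trimmed_mean(prices), 2))  # -> 206.67
```

Dropping the two lowest and two highest prices here removes the accessory-level and outlier values before averaging.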
Figure 3-17 – The Blackberry Bold 9900 price histogram resembles the function |sin(x)/x|; also
displaying clear boundaries at the edges of suspicious pricings.
3.5.4 Experiment – Rule Extraction on Suspicious Clusters
Before beginning the experiment of testing the rule extraction process on
suspicious clusters, we verified that there were posts in the suspicious clusters that were
both suspicious and contained desirable patterns that we would want to extract. Below are
two examples of posts that indicate there are more complex patterns to identify. The first,
in Figure 3-18, demonstrates the person’s desire to sell the item quickly through the use of
“ASAP”. Additionally, the use of “cash pick up” could be an indication that the item
is stolen; alternatively, it could explain why the price is lower, as the seller is not
willing to travel at all to sell it. The second post, in Figure 3-19, demonstrates much
more complex patterns that we would want to identify: firstly, it uses a stock
image, and secondly, it requires meeting in person. Although the post provides extensive
information regarding the item, it does not provide any external method of contacting
the seller. Although this is largely speculation as to what these potential indicators could
mean, it is more important that the system be able to recognize these patterns.
Figure 3-18 – A post from a suspicious cluster that has a low price and indicates that only
“cash pick up” is acceptable, along with the indication that they want this transaction to
occur as soon as possible.

Figure 3-19 – Post from a suspicious cluster that uses a stock photo and requires that the
transaction occur in person.
This experiment was conducted on the suspicious cluster of “Blackberry – Bold –
9900” posts, which contained roughly 100 posts. Unfortunately, due to the small training
set size, no useful patterns were extracted from the set. For domains that lack sufficient
suspicious posts, a fundamental assumption would have to be made in order to increase
the training set size and continue the experiment: that the patterns among various branches,
or even across a domain, are consistent. If this assumption is true we would
still expect domain-specific trends to become indistinguishable from noise, but
broad domain trends to emerge. If this assumption is not true then it is likely
that no trends will emerge, requiring us to further analyze each domain
independently once larger suspicious clusters are obtained. Due to our lack of suspicious
data in other domains we attempted the same process on another domain.
This experiment was also conducted on the suspicious cluster of “IPhone – 4s”
posts which produced similar results to that of the suspicious cluster of “Blackberry –
Bold – 9900” when attempting to extract patterns from the cluster as a whole. However,
analyzing the cluster divided into the following price ranges: $110-135, $130-165, $165-190,
and $190-210, produced better results. Each of these price ranges contained 50-150
posts and reflects individual portions of the price distribution that are suspicious, as can
be seen below in Figure 3-20. It should be noted that the average price was roughly $200,
and because of that the $190-210 range acts as a control to compare the
other extracted trends against.
Figure 3-20 – Price histogram of IPhone 4s; indicating that the average price was around $200 with a
few suspicious peaks around the $125 and $150 ranges.
The results we were most interested in were those where the extracted patterns were
forced to follow an n-gram format, attempting to extract small strings unique to each data
set. It should be noted that the other price ranges were included in the global comparison
set for each experiment, so although the trends are weaker, they reflect the absolute
differences between these sets.
Price Range   Patterns
$110-135      CABLE -- ONLY
$130-165      SHAPE; CONDITION; PROTECTIVE -- CASE
$165-190      LOCKED; AMAZING -- CONDITION; PERFECTLY -- BUT; ASAP -- TO
$190-210      (no unifying patterns)
From the summary of the results in the table above, we can see that
although postings were pulled from cities all over North America, a common
keyword in all the price ranges was “Toronto”, which may indicate that this product is
limited to Canada. Ignoring patterns that reference this city or commonly listed prices
($120, $150, and $200), we can see that the $110-135 range contains patterns such as
“CABLE -- ONLY”, which indicates that components are missing and explains why
the post would have a lower price.
Looking at the $130-165 price range, we see many patterns that reference
“shape” and “condition”, which refer to the same thing, while patterns such as
“PROTECTIVE -- CASE” indirectly indicate that the item has not been
damaged while also attempting to add value to the item.
When looking at the $165-190 price range, we see many patterns referring to
“LOCKED”, which refers to the fact that the item cannot be used on a different
carrier and thus has limited usability and audience. We also see patterns such as
“AMAZING -- CONDITION” and “PERFECTLY -- BUT”, which may be exaggerating
the condition of the item. Interestingly, the only occurrence of the pattern “ASAP -- TO”
was in this price range; it is the only direct display of the seller’s desire to sell the
item quickly and is one of the more predominant patterns in this dataset.
Finally looking at the $190-210 price range, where the market average is located,
there are no unifying trends regarding the extracted patterns.
Although in this case these classifications cannot be directly translated
into rules in their current condition, they can be used in conjunction with NLP techniques
such as sentiment analysis to determine a post’s membership to a given price range. This
is useful as it allows for the ability to quantify some previously discussed attributes, such
as the poster’s eagerness to sell.
3.5.5 Experiment – Performance Analysis
In this experiment the intent was to analyze the performance of the classification
system by examining how its population changes over time. This
experiment involved running the classification algorithm 5000 times on a set of 4000
IPhone 5 posts in order to determine the point at which the system could assume a correct
classification for the set.
The results, which can be seen in Figure 3-21, show the percentage of iterations
that discover either the global optimum “IPhone” or the local
optimum “5”, both of which lead to the correct classification pattern. The best,
average, and worst lines refer to the margin between the best member’s fitness in a
population and the population average, scaled relative to the
fitness of the known optimum “IPhone”. From this graph we can conclude that running
further generations past 90 ultimately does not greatly improve single optima discovery
but simply transitions other members of the population into the discovered optimum.
However achieving this can be done in different manners; given that our termination
condition is composed of two components, we can either select a very low margin and
low consecutive generations or a very high margin and high consecutive generations. The
reason consecutive generation requirements are in place is to account for poorly seeded
data that results in a smaller margin than expected; an example of this can be seen in the
worst case, where the margin is 15% lower than it should be.
It should be noted that our selected termination condition was extremely
conservative; a 25% margin for 5 consecutive generations results in the correct
classification of even the worst case.
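A sketch of such a two-component termination condition; the direction of the margin comparison, the `should_terminate` helper, and the sample history are assumptions for illustration:

```python
def should_terminate(margins, margin_req=0.25, consecutive=5):
    """Stop once the best-vs-average fitness margin (relative to the
    known optimum) has met the requirement for N generations in a row."""
    streak = 0
    for m in margins:
        streak = streak + 1 if m >= margin_req else 0
        if streak >= consecutive:
            return True
    return False

# Hypothetical margin per generation as the best member pulls ahead.
history = [0.05, 0.10, 0.30, 0.35, 0.40, 0.45, 0.50]
print(should_terminate(history))  # -> True
```

Requiring the margin to hold for several consecutive generations guards against poorly seeded runs that briefly meet the margin by chance.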
Figure 3-21 – Performance analysis graph; showing the percentage of iterations that achieved at least
one member of the population to the global optimum “IPhone”; Best, Average, and Worst lines
indicate the margin between the best member in the population and the population average relative
to the fitness of “IPhone”.
3.6 Discussion
3.6.1 Ad retrieval
The amount of time spent in the ad retrieval stage is quite significant, and as the
number of input sources increases the performance of the system will degrade, as much
of its processing will be maintaining an up-to-date database. To address this issue we
looked into leveraging the HTTP status code 304 “Not Modified”, which would allow for
checking if a page has been modified (either search index or leaf pages/posts) without
re-downloading it. Although this has not been fully implemented, it would make sense to
simply maintain the connections with the input sites and periodically check for
modifications to the posts along with new posts. This is due to the fact that the posts
could be altered shortly after creation to correct something or add additional information
which would have been missed in the initial download.
Alternatively, the files could be re-downloaded and the page hash compared to that
stored locally; although this still consumes bandwidth and some processing resources, it
removes the need to fully reprocess unchanged pages.
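The hash-comparison alternative can be sketched as follows; the `page_changed` helper is hypothetical, and the thesis does not specify a hash function (SHA-256 is an assumption here):

```python
import hashlib

def page_changed(new_html, stored_hashes, url):
    """Compare a freshly downloaded page against its stored hash; only
    reprocess (and update the stored hash) when the content changed."""
    digest = hashlib.sha256(new_html.encode("utf-8")).hexdigest()
    if stored_hashes.get(url) == digest:
        return False  # unchanged: skip full reprocessing
    stored_hashes[url] = digest
    return True

hashes = {}
print(page_changed("<html>post v1</html>", hashes, "post/123"))  # True
print(page_changed("<html>post v1</html>", hashes, "post/123"))  # False
print(page_changed("<html>post v2</html>", hashes, "post/123"))  # True
```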
3.6.2 Price Extraction
One of the main drawbacks of the rule extraction process is that if no
additional trends emerge to support the suspiciousness, then the only suspicious attribute
being checked is the post’s price. This is due to the fact that other primary attributes
cannot be leveraged at this point, since they require a user-submitted report to compare the
time and location components against.
Because of this, newly released items will not conform to traditional price models and
there will not be enough data to extract suspicious characteristics. This becomes a
significant problem if the extracted features change faster than the time it takes to
compile an adequately sized training set from which the trend can be extracted. An example of
this is the price feature: when there were enough posts to classify a new type of
item, the extracted price would be incorrect due to having both the item’s launch prices and the
most recent prices included in the price calculation. This can somewhat be resolved
by excluding a percentage of the upper and lower price ranges; however, that only solves the
issue for posts containing higher prices. For example, when a newer model
of the current item is released we would expect to see a drop in the value of the
current model; if these lower prices are excluded from the price calculation then the average
is distorted above the item’s true value, resulting in a significantly higher rate of false
positives. This can only really be solved by providing more domain knowledge to the
system about expected or newly released models, as the system would not have enough
posts to properly classify the new model by the time the old model’s value dropped.
However it should be noted that the price component is modular and can be changed to
account for this in the future should a better technique or price model be developed.
Duplicate posts on multiple sites also introduce issues, biasing the market average
price towards the duplicated price (either higher or lower), since multiple posts with that
price are taken into account when calculating the market average. This can be
addressed to a limited extent by hashing the content of each post and checking between
sites, though this may introduce errors and delay.
Price inconsistencies within a post have also been noticed, introducing issues
since we must determine which price to use in the market average calculation. For our
work we chose the highest price contained in the post, after noticing that a large number
of posts containing inconsistencies would list a lower price in the title to entice
prospective buyers to click the post and then a slightly higher price within the
body. However this approach is not complete, since there were also cases of sellers
stating their original purchase price and then their asking price, which would result in an
incorrect price being extracted. More complex semantic analysis may produce better
results, weighting the distance between the prices and the approximated point of
product classification; an example of this can be seen in Figure 3-22. This weighting may
be inverted due to a majority of posts containing classification information in the title.

Figure 3-22 – An example of possible semantic analysis; determining the correct price by the distance
between the post’s classification and the prices listed within the post.
3.6.3 Rule Extraction
The rule extraction process assumes that there are keyword trends in the posts that
can be found using n-gram or keyword classification. This assumption could
be wrong, and the required rules may be much more involved, requiring tonal and
inferential NLP techniques. This ultimately comes back to the issue that there is no
controlled training and testing set against which this model can be compared. However,
if there are simple trends that can be harvested, then it is possible to generate an initial
training and testing set based on this framework for many domains.
Chapter 4
Reporting System
In this chapter we introduce the reporting system, which serves two major
purposes. The first is that it acts as an interface between harvested data from posts and
users’ reported stolen items, automating the searching process when users are looking for
stolen property. This allows the public or police authorities to submit reports of stolen
items into a linguistically fixed database, removing the need to determine sentence
structure, context, or other aspects of processing natural language before they can be
entered into the database. The second purpose, while slightly more abstract, is to allow the
leads generated through the initial suspiciousness classification to become actionable.
Although suspicious clusters may identify key sellers, the authorities may not always
have the resources to investigate these sellers without reported cases of theft. That is to
say, without a user reporting a case of theft to the police, and in turn the reporting system,
suspicious clusters cannot be actively investigated until their suspiciousness is
confirmed.
4.1 Design
Building on the designs of previous research in [16], the reporting process from
the user should involve traversing down a hierarchy of input parameters based on the
domain knowledge of the system. These relations are more formally denoted as Broad
Term (BT), Narrow Term (NT), and Related Term (RT), which indicate the relations
between various nodes within the domain knowledge tree. An example of how these
relations would be applied in this research can be seen in Figure 4-1; “Blackberry” and
“Apple” are both types of cellphones and thus have an NT relation with the node
“Cellphones”; inversely, the “Cellphones” node has a BT relation with the “Blackberry” and
“Apple” nodes. While these two-way relations are quite simple, there also exists a two-way
RT relation between the “Blackberry” and “Apple” nodes, since they are both NTs of
cellphones. This pattern of relations will often occur at each level of the tree but will also
relate to other distant nodes based on the domain knowledge the tree was built on. The
concept of related terms can also point to orphaned nodes, which effectively aliases these
nodes allowing for these related nodes to indicate an alternative context for the input. An
example of this could be the “Blackberry” node being aliased to the “BB” node, which is
sometimes used as an abbreviation. This related information could also be used when the
user is inputting reporting information; giving them the contextual clues that Apple is
indeed the correct manufacturer they want to select based on other related products they
offer. If all relations of two way, then the division of BT and NT directionally is no
longer required and can be represented by a single traversable link relating two nodes.
Thus, for the rest of this chapter, the labels BT and NT will indicate unidirectional relations, while all unlabeled links between nodes will be bidirectional; an example of this can be seen in Figure 4-2. This convention allows us to define traversal restrictions within the domain knowledge tree. For example, we would normally want to allow searching of other sub-branches of the tree to compensate for user input error, but this would not be desired when traversing up to “Cellphones”, since that changes to a different major classification within the domain. There may be rare cases where this transition is allowed, but it results in effectively ignoring all of the user's input.
While not all related terms will be aliases for other nodes, they can also indicate other categories that relate to the parent node in other parts of the tree. The best example of this is that manufacturers such as Apple don’t solely make cellphones, so there will exist multiple “Apple” nodes in subcategories such as Computers, Electronics (MP3 players), etc. In this case the related terms infer other products that the company makes without merging those portions of the tree and destroying the hierarchy. It is also very important how these relational links are traversed, particularly in the case of related terms when the
related node doesn’t exist within the same branch of the hierarchy. These relations may be best served to give the user context as to which node they are traversing to, such as giving a list of subcategories of products that the company offers in order to give them contextual clues.

Figure 4-1 – Fully annotated relations between nodes of the domain information tree.

Figure 4-2 – Limited annotated relations between nodes of the domain information tree.
Given that we now understand how we can use this information, we must discuss how we can retrieve user input as well as populate the nodes within the domain knowledge tree itself. One of the cornerstone concepts of this system is that we have complete control over the reporting element. This means that we can have strict control over how the user inputs information as well as the format of this information. Having the user input in a fixed format with only selectable options during the reporting process removes the requirement for natural language processing and cuts down the requirements for thesaurus or alias matching.
When considering what input information is needed from a user’s report, it can be divided into the following two categories: report information and product information. Report information is information that is required for all user reports regardless of domain, such as the date and time the item was stolen, the location, etc. Product information is domain-specific information, where the amount of input and the input formats are domain dependent, specified in the domain knowledge tree. This allows the report information to be formatted statically while the product information is formatted dynamically in relation to how the user is prompted to enter information.
Given that there is a variable amount of information that we want the user to input, and that we want them to input as much information as possible, we must structure the input format such that it is logically presented and not overwhelming to the user. We prompt the user for product information relevant to a specific node before allowing them to traverse down the tree. This also allows the domain information to be requested per node based on the user’s response, while maintaining control of the structure of the input data. In Figure 4-3 we can see how the user is prompted for more information based on the previously selected input, each step expanding the required information. User reported cases can be stored in one table containing both report information, such as date, time, and location, and product information; specifically, an index to the node within the domain knowledge tree that the report references.
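The single-table storage of reported cases described above could be sketched as follows; the field names and types are assumptions for illustration.

```python
# Sketch of a stored reported case: static report fields plus an index into
# the domain knowledge tree. Field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ReportedCase:
    report_id: int
    stolen_at: str   # date/time of the theft, e.g. ISO 8601
    location: str    # where the theft occurred
    node_id: int     # index of the referenced domain knowledge tree node

case = ReportedCase(report_id=1, stolen_at="2014-05-05T20:00",
                    location="Toronto", node_id=42)
```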
Now that we have an understanding of how the user will be prompted to submit their information, we need to address how the domain knowledge will be imported and how much of this domain information will be used to obtain user input. There are a few different ways in which this can be done: manually creating the hierarchy from
a domain expert, automatically parsing manufacturers’ websites to extract product information, or leveraging the tree from the previous classification process.

Figure 4-3 – Displaying the user input process as it traverses down the domain information tree (Root → Cell Phones → Apple → IPhone 4/4s/5/5c); on the left is an example of the user interface and on the right is the traversal within the domain information tree.

While it is
clearly not ideal to use a domain expert to create the hierarchy, due to the increased overhead and required human intervention, in some cases it may be required, and it would likely be beneficial for an expert to review and correct an automated approach. Based on this review, an automated approach could then be improved to address discovered issues, lessening the system's dependency on a domain expert. However, the automated approaches unfortunately form a dichotomy: parsing manufacturers’ websites will yield an extraordinary amount of information about the product, but much of this information will be unused or implicit to the product anyway. In contrast, leveraging the previous section’s classification hierarchy will extract the information that is used in a post on average, as this is how the classification process is derived, but our knowledge of the domain is then clearly incomplete, and additional information may be beneficial in some cases. These two approaches represent the two extremes of information depth, complete vs. minimal, neither of which is appropriate for the system. While completeness seems like the better approach, we must consider the user entering information into the system; if they are required to give detailed information about their product that is infeasible for a user to know, then this becomes a usability issue, especially when the requested information is inherent to the product anyway.
This usability issue could be addressed by making some input fields optional; however, some of the control over the inputs is lost with no real gain, as the fields people don’t know would be left blank anyway. This problem is compounded in either case when users attempt to give the system more information about their product than they actually know, resulting in users inputting false information that would incorrectly classify the products in the report. Some of these issues can be resolved by creating a hybrid of the two approaches: requiring user input for fields that are known to the classification system, while allowing optional input for any additional fields that are created from the manufacturer's specifications. This doesn’t address the issue of users entering misinformation by mistake, which is left for future work.
4.2 Domain Knowledge Structure
Domain knowledge about the reporting system can be stored in any manner; however, a database approach was chosen to allow logical and intelligent queries to directly access specific nodes. Having the design follow a tree structure, with a table for nodes and a table for links, provides the additional benefit of allowing recursive queries to be built that gather entire subsections of the tree, while also allowing post indexing and searching to be processed within the database. This is beneficial because the processing load can be offset to another server or cluster of servers, making the framework modular with respect to the classification and reporting system, and allowing distributed domain expert classifiers to map to a common reporting system.
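The nodes/links design and the recursive subtree query described in this section could be sketched with an in-memory SQLite database; the table and column names are assumptions for illustration.

```python
# Sketch of the nodes/links tree tables and a recursive query that gathers
# an entire subsection of the tree; schema names are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE links (parent INTEGER, child INTEGER);
INSERT INTO nodes VALUES (1,'Cell Phones'),(2,'Apple'),(3,'IPhone 5'),(4,'Blackberry');
INSERT INTO links VALUES (1,2),(2,3),(1,4);
""")

def subtree(con, root_id):
    """Recursively resolve all downstream nodes of root_id, keeping a
    level value so the location within the tree is maintained."""
    return con.execute("""
        WITH RECURSIVE sub(id, level) AS (
            SELECT ?, 0
            UNION ALL
            SELECT links.child, sub.level + 1
            FROM links JOIN sub ON links.parent = sub.id
        )
        SELECT nodes.name, sub.level FROM sub JOIN nodes ON nodes.id = sub.id
    """, (root_id,)).fetchall()

rows = subtree(con, 1)  # (name, level) pairs for the whole subsection
```

The level column here corresponds to the location information mentioned later for Figure 4-5: adding it to the recursive query preserves each node's depth beneath the selected root.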
The first database would store all the report metadata, specifically the date, time, and location of the theft. Each of these elements would require a unique identifier to order them consistently within the generated reporting page, along with a system name, a user-friendly name, input parameters/restrictions, loaded modules, and module arguments. Both the system name and the user-friendly name are required, since metadata about the attribute, such as UTC or GMT, can be stored within the system name; this allows time zone conversions from the poster's local time to the consistent time format used within the database. The input parameters field acts as the source of input sanitization, applied both client and server side, such that if freeform input is allowed it is restricted based on conditions specified in this field. Additionally, the module name field is in place for any 3rd-party module that may be required for interactive elements, such as a calendar for selecting the date rather than confusing the user about the date format, along with any module arguments that may be required. An example of this can be seen in Figure 4-4 but could easily be extended based on the system requirements.
Figure 4-4 – An example of the table structure and potential table input for report information attributes.
This may seem unnecessary, but similar concepts are required for all domain information to keep it robust and extendable. A recursive query can be used to resolve downstream branches of the tree, resulting in the following sub-nodes being resolved along with their respective attributes; this can be seen in Figure 4-5.
This would then be extended to retrieve the domain information regarding those specific attributes; for example, before transitioning down to more specific attributes related to “IPhone” or “Blackberry”, the user would be prompted to enter the generic smartphone attribute information before continuing. Although not displayed in Figure 4-5, information regarding the location within the tree can be maintained in the query by adding a level value along with the parent node information. A visual representation of the tree from which the previous tables were designed can be seen in Figure 4-6.
Figure 4-5 – This result displays the input requirements based on which node is selected; for
example the color attribute must be specified if the smartphone node is traversed.
Figure 4-6 – The domain knowledge tree that was used for Figure 4-5.
4.3 Searching Process & Indexing Classified Posts
Although we’ve previously classified the online posts, they are not classified or indexed to the same degree as reported cases; as such, we cannot directly compare a reported case to online posts. This is because the broader classification patterns derived from the online post clusters are used to classify online posts, while more precise domain knowledge is used to classify reported cases. We must either attempt to leverage the domain knowledge to further organize the online posts into the same structure as our reporting system's domain information, or we must match the data in the reporting system to the online posts. As discussed in the previous chapter, the classification system cannot directly use the domain knowledge tree due to its lack of flexibility when dealing with online text; as such, the extracted classification tree must be stored separately to maintain the integrity of the imported domain knowledge.
This can be achieved by indexing all classified posts to the respective node within the reporting system tree; or, if the tree structure differs significantly between the generated classification tree and the imported domain information's tree, by indexing the posts to the classification tree and keeping a mapping table between the two trees. Perfectly classifying online posts into the domain information tree structure may not be possible, or may result in most posts only being indexed to major nodes such as “IPhone 5”, which is no better than their previous classification. This is largely due to the lack of information present in online posts; many distinguishing features are most often not advertised. However, other attributes can be indexed based on report metadata, since that will be the largest distinguishing factor matching the reported case with posts. We will check items within the closest proximity and time to the theft, later checking other related nodes, such as other versions of IPhones, in the event that the item was misclassified.
This allows all sub-classification posts to also be compared, while optionally moving upstream a limited number of nodes to compensate for user misclassification. While completeness would dictate that we should traverse up to the root node of the domain,
doing so is both impractical and infeasible, as it would ignore both the user's submitted data and the classification process in order to compare the user's report to every post within the domain. That is to say, we must bound the trustworthiness of the user's input to an extent; we should be able to assume they got the manufacturer correct, and possibly the model as well. This can be done simply by lowering the system confidence in potential matches for each upward traversal, or by explicitly restricting upward traversals with NT unidirectionality for child nodes beneath key root nodes. An example of a formula that could be used for system confidence relative to traversals can be seen below, while a visual representation can be seen in Figure 4-7:
Confidence = 100 − 2^(n+3)   (7)

where n is the number of transitions currently performed and the maximum number of transitions is 3.
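A small executable sketch of Equation (7) as read here, Confidence = 100 - 2^(n+3); this reading of the exponent is an assumption, chosen because it keeps the confidence between 0 and 100 for the stated maximum of 3 transitions.

```python
# Sketch of the traversal-confidence formula, Equation (7); the exponent
# form 2**(n + 3) is an assumed reading of the thesis's formula.
MAX_TRANSITIONS = 3

def traversal_confidence(n):
    """Confidence (%) after n upward traversals; None once the cap is exceeded."""
    if n > MAX_TRANSITIONS:
        return None
    return 100 - 2 ** (n + 3)
```

Under this reading, n = 0..3 yields confidences of 92, 84, 68, and 36, a steepening penalty for each traversal away from the node the user selected.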
Figure 4-7 – A graph of Equation (7) displaying the confidence as the number of transitions increases.
Following the example above, let’s assume a user reports that a 16GB silver IPhone 5s was stolen in Toronto around 8pm on May 5th. Although the user would also be required to give additional specifications, for the purpose of this example we will ignore those inputs and assume they are not utilized in the search process. The system would attempt to match this report against posts classified into the same node in the tree (IPhone 5s Silver 16GB), checking that the date and time of the post are after the time of the theft (8pm May 5th) and that the post is selling the item in relatively close proximity to where the theft occurred (in the Greater Toronto Area). As previously discussed, the post's date and time should always be after the time of the theft unless the theft was premeditated, which will not be the case for a majority of thefts. The difference between the time of theft and the time of posting should also be minimal; as this time increases, the probability of the post being the correct match decreases. As for the selling location, it would not be impossible for the item to be sold in farther locations such as Kitchener or Hamilton, but as the distance increases the probability of a match decreases exponentially. If no matches are found within the surrounding area, or the confidence of these matches is very low (i.e., the distance or time variance is very large), the searching process repeats from the parent node; in this case it would repeat from the “IPhone 5s Silver” node. This process can search all other child nodes or can exclusively search the parent node, based on system settings; we can choose to inclusively search these nodes, or assume the information that resulted in a different classification, such as color, is correct and sufficiently excludes them from being candidate matches. This process would repeat until it reached the “IPhone” node, and would terminate after searching that node, because any further upward traversal would no longer reference any user input. Once the searching process ends, the results are returned to the authorities; a detailed discussion of what information can be returned to the user can be found at the end of this chapter. It should be noted that although the system settings determine the inclusivity of other child nodes after an upward traversal in the domain knowledge tree, the searching process should be inclusive for all nodes below the node to which the user's reported case was indexed.
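The search walk-through above can be sketched as follows; the data structures, the proximity helper, and the node names are illustrative assumptions, and the time and distance filters are reduced to simple predicates.

```python
# Sketch of the upward-traversing search: match posts at the indexed node,
# filter by time and proximity, and retry from the parent node when no
# acceptable candidates are found. All names and data are invented.

def search(reported_node, posts_by_node, parent_of, theft_time, theft_place, near):
    """Return (node, candidate posts), traversing upward on empty results."""
    node = reported_node
    while node is not None:
        candidates = [p for p in posts_by_node.get(node, [])
                      if p["time"] > theft_time and near(p["place"], theft_place)]
        if candidates:
            return node, candidates
        node = parent_of.get(node)  # repeat the search from the parent node
    return None, []

posts = {"IPhone 5s Silver 16GB": [],
         "IPhone 5s Silver": [{"time": "2014-05-06", "place": "Toronto"}]}
parents = {"IPhone 5s Silver 16GB": "IPhone 5s Silver",
           "IPhone 5s Silver": "IPhone 5s",
           "IPhone 5s": "IPhone",
           "IPhone": None}
node, found = search("IPhone 5s Silver 16GB", posts, parents,
                     "2014-05-05", "Toronto", near=lambda a, b: a == b)
```

A full implementation would additionally lower the match confidence on each upward step, as in Equation (7), and stop at the “IPhone” node rather than walking all the way to the root.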
4.4 Updating Domain Information
As new models are released for a given domain, the information in the reporting
system’s domain information tree will become outdated. There will naturally be a delay
between the time the item is released to the public and the time the classification
information is added to the reporting system.
First, let’s consider a scenario where the domain information is not updated as new models are released. If we look at an existing manufacturer releasing a new model, such as the Apple IPhone 6, when posts containing this classification are mapped to the domain information tree there is no matching child node beyond “IPhone”, and thus those posts would be indexed to the “IPhone” node. This causes a lot of structural information to be lost during the mapping process and results in a large number of posts being mapped to the manufacturer node or nodes directly below it. The issue becomes clearer if we look at manufacturers that don’t have a logical naming convention for their models, which would result in new models being mapped directly to the manufacturer node. If we consider new manufacturers entering the market, such as HTC, when posts containing this classification are mapped to the domain information tree there is no match at all within the “cellphone” sub-tree. This would result in posts being inappropriately mapped to the “cellphone” node, and, due to the searching limitation previously introduced (transitions up to the “cellphone” node are restricted), these posts become inaccessible to searches.
Looking at these scenarios from the user’s perspective, the user will be looking to select a manufacturer or model that doesn’t exist within the list. Even if we allow the user to manually define the manufacturer or model they wish to report, which sacrifices control over the user input, there still exists the problem that the system lacks domain information for the user's desired input. If we simply stop the user input at the point where the domain information ends, we will have very little information about the stolen item; however, we lack the specific domain information to continue accepting user input.
From this scenario we can see that lacking domain information causes issues, and it also introduces the following problems: how can we handle user input for items that don’t exist within the domain information tree, and how can the system determine when a domain knowledge update should be requested? These issues are related, but they are associated with different viewpoints; the first is a user-centric problem while the second is a system-centric problem. To deal with the first issue, the system should be able to accept reports for items it doesn’t have domain information about, as otherwise the usability of the system is drastically lowered by being unable to handle new items. The system must allow the user to submit their own information about an item that it has little to no knowledge about. As mentioned previously, allowing free-form input from users is not desired; however, the system does have metadata about where the item should be located.
For example, if a user is attempting to report a new model of IPhone, they will naturally traverse down the smartphone and Apple branches. Where the user stops is important, as it gives us an upper bound on the scope of parameters we want to look at. From this point we would want to collect the minimum amount of information from the user and request more information once the domain knowledge is updated, in a way handling the user's report as best we can with no actual domain information about the item. A minimalistic approach is more beneficial since it reduces the chance of user error while also reducing the amount of user input that must later be verified.
An example of how this would be triggered can be seen in Figure 4-8. The user would be presented with input fields similar to those handled previously, but these fields would be generated using the highest-probability subcategories of the parent node where they stopped. In this example we would compare all the sub-nodes and attributes of all the other models of IPhones in an attempt to find information potentially related to their item. This would not work if the user stops very high up in the tree, say at the “Apple” node, because that would imply a new product line with likely unique characteristics rather than a small variation such as a new model. An example of how the attributes would be generated can be seen in Figure 4-9. However, attribute prediction can be wrong unless older models' attributes are weighted less than newer ones; with equal weighting, or depending on the scale of the weighting, the color attribute is incorrectly predicted in the example above.
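The attribute prediction idea, including the recency weighting suggested above, could be sketched as follows; the sibling data and the weighting function are invented for illustration.

```python
# Sketch of attribute prediction from sibling models: aggregate attribute
# values across siblings, weighting newer models more heavily so stale
# values (e.g. discontinued colors) rank lower. All data is invented.
from collections import defaultdict

def predict_attributes(siblings, recency_weight):
    """siblings: list of (age_rank, attributes dict); lower rank = newer.
    Returns, per attribute, its values ordered by weighted score."""
    scores = defaultdict(lambda: defaultdict(float))
    for age_rank, attrs in siblings:
        w = recency_weight(age_rank)
        for attr, values in attrs.items():
            for v in values:
                scores[attr][v] += w
    return {attr: sorted(vals, key=vals.get, reverse=True)
            for attr, vals in scores.items()}

siblings = [
    (0, {"size": ["16GB", "32GB", "64GB"], "color": ["Silver", "Gold"]}),  # newest
    (1, {"size": ["16GB", "32GB", "64GB"], "color": ["Black", "White"]}),
    (2, {"size": ["16GB", "32GB"], "color": ["Black", "White"]}),
]
predicted = predict_attributes(siblings, recency_weight=lambda rank: 1.0 / (rank + 1))
```

With this weighting, the newest model's colors outrank the older ones, avoiding the incorrect color prediction that equal weighting would produce in the Figure 4-9 example.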
To address the issue from a system perspective, knowing when to request new domain information can be done a few ways. If the system receives a large number of reported cases for items that it doesn’t have domain knowledge for, it may be an indication to initiate a domain knowledge update. A slightly more pre-emptive approach would be to initiate an update as new classification trends are noticed under nodes within the domain knowledge tree (such as IPhone). Both triggers may be desired due to the delay in emerging classification patterns. Domain information can be updated using a similar approach to how the domain information was initially input into the system. While new branches of domain information can be handled exactly as they were during the initial domain information import, which would be the case for new product lines, related sub-branches of domain information can be handled in a slightly more intuitive manner. Similar to how the user was prompted with derived characteristics of other sub-branches, this information can be used to filter manufacturer specifications down to a reasonable level, such that they are in line with the amount of detail requested in other nodes.
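The two update triggers could be combined in a single check such as the following; the thresholds and counters are illustrative assumptions.

```python
# Sketch of the two domain-update triggers: a volume of reports for unknown
# items, and emerging classification trends under an existing node.
# Threshold values are illustrative assumptions.

def needs_domain_update(unknown_report_count, new_trend_counts,
                        report_threshold=25, trend_threshold=10):
    """Return True when either trigger fires.

    unknown_report_count: reported cases for items the tree has no knowledge of.
    new_trend_counts: {candidate node name: posts seen in the new trend}.
    """
    if unknown_report_count >= report_threshold:
        return True  # reactive trigger: many reports for unknown items
    # pre-emptive trigger: a new classification trend under a known node
    return any(c >= trend_threshold for c in new_trend_counts.values())
```

For example, needs_domain_update(3, {"IPhone 6": 12}) fires on the pre-emptive trend trigger even though only three unknown-item reports have arrived.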
Figure 4-8 – An example of how users would select the “Other” option should their desired category not exist within the domain knowledge tree.
Figure 4-9 – An example of how attribute prediction would function; predicting the model sizes
correctly but predicting the model colors incorrectly.
4.5 Periodic Re-checking of User Reported Cases
The final aspect of this system that must be discussed is how often users' reported cases should be re-evaluated against the classified posts. This can be done with either an active approach or a passive approach: by comparing newly classified posts as they enter the system, or by waiting an interval of time before generating a comparison against the current classified posts.
The largest benefit of an active approach is that it gives real-time results while only comparing the reported cases with new posts. The disadvantage is that newly classified posts entering the system may unfortunately be compared with domain knowledge that is incomplete at the time. This would result in inaccurate comparisons between posts and reported cases, and also introduce inconsistencies in the relative confidences between posts.
This issue is solved by a passive approach, since there is a delay between posts being newly classified and the reporting system running the comparison process. This, however, results in a delay in the comparison, which is undesirable for time-sensitive applications such as selling stolen goods. When considering a passive approach, determining an appropriate amount of time that newly classified posts should be held before running a comparison is both difficult and domain specific. Given that an active approach allows for real-time analysis, it makes more sense to use this approach and attempt to address the issue of inconsistent domain information during the comparison process. Although this approach provides real-time analysis, it comes at the expense of a slight increase in processing overhead and a much larger storage overhead; each reported case must maintain a list of posts that have already been matched, along with each match's confidence and the domain knowledge version or date of comparison. Storing the domain knowledge version or date of comparison addresses the issue of using different domain information to compare different posts, allowing older posts to be re-compared as the domain information changes, maintaining consistent comparisons to reported cases.
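The per-case bookkeeping described above, matched posts stored with their confidence and the domain knowledge version used, could be sketched as follows; the names and structure are assumptions.

```python
# Sketch of per-case match bookkeeping for the active approach: each reported
# case records matched posts with their confidence and the domain knowledge
# version used, so stale comparisons can be redone. Names are assumptions.

def record_match(case_matches, post_id, confidence, domain_version):
    case_matches[post_id] = {"confidence": confidence,
                             "domain_version": domain_version}

def stale_matches(case_matches, current_version):
    """Posts compared under an older domain knowledge version,
    queued to be re-compared for consistency."""
    return [pid for pid, m in case_matches.items()
            if m["domain_version"] < current_version]

matches = {}
record_match(matches, "post-17", 84, domain_version=3)
record_match(matches, "post-42", 61, domain_version=4)
```

When the domain knowledge advances to version 4, only "post-17" here would be re-compared, keeping all stored confidences consistent with the current domain information.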
Although the storage overhead is rather large, reducing it would increase the processing overhead, which is more valuable. This storage overhead does raise the question of when a reported case should be abandoned. Ideally, reported cases would find appropriate matches and exit the system, but for reported cases where no acceptable match can be found in a reasonable time, there must be a way for these cases to exit the system so that they don’t cause unnecessary overhead. As the confidence of a match depends on the difference between the time of the theft and the time the post was made, after a long enough period even a perfect match would have very low confidence; potentially so low that it is indistinguishable from noise or extremely poor comparisons. Although this timeframe may be domain dependent, it would be reasonable to assume that if a case has not yielded a reasonable match after 2-3 months, it is unlikely it will. Additionally, any inferred information about the relationship between the post and the theft degrades over time; for instance, proximity can no longer be guaranteed. At this point it can be concluded that the item was stolen for personal reasons, the item was sold through a different medium, or the item was sold online and the system was unable to identify it. Potential reasons the system may have been unable to identify the matching post include the item being sold on a site not referenced by the framework, the lack of domain information at the time of comparison, the lack of parameters that would have identified the match, or the authorities being unable to pursue the lead after the matching post was correctly identified.
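The expiry rule could be sketched as a simple age check; the 90-day cutoff reflects the 2-3 month figure above and would in practice be domain dependent.

```python
# Sketch of the case-expiry rule: abandon a reported case once enough time
# has passed that any match confidence would be indistinguishable from noise.
# The 90-day cutoff is an illustrative assumption.
from datetime import date, timedelta

def case_expired(theft_date, today, max_age=timedelta(days=90)):
    return today - theft_date > max_age
```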
4.6 Experiments
4.6.1 Experiment – Active Search using the Reporting Database
In this experiment we wanted to determine how strong a match could be made with reported cases of stolen property. As such, we must construct reported data and either explicit or derived domain knowledge about the topic; for this experiment we use explicit domain knowledge of various phone brands, models, and specifications. This information is used to compute the percentile match between the reported data and the set of posts, lowering the system's confidence should the post contain multiple classification patterns. This comparison is done by simple keyword matching to reduce complexity; however, further NLP techniques could be applied to improve the accuracy of the comparison. Additionally, a 100% match enforcement policy is used to reduce the returned results. For this experiment we simulate the reported data using the following stolen property:
Type: Cell Phone
Manufacturer: Research in Motion
Model Family: Blackberry Bold
Model: 9900
Carrier: Bell
Color: Black
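The simple keyword matching with the 100% match enforcement policy could be sketched as follows; the report encoding and the sample post text are invented for illustration.

```python
# Sketch of the experiment's keyword matching: score a post by the fraction
# of report keywords it contains, and enforce a 100% match before returning
# it. The report encoding and sample post are invented.

report = {"type": "cell phone", "manufacturer": "research in motion",
          "family": "blackberry bold", "model": "9900",
          "carrier": "bell", "color": "black"}

def keyword_match(post_text, report):
    """Fraction of report keywords found in the post text (1.0 = full match)."""
    text = post_text.lower()
    hits = sum(1 for value in report.values() if value in text)
    return hits / len(report)

post = ("Selling my black Blackberry Bold 9900 on Bell, "
        "made by Research in Motion, great cell phone!")
score = keyword_match(post, report)  # 1.0, so it passes the 100% policy
```

This naive substring matching mirrors the experiment's deliberately simple comparison; as noted above, NLP techniques could replace it to improve accuracy.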
The results shown in Figure 4-10 are prototype output from matching posts against the criteria above, referencing the local copies of the posts without ranking them by percentage match. On the left-hand side is the price extracted from each post, followed by the location of the stored post.
Figure 4-10 - Results of the search process; displaying a rudimentary match with the posts price on
the left followed by the local location of the matching post on the right.
Based on these returned results, a human would need to analyze the posts to determine whether there was a sufficient match; however, this list could be further reduced by incorporating other primary and secondary attributes. At this point the system would need to be refined like any AI system, by recording which results the human determined were correct matches and which were incorrect. It should be noted that even if all the posts match, there will still be some decision criteria that the system is unaware of, which is why a feedback loop is required to help identify them.
4.7 Discussion
Another issue that we discovered is to what extent information should be presented to the user submitting the reported case. Although law enforcement agencies strongly advise people not to attempt to recover their stolen property without the assistance of the police, occasionally people still attempt to forcefully recover their property. As such, what information the reporting system displays is important, so that it does not act as a conduit for people to track down their stolen items when we cannot guarantee these are the matching stolen items. Thus there must exist a balance between presenting the user with enough information that they can confirm a match for the police, while not allowing them to contact the seller or retrieve the original post. Obfuscating a post's content before displaying any results to the user would be difficult but is required. Displaying any unique images to the user would be ideal, since these can easily be used to offload the refinement process from the authorities to the users; however, searchable unique characteristics such as names, contact numbers, or identifiable spelling mistakes or strings must be either obfuscated or redacted prior to being displayed. This topic requires additional research beyond the scope of this thesis; as such, it is left for future work.
Chapter 5
Framework Portability and Applications in Other Domains
In this chapter we will discuss how this framework can be applied to other domains and what rules must be altered to achieve similar results in these domains. The two domains we will look at are rental property scams and metropass scams.

In order to understand how these domains relate to the previous work, and why extending the framework to them is important, we will look at a couple of examples. First, consider a person looking for a vacation rental property; they would begin by looking for rental properties in their vacation destination. They may use a reputable site and look at reputable areas within the city, but ultimately they must contact the person posting the advertisement, either online or by telephone. They may ask for additional information regarding the property and, should both parties be satisfied with the agreed rates, the owner would most often ask for a deposit to reserve the property for the given time. Up to this point the prospective renter has assumed the person they contacted is the legitimate owner; they may even have asked specific questions in an attempt to confirm this, but they will not be able to confirm it until visiting the property, which would only occur at a later date. This scenario is the essence of the scam: the prospective renter must give the contact a deposit before being able to authenticate them as the legitimate owner.
The second example is a potential metropass scam, where a user is looking for a metropass at a reduced rate. First we must consider how a legitimate seller would advertise their metropass: they would likely set a reduced price in comparison to the retail market price in an attempt to sell the item and recover some of their initial investment. They may be very vocal about their eagerness to sell the item due to its time-sensitive nature, since its value decreases over time and it has a fixed lifespan. Advertisements for such items would also be very non-descriptive, since much of the information regarding the item is simple or implied. From the buyer's perspective, they cannot authenticate the item until they purchase it and attempt to use it. It is reported that fake metropasses cost the Toronto Transit Commission (TTC) close to 2 million dollars in 2012 [20], which shows the growing problem of scams in our society.
Both of these examples share the fact that an initial investment must be made before the item or property can be authenticated. These two domains also have the interesting characteristic that no one is actively searching to report these posts, and reporting them is much more difficult and at the discretion of the site. While people may attempt to report rental property scams involving their own properties, it is rather cumbersome and most often involves simply using Google to search for their address; this must be done periodically and doesn’t guarantee that every post will be caught. Similarly, metropass scams often go unreported because the transit provider lacks the manpower to actively search for these scams, and fraudulent costs are most often offset in the base ticket price or absorbed by the consumer who bought the scam metropass should it not work. This makes these domains, and domains that are easy to scam in general, very lucrative targets with no fixed end point; that is to say, we must constantly search for newly matching posts in order to report them, and no case can ever be closed.
Because scams offer such high reward for very little risk, it is not surprising that
there are a large number of scams in any given domain; as such, fitting this framework to
detect scams is valuable from both a social and a criminological standpoint.
Monitoring suspicious post activity, especially if the advertising websites are involved,
can enable criminal profiling based on linguistic analysis, post modification, and
movement within the site.
One of the main objectives of this research was to develop a domain-independent
framework that could easily be translated into any domain with very little effort. This is
one of the main reasons we have attempted to avoid a heavy dependence on domain
knowledge. This allows us to apply the approach to many other domains, notably scams
in which the buyer or the manufacturer is the victim. Some of the assumptions used to
detect potentially stolen items are no longer valid; for example, the eagerness to sell may
no longer be present, because the risk is lower when the item is not stolen. Two domains
that we want to test are metropass scams and rental property scams. While metropass
scams follow the previous framework very closely with respect to the fixed description
of the item, rental property scams require far more categorization than the previous
framework provided. We will not be discussing a complete framework in this section, but
instead how the previously described framework would be modified to tackle the target
domains.
5.1 Metropass Scams
We will first discuss detecting metropass scams, since this domain more closely
follows the structure of stolen property detection. Although the discussion focuses
primarily on Toronto metropass scams, it is comparable to any public transit system or
other fixed-service-oriented system.
5.1.1 Ad retrieval
This subsection identifies which sites are good candidates for harvesting posts to
analyze. As in the previous framework, sites such as Craigslist, Kijiji, etc. are all
good sources for metropass scams, since their intended audience is other customers and
the market demand for the item is not large enough to merit a dedicated site. On
Craigslist, the subcategories of tickets and general were the most common classifications
for metropasses.
5.1.2 Primary and Secondary Attributes
This subsection identifies the attributes that can be used during the suspiciousness
classification process. Unfortunately, due to the nature of the problem, many primary
attributes are no longer valid because they are tied to the event of a theft; for example,
date/time and proximity are no longer relevant, leaving only the price attribute.
Additionally, many if not all of the secondary attributes no longer apply. Since the
condition of the item is irrelevant, photos are unlikely to be included, and if they are, we
can expect stock images in the majority of cases. The seller's eagerness to sell will
likely also be very unreliable given the nature of the domain: since the potential risk is
drastically reduced for the scammer, we would expect their eagerness to be on par with
or below that of legitimate posts, because legitimate sellers actually face a potential
financial loss if there is no sale within a period of time. Finally, duplicate contact
information on multiple posts would normally indicate a potential scam, because the item
remains on the market for a prolonged period of time; however, it is unlikely the
scammer will make duplicate posts, since they can simply "sell" the item to every person
that contacts them and the item will never change.
5.1.3 Experiment
In our experiment we looked at 40 posts for Toronto metropasses on Craigslist
and Kijiji, comparing the asking price to the controlled market price after adjusting for
the amount of time left on the pass. Price histograms of this ratio can be seen below
in Figure 5-1 and Figure 5-2.
[Figure omitted: price histogram; x-axis: ratio of asking price to market price (70% to 105% and above, in 5% intervals); y-axis: number of occurrences.]
Figure 5-1 - Metropass price histogram using 5% intervals.
[Figure omitted: price histogram; x-axis: ratio of asking price to market price (55% to 105%, in 2.5% intervals); y-axis: number of occurrences.]
Figure 5-2 - Metropass price histogram using 2.5% intervals and extended lower bound.
These results were unexpected and differ from other domains in that the pricing
is very close to the market price. Due to the lack of data we did not differentiate
regular metropasses from student metropasses, since their target market would be the
same regardless; had this differentiation been made, the prices would have been
even more heavily skewed towards the fixed market price. We assume that the selling
prices were very high either because sellers anticipated haggling or simply because
potential buyers may have no alternative. Since this is a transportation service, it can be
assumed that potential buyers must use the service regardless and must either buy weekly
passes or a new monthly pass in the middle of the month; as such, more aggressive
pricing models can be used. Given these results, it is clear that price detection cannot be
reliably used in all domains, and in this case derived suspicious keyword strings would
likely be required. Upon reviewing the outlying posts at the lower end, both appear
within the expected range of a traditional domain's market price distribution and
displayed no characteristics of suspicion.
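The price comparison used in this experiment can be sketched as follows. This is a minimal illustration rather than the thesis implementation; the pro-rating formula, the sample posts, and the binning details are all assumptions made for the example:

```python
from collections import Counter

def price_ratio(asking, market_monthly, days_left, days_total=30):
    # Asking price relative to the market price pro-rated by remaining time.
    return asking / (market_monthly * days_left / days_total)

def ratio_histogram(ratios, bin_width=0.05):
    # Count ratios into fixed-width bins keyed by each bin's lower bound.
    return Counter(round(int((r + 1e-9) / bin_width) * bin_width, 4) for r in ratios)

# Hypothetical posts: (asking price, monthly market price, days left on pass).
posts = [(95.0, 100.0, 30), (70.0, 100.0, 21), (98.0, 100.0, 30)]
hist = ratio_histogram([price_ratio(a, m, d) for a, m, d in posts])
```

Anomalies would then show up as bins well below 100% of the market price; in the metropass experiment almost all of the mass sat near the top bins instead.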
5.2 Rental Property Scams
We will now discuss detecting rental property scams. This domain differs more
from the previous framework and requires more initial domain knowledge.
5.2.1 Ad Retrieval
This subsection identifies which sites are good candidates for harvesting posts to
analyze. Unlike the previously discussed domains, which used generic classified ad
sites such as Craigslist or Kijiji, the property rental and vacation rental domains are a
large enough industry that dedicated sites exist. Sites such as homeaway.com and
vacationrentals.com can be used in addition to the previously mentioned vacation rental
subsections on classified ad sites.
5.2.2 Classification
This subsection discusses how classifications can be derived. For the
classification process to function properly, initial domain knowledge is needed to
manipulate the posts so that proper classifications can be derived. One of the most
important pieces of information needed is geographical context, such as which cities
belong to which provinces/states and which provinces/states belong to which countries.
This information is required due to contextual implications inherent in the posts; for
example, users may simply advertise their address with the implication that it is in the
GTA when posting on the Toronto subdomain of Craigslist. Although this does not seem
very important, it is required for building a hierarchical tree within the classification
process.
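As an illustration, the hierarchy and the implicit-context fallback might look like the sketch below; the gazetteer contents, the function name, and the default-city rule are assumptions for this example, not part of the implemented framework:

```python
# Hypothetical gazetteer: country -> province -> cities.
GEO = {
    "Canada": {
        "Ontario": ["Toronto", "Oshawa", "Mississauga"],
        "Quebec": ["Montreal", "Gatineau"],
    }
}

def resolve_location(text, default_city):
    """Return (country, province, city) for a post, using a city named in the
    text when one is recognized and otherwise falling back to the implicit
    context of the subdomain the post appeared on (e.g. Toronto)."""
    lowered = text.lower()
    for country, provinces in GEO.items():
        for province, cities in provinces.items():
            for city in cities:
                if city.lower() in lowered:
                    return (country, province, city)
    # Implicit context: no city was named, so assume the subdomain's city.
    for country, provinces in GEO.items():
        for province, cities in provinces.items():
            if default_city in cities:
                return (country, province, default_city)
    return (None, None, default_city)
```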
The second most important piece of information needed is the format in which
these locations are identified. While many other domains can get away with simple
keyword extraction, this domain requires interaction with domain information to know
how to select these keywords. This can be done either with an n-gram selection model
after finding the largest classification of location, or with more complex rule sets that
must select an item from each location category.
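A longest-match n-gram selector of the kind mentioned above might be sketched as follows; the gazetteer entries and the whitespace tokenization are illustrative assumptions:

```python
KNOWN_LOCATIONS = {"quebec city", "north york", "toronto"}  # hypothetical gazetteer

def ngrams(tokens, n):
    # All contiguous n-token phrases, in order of appearance.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_locations(text, max_n=3):
    # Prefer the longest matching n-gram so that "quebec city" wins over
    # a bare unigram match such as "quebec".
    tokens = text.lower().split()
    for n in range(max_n, 0, -1):
        hits = [g for g in ngrams(tokens, n) if g in KNOWN_LOCATIONS]
        if hits:
            return hits
    return []
```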
5.2.3 Clustered Data
Once location knowledge is part of the classification process, we can expect
classifications based on country, province/state, and city. However, this classification is
not precise enough to accurately compare prices, due to the large variance in property
value across different areas of a city. We would thus have very large clusters in which
prices are not directly comparable, owing to a skewed market average and a high
standard deviation.

To address this, much more granular clustering must be done before comparing
prices; however, this clustering does not need to be done in the classification process.
Additional domain knowledge can be imported to indicate which areas, streets, etc. are
adjacent to one another, providing a graph onto which average prices and predictive price
gradients can be overlaid.
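Such an adjacency graph could be represented as simply as the sketch below, where a post's price is compared against a baseline drawn from its own area and the areas adjacent to it; the neighbourhood names and rates are entirely hypothetical:

```python
# Hypothetical adjacency graph of areas within one city, with average rates.
ADJACENT = {
    "Downtown": ["Midtown", "Harbourfront"],
    "Midtown": ["Downtown", "Uptown"],
    "Harbourfront": ["Downtown"],
    "Uptown": ["Midtown"],
}
AVG_PRICE = {"Downtown": 180.0, "Midtown": 140.0, "Harbourfront": 200.0, "Uptown": 110.0}

def local_baseline(area):
    # Average nightly rate over an area and its adjacent areas, giving a
    # granular local baseline instead of a skewed city-wide mean.
    areas = [area] + ADJACENT.get(area, [])
    return sum(AVG_PRICE[a] for a in areas) / len(areas)
```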
5.2.4 Primary and Secondary Attributes
This subsection identifies the attributes that can be used during the suspiciousness
classification process. As in the previous domain, the date/time and location
attributes are no longer applicable; however, many secondary attributes are applicable,
such as stock photos, duplicated descriptions, and contact information.

Since photos give potential renters a substantial amount of information about a
property, it is quite uncommon for an advertisement not to include a photo of the
property. Scam posts must therefore take the images they provide from somewhere, and
if a duplicate image is found then we know something is wrong with one of the two posts
containing it. A similar situation exists for the property description: since it is much
easier for the scammer to simply copy a user's description than to create a unique one,
large duplications in descriptions would likely indicate that the description is copied. As
in the stolen property domain, a user's contact information appearing on a large number
of posts is an indication that there may be an issue. This also allows for transitive
suspiciousness between posts, should the other posts by the user contain duplicated
descriptions or stock images.
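The transitive linking described above can be sketched as grouping posts by any shared feature (contact information, a description fingerprint, or an image hash); the feature encoding here is a simplification assumed for illustration:

```python
from collections import defaultdict

def link_posts(posts):
    """Map each post id to the set of other posts it shares a feature with;
    suspicion raised on one post can then transfer to its linked posts."""
    by_feature = defaultdict(set)
    for post_id, features in posts.items():
        for feature in features:
            by_feature[feature].add(post_id)
    linked = defaultdict(set)
    for ids in by_feature.values():
        if len(ids) > 1:
            for post_id in ids:
                linked[post_id] |= ids - {post_id}
    return dict(linked)
```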
5.2.5 Experiment
In our experiment we looked at posts for vacation rentals that included pictures,
taken from the Craigslist sites for most major cities in Canada. This involved analyzing
roughly 6000 images from 800 posts, hashing the images and mapping them to their
respective posts; 1130 images were found to be included in multiple posts. Given these
numbers we can tell that most posts do not contain duplicate images; however, based on
Figure 5-3 we can see that the majority of image duplication exists between two posts,
while higher degrees of duplication are also present.
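The hashing step can be sketched as below. We assume exact byte-level hashing here (a perceptual hash would additionally catch resized or re-encoded copies); the post structure is illustrative:

```python
import hashlib
from collections import defaultdict

def duplicate_images(posts):
    """Given {post_id: [image bytes, ...]}, return each image digest that
    appears in more than one post, mapped to the sorted list of those posts;
    the list length is the degree of duplication."""
    seen = defaultdict(set)
    for post_id, images in posts.items():
        for img in images:
            seen[hashlib.md5(img).hexdigest()].add(post_id)
    return {h: sorted(ids) for h, ids in seen.items() if len(ids) > 1}
```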
Many of the examined cases of 2nd-degree duplication were simply from users
recreating their posts after a few weeks; this subset is where we would expect to find
cases of vacation rental scams, although due to the volume of cases we were not able to
examine all of them. Larger degrees of duplication most often came from companies
advertising their rental property across multiple Craigslist regions, most often for
locations remote from or external to the advertising location.
Although this experiment does not directly show conflicting listings using the same
images, it does show the vast difference in how users behave across domains.
Users in this domain frequently recreate their posts, many times on multiple sites and in
adjacent locations; as such, identifying cases of rental property scams would require
associating unique images with specific users, which was not done in this experiment.
[Figure omitted: pie chart; duplication across two posts accounts for the majority of cases (58.85%), with progressively smaller shares for higher degrees of duplication.]
Figure 5-3 - Percentage of image duplications on multiple posts; the legend indicates the degree of duplication.
5.3 Discussion
These domain extensions can be applied in a slightly different manner than the
previously covered topic of stolen property detection and recovery, largely due to their
more abstract nature. In the domain of stolen property detection and recovery the
emphasis is on reactive recovery of the item, with limited preventative measures also in
place, whereas in the scam domains there is much more value in proactive measures to
prevent the sale. This makes the target audiences for the two domains different: scam
prevention mechanisms are much more beneficial for websites, while item recovery
mechanisms are much more beneficial to the police. These are not mutually exclusive;
websites benefit from the prevention and recovery of stolen items, especially from a
public relations standpoint, as well as from a reduction in customer complaints when
stolen items or scams are discovered, while law enforcement agencies will have fewer
reported cases of scams with prevention mechanisms in place. The approach can thus
serve different goals: websites can apply it to warn their users of potential scams, while
the police can apply it to track scammers across domains.
Unfortunately, all domains are subject to the arms race between law enforcement
agencies and criminals. This is especially true in domains such as scams, where there is
very low initial risk and moderate to high payoff, which makes defeating the
countermeasures in place a high priority. From a price standpoint, which is currently a
major attribute in the detection process, there exist end points beyond which prices are
not detected because they fall within the normal range. If the seller increases the price to
avoid detection, the item's price becomes too close to that of a reputable provider, and
potential buyers would choose the reputable provider instead. If the seller decreases the
price to avoid detection, the item's price becomes so low that people recognize it as a
potential scam, or the scam is no longer profitable in the case of counterfeit goods.
Because of this limitation, it is much more likely that technical components will be
targeted for exploitation so that scams are not classified as suspicious. Examples would
include attempting to confuse the classification system into misclassifying the item,
obfuscating the decision parameters, etc.
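The detection window implied by these end points can be expressed as a simple band check; the thresholds below are purely hypothetical and would have to be fit per domain:

```python
def price_suspicious(ratio, lower=0.30, upper=0.80):
    # Only asking/market ratios inside the band are flagged: above `upper`
    # the scam must compete with reputable providers, and below `lower`
    # buyers assume a scam (or a counterfeit stops being profitable).
    return lower <= ratio < upper
```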
Chapter 6
Conclusion and Future Work
In this chapter, we give a brief summary of the work done and future research
directions for this topic.
6.1 Conclusion
This thesis presented a complete stolen property detection and recovery
framework. First, we proposed a method of automating the collection of classified ads
from popular third-party sites such as Craigslist, Kijiji, etc., and processing them into
dynamic classifications based on the trends that exist in the data. This process addressed
some of the issues that can occur when attempting to crawl the target sites, introducing
caching and indexing techniques to reduce the overhead of the process. Second, we
designed a clustering scheme that analyzes posts to decide whether they should be
flagged as suspicious, and that analyzes clusters of suspicious posts to extract new trends
that can identify suspicious posts with more confidence. This portion of the proposed
framework takes advantage of the amount and nature of text in an online medium,
accounting for spelling and grammatical errors while still effectively extracting
classification trends; it also identifies and addresses the fact that we know very little
about quantifiable characteristics that can be used to identify stolen items online. Finally,
we proposed a reporting framework that allows users to submit reports of stolen items,
which are searched against suspicious posts in an attempt to find their stolen property;
results with sufficient confidence would be forwarded to the police. This approach takes
advantage of the structure and information in the domain knowledge, allowing user input
to be stepwise, while also addressing scenarios where domain knowledge is missing.
Compared to other attempts to solve this issue, the proposed method is comprehensive,
requires less human intervention, and is modular with respect to the target domain.
Initial experiments show promising results for both single and multiple classification
pattern accuracy, successful identification of suspicious price clusters, and rule
extraction from suspicious clusters. We also tested the framework's portability to the
domain of scams, specifically metropass scams and vacation rental scams. The
framework's application in other domains identifies how it must be altered to fit a new
domain, and the results identify fundamental assumptions that no longer hold in the new
domain compared to the traditional selling of stolen goods. We show the benefits to both
websites and law enforcement agencies should they implement this framework:
increasing websites' reputability while reducing the number of customer complaints
when stolen items or scams are discovered, and reducing the load on law enforcement
agencies by having an automated system in place to find the most likely lead.
6.2 Future Work
One of the more prominent directions this research has revealed is the ability to
use an AI model to extract trends from suspicious posts. Expert systems used to detect
fraud or other crimes are often very domain specific and require extensive training data;
the current system also requires a lot of training data, but that data can be acquired
automatically from online sources. This allows the system to be more flexible with
respect to its target domain while still extracting both general theft trends and
domain-specific theft trends. This flexibility also allows trends to be specific to certain
levels of the criminal hierarchy, since the identifiable characteristics change from
low-level thieves to professional fences.
Another area of interest is the idea of using reactive market trend analysis to
request or retrieve updates to domain expertise. Considering the issues discussed
previously, market prices can drop very rapidly around the announcement or release of
new models. Such a drop serves as an indication to request or attempt to retrieve
information about the new models at that time, thus speeding up the classification
process by resolving the lack of domain knowledge about new trends.
Additionally, the concept of secondary attributes was mentioned but not fully
explored in this thesis due to the difficulty of implementing them; they tie heavily into
the rule extraction and the accuracy of the suspiciousness classification. Many of these
approaches do not require the full application of NLP and constitute some of the human
logic subconsciously used in deciding whether an item is suspicious. Detecting the
seller's eagerness to sell was one of the more complex attributes, but also one of the
more promising and universal attributes for all domains where stolen items are being
sold.
Finally, one of the goals of this work was to extend the classification of suspicious
and non-suspicious data sets in order to train other classification and detection
frameworks. This can be used to bootstrap expert systems for emerging domains, owing
to its domain portability.
References
[1] Wikipedia contributors, “Fence (criminal),” Wikipedia, May 9 2014. [Online]. Available: http://en.wikipedia.org/wiki/Fence_%28criminal%29 [Aug 15, 2014]
[2] J. Fuller, “How eFencing Works,” howstuffworks, Aug 2014. [Online]. Available: http://computer.howstuffworks.com/efencing.htm [Aug 15, 2014]
[3] “Tor,” Torproject, Aug. 2014. [Online]. Available: https://www.torproject.org/ [Aug 15, 2014]
[4] “Oregon man recovers stolen bike after sting operation,” Fox News, Aug. 2012. [Online]. Available: http://www.foxnews.com/us/2012/08/16/oregon-man-recovers-stolen-bike-after-sting-operation/ [Oct. 1, 2012]
[5] D. Moye. “Kenneth Schmidgall Tracks Down Stolen IPhone, Fights The Guy Who Has It (VIDEO),” Huffington Post, Aug. 2013. [Online]. Available: http://www.huffingtonpost.com/2013/01/08/kenneth-schmidgall-stolen-iphone_n_2433101.html [Feb. 1, 2014]
[6] Stolen911.com, Aug. 2014. [Online]. Available: http://stolen911.com/ [Oct. 1, 2014]
[7] J. Treadwell, “From the car boot to booting it up? eBay, online counterfeit crime and the transformation of the criminal marketplace,” in Criminology and Criminal Justice , Vol. 12(2), April 2012, pp. 175-191
[8] D. Lieberman and L. Effron, "Is Your Stolen Stuff on Craigslist? Here's What to Do", ABC News, Oct. 2011. [Online]. Available: http://abcnews.go.com/blogs/technology/2011/10/found-your-stolen-stuff-on-craigslist-tips-on-what-to-do-2/ [Oct. 1, 2012]
[9] “Stolen Blackberry Q10”, Stolen911.com, July. 2014. [Online]. Available: http://stolen911.com/category/1063/Stolen-Blackberry-Smartphones/listings/18658/Stolen-Blackberry-Q10.html [July 15, 2014]
[10] “LOST BLACKBERRY SMARTPHONE”, Stolen911.com, July. 2014. [Online]. Available: http://stolen911.com/category/1063/Stolen-Blackberry-Smartphones/listings/19251/LOST-BLACKBERRY-SMARTPHONE.html [July 15, 2014]
[11] E. Brill, “A simple rule-based part of speech tagger,” in Proceedings of the third conference on Applied natural language processing, Trento, Italy, 1992, pp. 152-155
[12] M. Marcus et al., “The Penn Treebank: Annotating Predicate Argument Structure,” In Proceedings of the workshop on Human Language Technology , Plainsboro, NJ, 1994, pp. 114-119
[13] J. Yi et al., “Sentiment analyzer: Extracting sentiment about a given topic using natural language processing techniques,” in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, 2003, pp. 427-434
[14] D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” in Lingvisticae Investigationes, Vol. 30(1), 2007, pp. 3-26
[15] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” in Information Processing and Management, Vol. 24(5), 1988, pp. 513-523
[16] A. Shiri et al., “Thesaurus-enhanced search interfaces,” Journal of Information Science, Vol. 28(2), 2002, pp. 111-112
[17] “Boost C++ Libraries,” Boost, Aug. 2014. [Online]. Available: http://www.boost.org/ [Feb. 2013]
[18] “GSL – GNU Scientific Library,” GNU, Aug. 2014. [Online]. Available: http://www.gnu.org/software/gsl [March 2013]
[19] Craigslist, Aug. 2014. [Online]. Available: http://toronto.en.craigslist.ca/moa/ [Feb. 1, 2013]
[20] C. Mills, “Fake Metropasses and tokens cost TTC close to $2M last year,” Toronto Star, Feb. 17, 2013. [Online]. Available: http://www.thestar.com/news/gta/2013/02/17/fake_metropasses_and_tokens_cost_ttc_close_to_2m_last_year.html [Aug. 15, 2014]