E-Fencing Detection: Mining Online Classified Ad Websites for Stolen Property
by
Christopher Carver
A Thesis Submitted in Partial Fulfillment
of the Requirements for the Degree of
Master of Science
in
Computer Science (MSc)
The Faculty of Business and Information Technology
University of Ontario Institute of Technology
September, 2014
©Christopher Carver, 2014
Abstract
With the emergence of e-fencing, there is a need to automate both the detection of ads selling stolen property and the reporting process for victims. This thesis presents a framework that dynamically retrieves and classifies online ads using artificial intelligence (AI) to minimize the amount of domain knowledge required. The framework evaluates these ads against known characteristics of theft and extracts new characteristics from suspicious ads. In conjunction with a reporting system, it allows users to report incidents of theft and matches those reports to previously classified ads. The framework was designed to be domain portable, allowing rapid adaptation to other domains. Experiments showed promising results: correctly classifying single- and multiple-trend datasets, revealing anomalies in price histograms, and extracting potential patterns that explain price variance. Experiments on other scam-prone domains produced unique results that contradict some fundamental assumptions about the behavior of thieves.
Keywords: e-fencing; stolen property; classification problem; genetic algorithms; rule
extraction
Acknowledgments
I would like to express my deepest gratitude to my supervisor, Dr. Xiaodong Lin, for his useful comments, patience, guidance, and engagement throughout this master's thesis. I learned many valuable lessons from him; it was an honor to work with him.
Table of Contents
Abstract ................................................................................................................................ ii
Acknowledgments............................................................................................................... iii
List of Figures ..................................................................................................................... vi
List of Abbreviations ........................................................................................................ viii
Chapter 1 Introduction ................................................................................................... 1
1.1 Background and motivation ................................................................................. 1
1.2 Objectives and methodologies ............................................................................. 4
1.3 Contributions ........................................................................................................ 6
1.4 Thesis organization .............................................................................................. 8
Chapter 2 Related Work ................................................................................................ 9
Chapter 3 E-Fencing Detection Framework ................................................................ 14
3.1 Ad Retrieval ....................................................................................................... 18
3.2 Ad Classification ................................................................................................ 19
3.2.1 AI Model ..................................................................................................... 19
3.2.2 Chromosome Representation ...................................................................... 21
3.2.3 Fitness Overview......................................................................................... 22
3.2.4 Fitness Function .......................................................................................... 26
3.2.5 Population Transition .................................................................................. 29
3.2.6 Termination Condition ................................................................................ 30
3.2.7 Data Clustering ........................................................................................... 31
3.3 Rules and Rule Database .................................................................................... 32
3.3.1 Application of Primary Attributes .............................................................. 34
3.3.2 Application of Secondary Attributes .......................................................... 37
3.3.3 Rule Extraction ........................................................................................... 40
3.4 Reporting Database ............................................................................................ 41
3.5 Experiments........................................................................................................ 44
3.5.1 Experimental Environment ......................................................................... 44
3.5.2 Experiment – Online Post Classification Accuracy .................................... 45
3.5.3 Experiment – Price Extraction and Averaging ........................................... 49
3.5.4 Experiment – Rule Extraction on Suspicious Clusters ............................... 52
3.6 Discussion .......................................................................................................... 57
3.6.1 Ad retrieval ................................................................................................. 57
3.6.2 Price Extraction........................................................................................... 57
3.6.3 Rule Extraction ........................................................................................... 59
Chapter 4 Reporting System ........................................................................................ 60
4.1 Design................................................................................................................. 60
4.2 Domain Knowledge Structure ............................................................................ 65
4.3 Searching process & Indexing Classified Posts ................................................. 67
4.4 Updating Domain Information ........................................................................... 70
4.5 Periodic re-checking user reported cases ........................................................... 73
4.6 Experiments........................................................................................................ 75
4.6.1 Experiment – Active Search using the Reporting Database ....................... 75
4.7 Discussion .......................................................................................................... 76
Chapter 5 Framework Portability and Applications in Other Domains ...................... 77
5.1 Metropass Scams ................................................................................................ 79
5.1.1 Ad retrieval ................................................................................................. 79
5.1.2 Primary and Secondary Attributes .............................................................. 79
5.1.3 Experiment .................................................................................................. 80
5.2 Rental Property Scams ....................................................................................... 81
5.2.1 Ad Retrieval ................................................................................................ 81
5.2.2 Classification............................................................................................... 82
5.2.3 Clustered Data............................................................................................. 82
5.2.4 Primary and Secondary Attributes .............................................................. 83
5.2.5 Experiment .................................................................................................. 83
5.3 Discussion .......................................................................................................... 85
Chapter 6 Conclusion and Future Work ...................................................................... 86
6.1 Conclusion.......................................................................................................... 86
6.2 Future Work ....................................................................................................... 87
References ......................................................................................................................... 89
List of Figures
Figure 2-1 - An example of a reported case of theft from [9];.......................................... 11
Figure 2-2 - An example of a reported case of theft from [10];........................................ 11
Figure 3-1 – Proposed framework overview; ................................................................... 16
Figure 3-2 – Visual representation of classification pattern sizes and their intersections. 24
Figure 3-3 - Frequency analysis of words in training post set; ......................................... 25
Figure 3-4 – Harmonic mean of F2 and F3. ....................................................................... 28
Figure 3-5 – Harmonic mean of F1 and F2. ....................................................................... 28
Figure 3-6 - Graph of passed vs. failed chromosome translations based on Equation (6)............................................................................................................................................ 29
Figure 3-7 – The average population fitness compared to the best member's fitness over 5 iterations. .................................................................................................... 30
Figure 3-8 – An example of how data clustering creates a tree structure. ........................ 31
Figure 3-9 – Price histogram of Blackberry Bold 9900 posts with an inverse Gaussian distribution of suspicion.................................................................................................... 36
Figure 3-10 - Diagram of process of the framework undergoes while determining suspicious posts................................................................................................................. 44
Figure 3-11 - Non-uniform distribution of training set posts; .......................................... 45
Figure 3-12 - Highest fitness classification patterns for three iterations; ......................... 46
Figure 3-13 - Uniform distribution of training set posts;.................................................. 47
Figure 3-14 – Example of post containing aliasing information for “Blackberry” and “BB” based on extracted classification patterns. .............................................................. 49
Figure 3-15 – Price histogram of Blackberry Bold 9900;................................................. 49
Figure 3-16 – More granular price histogram of Blackberry Bold 9900; ......................... 50
Figure 3-17 – Blackberry Bold 9900 price histogram resembles that of the function |sin(x)/x|. ........................................................................................................................... 51
Figure 3-18 – A post from a suspicious cluster that has a low price and indicates that only “cash pick up” is acceptable. ............................................................................................. 52
Figure 3-19 – Post of suspicious cluster that uses a stock photo and requires that the transaction occurs in person. ............................................................................................. 52
Figure 3-20 – Price histogram of IPhone 4s; .................................................................... 53
Figure 3-21 – An example of possible semantic analysis;................................................ 58
Figure 4-1 – Fully annotated relations between nodes of the domain information tree. .. 61
Figure 4-2 – Limited annotated relations between nodes of the domain information tree............................................................................................................................................ 61
Figure 4-3 – Displaying user input process as it traverses down the domain information tree;.................................................................................................................................... 63
Figure 4-4 – An example of the table structure and potential table input for report information attributes. ....................................................................................................... 65
Figure 4-5 – This result displays the input requirements based on which node is selected. ........................................................................................................................................... 66
Figure 4-6 – The domain knowledge tree that was used for Figure 4-5. .......................... 66
Figure 4-7 - A graph of Equation (7) displaying the confidence as the number of transitions increases. ......................................................................................................... 68
Figure 4-8 – An example of how users would select the “Other” option should their desired category not exist within the domain knowledge tree. ......................................... 72
Figure 4-9 – An example of how attribute prediction would function; ............................ 72
Figure 4-10 - Results of the search process; ..................................................................... 75
Figure 5-1- Metro pass histogram using 5% intervals. ..................................................... 80
Figure 5-2- Metropass histogram using 2.5% intervals and extended lower bound. ........ 80
Figure 5-3 - Percentage of image duplications on multiple posts;.................................... 84
List of Abbreviations
AI Artificial Intelligence
BWNT Brand New With Tags
BT Broad Term
DE Differential Evolution
NER Named Entity Recognition
NLP Natural Language Processing
NT Narrow Term
PoS Parts of Speech
RT Related Term
TF-IDF Term Frequency – Inverse Document Frequency
Chapter 1
Introduction
1.1 Background and motivation
With the recent shift towards Internet-based media, new venues for selling
goods have emerged. Businesses often maintain a web presence, and many additionally
offer ways to shop online through their sites. Consumers can sell their new or used
goods through a variety of dedicated sites such as eBay, Kijiji, and Craigslist, while also
using social media platforms such as Facebook or Twitter to advertise these items. While
in traditional media business-to-consumer and business-to-business sales accounted
for a majority of the market, with the advance of the Internet consumer-to-consumer sales
have begun to increase. Due to the rise in the perceived legitimacy of consumer-to-consumer
transactions, in addition to the maturity of the Internet, a majority of these
transactions now occur online. Unfortunately, illegal activities have also migrated to this new
medium, which provides more anonymity than traditional methods of selling stolen goods such
as flea markets or pawn shops. These traditional outlets performed, knowingly or otherwise,
the functional role of a fence, which is described as buying stolen property for the
purpose of later resale [1].
As online shopping has gained immense popularity in recent years, criminals have
also started disposing of stolen goods on the Internet using sites such as cash4gold.com,
eBay, and Craigslist. This forms a new type of fencing called E-Fencing [2], which is the
act of fencing on the Internet. Given the benefits of this new medium, it is not surprising
that many stolen goods are sold on these sites. In today's Internet marketing era there
are various reasons why criminals choose to sell their stolen goods online rather than
at flea markets or pawn shops. First, these websites can be a great place to reach large
audiences, and as a result, stolen goods are sold quickly; additionally, the large volume of
similar ads slightly obfuscates the seller due to the large variation in the amount of detail
and asking prices. Since the criminals must balance the risk of holding onto the item for a
prolonged period of time against the risk of having their price cause suspicion, the overall
risk decreases as the audience and number of similar ads increases, since there is more
potential for a sale with reduced suspicion provided the price conforms to the market's
price distribution. Second, and most importantly, it is not very hard to mask one's identity
online and remain anonymous. For example, a criminal could use a third-party IP address
via a proxy and surf the web anonymously; the criminal could also use pseudonyms
online to conceal his or her real identity. With the advent of software packages like Tor [3],
users can dynamically route traffic through a series of proxy servers, making it
extremely difficult to trace. This issue is further complicated by sites obfuscating or
aliasing users within their databases, such that the same user on different sites may show
no correlation.
However, in the past, people could sometimes recover their stolen goods by
searching local flea markets or pawn shops for their property, then reporting to the police
where their property was located. The police could then track down the criminal by the
fingerprints or identification left when the goods were brought to the shop. In a similar
fashion, there are accounts of people manually searching Craigslist in an attempt to find their
stolen goods, but this is very tedious. In these cases, people manually search
Craigslist or other consumer-to-consumer selling sites, inputting the details of
their stolen item and analyzing the search results; they compare each post's date and time
to the time of the theft and look for any unique characteristics in the images that
would identify their specific item, often relying on gut instinct. This
last aspect is both quite important and very difficult to quantify, since we do not
know what characteristics make an item appear suspicious to the subconscious. There
have been a number of moderately successful cases of users manually recovering their
stolen property, such as Jake Gillum [4], who, after discovering his bike had been stolen,
began searching online sites to see if it would show up. Many such documented
cases involved aesthetic identifiers that differentiated the item from its standard model, such as
user modifications or identifying marks of wear and tear. There are other methods of
tracking down lost or stolen property; many modern electronics such as cellphones have
tracking down lost or stolen property; many modern electronics such as cellphones have
the ability to install GPS tracking applications. However, this often must be done
proactively, and it does not give police a contact method to further investigate beyond
the physical location. There have been accounts of people contacting the police stating
that they know where their stolen phone is located, but this may not give the police
sufficient grounds to confront the person and search them. As a result, cell
phones are quite lucrative items to steal, due to their wide availability, high value, and the large
volume in which they are sold.
Sadly, there are many accounts of people who successfully find their stolen item but
encounter considerable resistance from the police department, especially in larger cities. This
resistance is understandable given the large number of reports the police receive regarding
stolen property. Often the victim will stage a meeting with the perpetrator, only to
have the police arrive hours later. For example, Jake Gillum, who tracked
down his stolen bike on Craigslist, had to wait 40 minutes for the police to arrive
while he stalled the seller. This illustrates the complexity of attempting to recover stolen
property: because of the large volume of reported cases, many police departments are simply
overwhelmed and victims are frustrated. This can result in victims feeling they are
not being helped and attempting to steal back their property against the recommendation
of the police. An example of this is Kenneth Schmidgall [5], who tracked down
his stolen iPhone but was badly beaten when attempting to confront the criminal.
Sites such as stolen911.com [6] have attempted to alleviate some of these issues
by assisting with the searching process. However, this system is restricted to searching
Craigslist and does not cross-reference other posting sites. Additionally, it does not appear
that any search processing is done to return more accurate listings related to the
user's reported case; it simply returns all listings that match the user's search criteria. This
leaves the victim and police to search and compare the results against the reported
stolen item. It should be noted that their search engine leverages the Google Custom
Search Engine to target the Craigslist site. Although Google's inference and search
logic is very powerful, it is not domain specific, nor does it leverage domain
expertise to return more accurate results. This is simply due to the structure and scope of
Google's search interface, which does not generate a domain-specific hierarchy that would
help classification or result reduction. Many issues arise because users leverage
search engines, such as posts advertising many models in order to be returned in multiple
searches.
Considering these issues, there is a need for a system that automatically
interfaces with the police and the victim, managing the victim's expectations and informing
them of the status of the investigation, while also reducing the load on police by automating the
searching process. This would reduce the likelihood of the victim
personally attempting to recover their property while increasing the police's ability to
respond to active leads in time-sensitive theft cases, where the thief is expected to
attempt to sell the item as quickly as possible.
1.2 Objectives and methodologies
The first objective of this thesis is to propose a framework that automates the
searching process victims encounter when trying to recover their stolen property. To
achieve this and simplify the problem, we broke it into two portions: a classification
portion and a comparison portion. The classification portion is responsible for gathering
and classifying online posts, while the comparison portion is responsible for gathering
users' reported cases and matching them to the classified posts. While both of these are
comprised of sub-components, discussed in greater detail under later objectives, the
classification portion has three main components handling ad retrieval and domain
classification, suspiciousness classification, and rule extraction. The comparison portion
has two main components handling user input of reported stolen items and matching
reported cases of stolen items to classified posts.
The second objective of this thesis is to design an automated approach for
classifying items being sold online. To achieve this, we leveraged an
artificial intelligence approach to identify patterns within the data clusters and arrange
these clusters in a hierarchical manner. A complete analysis of applicable AI models is
covered in a later section, but we chose to use differential evolution encoded with
keywords to extract classification patterns. Once these patterns were extracted, their
position in the hierarchy was most often determined by their length, because the extraction
process chooses narrower keywords as the classification pattern extends. Thus, with
each iteration of the classification pattern, we transition to a more descriptive
subset.
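To make the evolutionary extraction of keyword patterns concrete, the idea can be sketched as follows. This is a simplified, genetic-style illustration, not the actual differential evolution encoding or fitness function detailed in Sections 3.2.2 to 3.2.4; the sample posts, vocabulary, and the coverage-minus-false-positive fitness are invented for the example.

```python
import random

def matches(pattern, post):
    """A post matches a pattern if it contains every keyword in the pattern."""
    return all(kw in post for kw in pattern)

def fitness(pattern, positives, negatives):
    """Reward patterns that cover the target posts but not the control posts."""
    tp = sum(matches(pattern, p) for p in positives)
    fp = sum(matches(pattern, p) for p in negatives)
    return tp / len(positives) - fp / len(negatives)

def evolve(vocab, positives, negatives, pop_size=20, generations=50, seed=0):
    """Evolve keyword sets toward patterns that separate positives from negatives."""
    rng = random.Random(seed)
    # Each chromosome is a small set of keywords drawn from the vocabulary.
    pop = [frozenset(rng.sample(vocab, 2)) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda c: fitness(c, positives, negatives),
                        reverse=True)
        survivors = scored[: pop_size // 2]   # elitism: keep the fitter half
        children = []
        for parent in survivors:
            child = set(parent)
            if rng.random() < 0.5:            # mutate: swap one keyword
                child.discard(rng.choice(sorted(child)))
                child.add(rng.choice(vocab))
            children.append(frozenset(child))
        pop = survivors + children
    return max(pop, key=lambda c: fitness(c, positives, negatives))
```

Run against a handful of Blackberry posts versus unrelated control posts, the search tends to converge on patterns such as {"blackberry", "bold"}, which cover every target post and none of the controls.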
The third objective of this thesis is to design an automated approach for identifying
potentially stolen items. Although very similar to the second objective, this one is inherently
different because we are attempting to quantify suspiciousness. Drawing on some of
the techniques from law enforcement, we attempt to identify primary and secondary
attributes of suspiciousness, the most predominant being selling price. Using the price
attribute, we cluster the posts into varying price ranges and look for anomalies
within the price distribution.
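The price-based anomaly search can be illustrated with a short sketch. The bucket width and the linear below-median suspicion score are illustrative stand-ins; the framework itself shapes suspicion with an inverse Gaussian curve over the price histogram (see Figure 3-9).

```python
from collections import Counter
import statistics

def price_histogram(prices, bucket=50):
    """Bucket asking prices so anomalies in the distribution become visible."""
    return Counter((p // bucket) * bucket for p in prices)

def suspicion(price, prices):
    """Toy suspicion score: how far below the market median a price sits,
    scaled to [0, 1]. A linear stand-in for the thesis's inverse Gaussian model."""
    median = statistics.median(prices)
    if price >= median:
        return 0.0
    return (median - price) / median
```

For a cluster of posts priced near $500, a $150 listing lands in an isolated low bucket and receives a high suspicion score, which is exactly the kind of histogram anomaly the classifier flags for further analysis.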
The fourth objective of this thesis is to extract new patterns from the identified
suspicious clusters, allowing new attributes to be discovered. These patterns
either increase or decrease the suspiciousness of a post depending on how they are
applied; for example, if a pattern explains that the price is lower than expected due
to a defect or damage, then suspicion can be decreased; conversely, if a pattern
indicates that the seller only accepts cash or wants to arrange a meeting away from their
residence, then suspicion can be increased. Although a very primitive approach was taken
to demonstrate this capability, more complex approaches are discussed in a later section
and could be added in future work. We extracted new patterns using the same
classification system that was initially created, except analyzing specific suspicious
clusters, or portions of suspicious clusters, while using the remainder of the dataset as a
control. This allowed patterns unique to that price range or cluster to emerge.
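A minimal version of this cluster-versus-control comparison can be sketched with token frequencies. The lift threshold, add-one smoothing, and sample posts below are illustrative assumptions; the actual extraction reuses the classification system described above.

```python
from collections import Counter

def extract_rules(suspicious_posts, control_posts, min_lift=2.0):
    """Surface tokens markedly over-represented in the suspicious cluster
    relative to the control set. Add-one smoothing on the control counts
    avoids division by zero; returned tokens are candidate new attributes."""
    susp = Counter(w for p in suspicious_posts for w in p.lower().split())
    ctrl = Counter(w for p in control_posts for w in p.lower().split())
    n_s, n_c = sum(susp.values()), sum(ctrl.values())
    rules = {}
    for word, count in susp.items():
        lift = (count / n_s) / ((ctrl[word] + 1) / (n_c + 1))
        if lift >= min_lift:
            rules[word] = lift
    return rules
```

On a suspicious low-price cluster, terms such as "cash" surface as candidate suspicion-increasing attributes, while terms common to both sets (e.g. "pickup") are filtered out.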
The fifth objective of this thesis is to propose a framework that acts as an
interface between victims of theft and the police. This addresses the large number
of reported theft cases and the limited resources of the police, allowing some of the
load to be offloaded to an automated system. To achieve this, we had to determine
what information would most likely be requested in a police report, along with how much
practical technical information we could expect the victim to know while reducing the
chance of errors. Ultimately, this was achieved by leveraging large amounts of domain
knowledge and strict user input, such that the user can navigate down to the appropriate
domain with minimal complexity. At that point, they enter a minimal amount of
information, determined by the domain knowledge, under strict input
control parameters.
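The guided-input idea can be sketched as a walk down a small domain-knowledge tree. The categories and attribute names below are invented for illustration; the real tree structure is described in Chapter 4.

```python
def build_tree():
    """A tiny domain-knowledge tree; categories and attributes are
    hypothetical examples, not the thesis's actual schema."""
    return {
        "Electronics": {
            "Phones": {"attributes": ["brand", "model", "colour", "IMEI"]},
            "Laptops": {"attributes": ["brand", "model", "serial number"]},
        },
        "Bicycles": {"attributes": ["brand", "frame size", "colour"]},
    }

def required_fields(tree, path):
    """Walk the user's selections down the tree and return either the next
    categories to choose from or the strict input fields at a leaf node."""
    node = tree
    for choice in path:
        node = node[choice]
    return node.get("attributes", list(node))
```

At each step the user picks from a fixed list, so by the time they reach a leaf the form knows exactly which fields to request, keeping input both minimal and strictly controlled.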
The sixth objective of this thesis is to automatically match reported cases of theft
with suspicious posts found online. To achieve this, we had to analyze the data
used during the interactions between victims and the police, the interactions between
sellers and consumers of goods in an online medium, and the crossover of data
between these two interactions. This analysis allowed us to understand how to transition
from user input that is strictly controlled and heavily influenced by domain knowledge to
user input that is free-form text with varying degrees of domain-knowledge influence,
so that reported cases can be compared to the classified
posts. This was achieved by using other primary attributes such as date, time, and
location, along with classification patterns and domain knowledge, to find the best match
to a reported case.
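One way to combine the theft metadata with keyword evidence is a simple weighted score, sketched below. The field names, weighting, and cutoffs are illustrative assumptions rather than the framework's actual matching logic.

```python
from datetime import datetime

def match_score(report, post, max_days=30, max_km=50):
    """Combine theft metadata with keyword overlap into a [0, 1] score.
    Field names (stolen_on, posted_on, distance_km, keywords, text) are
    hypothetical, chosen only for this sketch."""
    days = (post["posted_on"] - report["stolen_on"]).days
    if days < 0 or days > max_days:   # posted before the theft, or too late
        return 0.0
    time_score = 1 - days / max_days  # sooner after the theft scores higher
    dist_score = max(0.0, 1 - post["distance_km"] / max_km)
    words = post["text"].lower().split()
    keywords = report["keywords"]
    kw_score = sum(k in words for k in keywords) / len(keywords)
    return (time_score + dist_score + kw_score) / 3
```

A post listed a few days after the theft, near the reported location, and matching the item's keywords scores close to 1, while a post predating the theft is discarded outright.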
1.3 Contributions
This research focuses on developing a framework that automates the search
process users or police would go through in an attempt to manually recover a stolen item.
Specifically, the main contributions of this research are threefold:
- A robust, domain-portable framework is introduced, designed to be modular and
possess extremely high domain portability, such that it can be applied to new domains
with very little modification. Modules were designed to function with very little
initial domain knowledge, while adapting over time as the domain knowledge
increases. Additionally, modules were provided with methods for harvesting new
domain knowledge or triggering domain knowledge updates
based on the data being encountered, such that the framework can handle and
identify information from new domains from both a system classification and a user
input standpoint.
- The introduction of the classification system allows posts from various sites to be
accurately merged into consistent domain classifications while extracting new domain
classifications. This system, in addition to documenting the primary and secondary
decision attributes, provides a basic implementation for these attributes. These
attributes are used when attempting to quantify the suspicious nature of a post,
which previously was unquantifiable. This not only provides an initial metric and
training set for other approaches, in the form of a publicly available dataset of online
posts with domain classifications and legitimacy quantifications, but also serves as a
dataset that is further analyzed by the rule extraction module. Rule extraction serves
to either strengthen the decision about a post's suspiciousness or to explain
why it may have been misclassified as suspicious, which increases the accuracy of the
system over time.
- The introduction of the reporting system allows users to submit reported cases of
stolen items with information similar to that provided in a police report. This
interface is expected to bridge the gap between law enforcement report databases and
online media. The method by which this is accomplished identifies the limits of the
user's ability to input information from both a logical and a technical standpoint, and
thus prompts them only for the required information in a linear fashion. This
approach allows us to extrapolate semantic information about new domains without
halting the reporting process. This research is also the first that attempts to
connect the elements of a reported stolen item to actual online posts, that is to say,
utilizing the theft metadata of date, time, and location to reduce the search results.
1.4 Thesis organization
This thesis consists of six chapters. Chapter 1 presented an introduction to the
work; the subsequent chapters are organized as follows:
Chapter 2: In this chapter, a number of related works are surveyed in the domains
of criminal psychology, real-world applications that attempt to address this
problem, and various methods of natural language processing for online text.
Chapter 3: This chapter describes the proposed framework for automated stolen
property recovery in detail, explaining how the various components of the
framework interact and how information flows through the system to achieve the
desired results. In addition, it describes the classification system in detail,
along with an analysis of the AI selection and training process.
Chapter 4: In this chapter, the reporting system is described in detail, specifically
how the domain information is leveraged during the user input process, how the
user input is restricted and controlled, and how the comparison between the user's
reported data and the classified posts is handled.
Chapter 5: This chapter presents other domains to which the framework can be
extended and the adaptations that had to be made to achieve domain portability.
The two domains explored are rental property scams and metro pass
scams, each of which presents unique domain information requirements while
also demonstrating key attributes that were previously undervalued in other
domains, displaying the variance in attribute weights amongst domains.
Chapter 6: This chapter briefly summarizes the key outcomes of the proposed
framework, and presents some suggestions for future research directions of this
work.
Chapter 2
Related Work
There have been a number of accounts of people successfully recovering their
stolen items manually, and groups such as Portland, Oregon's Burglary
Taskforce Unit can assist in this process. However, this process is far from easy and
requires a lot of time to periodically search a number of sites for stolen property.
Academically, very little work has been done to address this issue from a
technical standpoint; however, a number of academic works have approached it from a
criminology standpoint. As we discuss these works in more detail, we will leverage
some of their methodologies and logic in order to develop the proposed framework; as
such, they form the foundation for the automation of stolen property recovery.
Many of the academic works in this area, such as [7], attempt to document the
methodologies of criminals during the sale of stolen or fraudulent items in an online
medium. This information is extremely valuable, since any technical implementation will
leverage the characteristics or attributes identified in these works. A majority of
the cases we examine fall into the category of "short firm" frauds, where the timeline of the
transaction is relatively short. This should be the case for a majority of small-time
criminals due to the nature of the medium and the fact that the items are being sold to the
public; however, it should be noted that career criminals may use specialized online
sites for "long firm" frauds, where the goal is not the initial sale of the item but
subsequent fraudulent sales. We would expect to see this in large sales targeting
businesses, but this is beyond the scope of this thesis. Tradwell's work documents the two
major advantages of selling items online, or reasons why criminal activities have shifted
to sites such as eBay: items can be sold over greater distances, and
language barriers are overcome by commonly accepted acronyms or abbreviations; an
example of this would be "Brand New with Tags" being represented as BNWT. It also
should be noted that many of the interviewees saw this medium as risk free; although
many sites claim to investigate fraudulent listings, the interviewees had seen little
evidence of such action, and it is even noted that sites provide settings whereby taxation
revenues are avoided. Tradwell also
states that the criminals he interviewed had the sole objective of a healthy profit margin
and were indifferent to the legality of the venture. This fact, although obvious, does
indicate that price and profit are the most crucial elements in reducing the sale of stolen
goods online; that is to say, if we can reduce the profit margin or increase the risk such
that it is no longer viable to sell these items, this framework will have a positive impact.
Despite this research, there have been very few technical academic attempts to
combat this issue, and a majority of the time the sites themselves attempt to implement
some form of fraud detection. However, these approaches differ from site to site and there
is no empirical measurement of the success of these techniques.
Portland, Oregon's Burglary Taskforce Unit [8] documents ways that
individuals may attempt to recover their stolen property online, addressing the fact that
stolen items may often turn up on sites such as Craigslist within hours of being stolen;
thus, people attempting to manually recover their stolen property should
check very soon after the theft and consistently afterwards. Although their
recommendations are very good and will form the basis for our approach and assumptions,
this process is very tedious for the victim due to the number of sites available to sell the
item and the large volume of posts for a given item.
Sites such as stolen911.com [6] attempt to address this issue directly, allowing
users to report stolen property and search Craigslist through their site. However, from the
user's perspective it is difficult to say what form of integration and processing occurs
with Craigslist posts. Their site appears to be only a reporting mechanism for
stolen property, where the user can offer a reward should the item be recovered, allowing
major search engines such as Google to index the reported cases, which may trigger a
site's fraud detection mechanism. They also provide users with a Google Custom Search
interface for the Craigslist site, streamlining the searching process for users by restricting
it to just Craigslist and searching all Craigslist area listings. Although this is a very good
initial attempt to address the issue, the amount of information in each reported case is
inconsistent; users enter free-form text, which results in information appearing
in the incorrect field or in varying formats. An example would be the following
reports for a BlackBerry, both of which contain sufficient information; however, the structure of
the information differs between them, and the lack of domain information results in
errors. In the first example, seen in Figure 2-1, the time is included in two
fields and there are minor technical discrepancies between them. In the second example,
seen in Figure 2-2, the model color is included as part of the model.
This displays the need for a deep understanding of the domain information if we
want to automate the comparison of reported cases to online posts as well as very strict
control over the user input. This also presents a classification problem when trying to
determine which domain information should be used when classifying the post.
When looking at the format of text in online posts we quickly see that many
traditional methods of natural language processing (NLP) and named entity recognition
(NER) approaches fall short due to the imperfect nature of user generated content online.
Figure 2-1 - An example of a reported case of
theft from [9]; although it is an extremely
detailed report, information regarding the time
of the theft appears in multiple fields with
inconsistent technical information.

Figure 2-2 - An example of a reported case of
theft from [10]; containing much less detail than
its counterpart in Figure 2-1, the model
information includes the color of the phone, which
should be a separate field.
These systems are often heavily reliant on sentence structure to resolve parts of
speech (PoS) [11][12], which in turn are used for feature selection during the NLP
process [13]. Handling online text generally means dealing with many spelling and
grammar mistakes, a frequent lack of sentence structure, and many abbreviations; the lack
of sentence structure is especially true in the case of online posts. This prevents effective
PoS disambiguation due to the relatively small sentence size and domain-specific
contexts. Due to these issues, traditionally trained NLP techniques are not applicable and
would require retraining with either very simple or extremely complex rules; this would
often result in very high false positive or misclassification rates, and would require a
dedicated training set for each target domain. This unfortunately makes the approach
more of an expert system trained only in the target domain, with excessive overhead
required to train the system. Similarly, the use of the NER approaches discussed in [14]
to detect models or other keywords, which would serve as identifiable characteristics of a
post, also fails due to similar issues with grammar and sentence structure. Although this
approach is slightly more robust due to its reduced training requirements, the number of
possible abbreviations means a large number of training posts is required in order to
identify specific abbreviations; additionally, fundamental assumptions such as
capitalization and proper punctuation, which are traditionally used as boundary markers
to identify entities, are not valid for online text.
Because existing approaches cannot be directly applied to solve this issue, we
must leverage the successful elements of the manual recovery process and of the other
works discussed in order to fully automate this process. We propose a framework that
addresses these issues by automatically pulling ads from high-traffic sites such as
Craigslist, Kijiji, and other popular classified ad sites and processing them. This involves
determining what the ads are attempting to sell, gathering characteristics about the item,
and extracting useful information about trends in the markets. Posts that do not follow the
market trends or expected rules are matched against reported stolen items; for example,
if there is a larger number of posts than expected at an obscure price relative to the
normal market distribution, we would compare these posts against reported cases. This
approach differs from the other systems attempting to solve this issue because it
converges the search process for many popular classified posting sites into a single
system, while attempting to classify their posts as to the actual item being sold. The
framework dynamically clusters posts with the same classification to extract
characteristics about them; this allows the system to adapt to new items introduced to the
market, and even to new domains, with very little initial domain knowledge. Optimally,
no domain knowledge would need to be manually entered into the system over its
lifespan. This clustering also allows for dynamic rule generation to identify emerging
suspicious trends in the same manner it would identify new models. A framework
automating this process has not really been approached in the past, and because of this,
training data for identifying suspicious posts is not available. Future work in this area
would be able to leverage our system's classifications of suspicious and non-suspicious
data sets as both a training set and a benchmark for their work.
Chapter 3
E-Fencing Detection Framework
In this chapter we present the proposed framework, which is comprised of the
following major components: ad retrieval, ad classification, rule database/rule
extraction, and the reporting system. The framework as a whole can be divided into two
parts, the first being the passive identification of suspicious posts and the second being
the active searching for matches to reported stolen items. In this chapter we focus
primarily on the passive identification components, specifically ad retrieval, ad
classification, and some aspects of the rule database, for which we propose a novel
approach that analyzes classified ad websites for stolen goods. First we describe
the framework as a whole, and then describe each component of the system in detail.
The proposed framework begins by regularly crawling websites such as Kijiji,
Craigslist, etc. and fetching pages or posts. This phase can be referred to as the ad
retrieval phase and occurs periodically to maintain an up-to-date database of posts from
these websites. We then introduce a category-based information retrieval system to
extract characteristics of the item in the posting, along with all the posting metadata such
as time, price, seller, location, etc. Although websites often pre-classify their postings
into a set of manually predefined categories, this information may not always be accurate,
due to users erroneously placing items in the incorrect category. Users may even
intentionally misclassify their item to increase the number of views their post gets.
Because of the potential for incorrect classifications, all posts are processed through the
classification system to obtain the best classification within the system.
Once the posts have been roughly categorized into clusters, further analysis of the
categories can be done. This involves attempting to extract domain knowledge of the
given topic, such as the average selling price and standard deviation per model, the
models available, etc. This analysis results in two kinds of domain-specific knowledge:
identifiable characteristics and "suspicious" characteristics. An example of an identifiable
characteristic would be something that further breaks down the category of BlackBerry
into the various colors of the models, which would decrease the number of posts that
would have to be searched if someone was looking for a specific color of BlackBerry.
This knowledge is used in place of, or to further enhance, the initial domain knowledge
provided to the system. Similarly, an example of a "suspicious" characteristic would be
the extracted trend of the current market price of a model; posts selling for far less
than that price would be suspicious. This process also alleviates the need to constantly update
the system with the newest domain knowledge or market price, as the system recalibrates
the various classifications and market prices automatically. It should be noted that an
initial sanity check may have to be conducted to prevent a large number of false
ads from skewing the market trend to a lower price.
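As an illustration of the kind of "suspicious" price characteristic described above, the following sketch flags posts priced far below a cluster's market trend. This is not the thesis's implementation; the two-standard-deviation cutoff and the toy price list are illustrative assumptions.

```python
from statistics import mean, stdev

def flag_price_anomalies(prices, threshold=2.0):
    """Flag prices far below the cluster's market trend.

    Illustrative sketch: the framework derives the trend from classified
    clusters; here the "market" is simply the input list, and the
    two-sigma cutoff is an assumed choice.
    """
    mu, sigma = mean(prices), stdev(prices)
    return [p for p in prices if p < mu - threshold * sigma]

# One post priced suspiciously below the others
market = [300, 310, 295, 305, 290, 315, 300, 80]
print(flag_price_anomalies(market))  # → [80]
```

Because the mean and standard deviation are themselves pulled down by fraudulent listings, this kind of check benefits from the initial sanity check mentioned above.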
Once a category has had its key features extracted, these can be stored and
referenced in a rule database so that all posts do not need to be reanalyzed. Once a
category's features have been defined, posts that fall outside the normal distribution or
violate a rule are flagged as suspicious; the more rules violated, the higher the confidence
in the post's suspiciousness. An example would be someone selling a phone for far less
than the market average, or using a stock photo or a photo from another post that we have
already processed.
At this point the active identification process begins, utilizing the previously
derived information to match reported cases. Users submit a report for their stolen
item through the reporting system, which allows us to control the input method for the
characteristics of the item, while also allowing the system to leverage any domain
knowledge to prompt for key characteristics. This removes the need to apply any natural
language processing techniques to the user's input, since it is fixed-field input. Once the
user has submitted the information about their stolen item, the system attempts to
match it to the classified posts and return any potential matches to the authorities. The
search prioritization leverages the fact that we have identified suspicious posts, which
should therefore be searched first, but the system will continue searching non-suspicious
posts should there be insufficient confidence in previous matches. This addresses the
possibility that there may be better matches in non-suspicious posts, or that the system is
simply not calibrated to detect something that law enforcement agencies can. This
two-stage search process reduces the initial search domain by searching suspected stolen
goods first, followed by other posts. Should the system rules be accurate enough, the
searching process could be stopped after the first stage of checking suspicious posts. A
diagram of how these components interact can be seen below in Figure 3-1.
[Figure 3-1 diagram: ads from Kijiji, Craigslist, etc. flow through ad retrieval into the classifier; the clustered data and rule database separate posts into suspicious and normal sets, which the reporting system checks in order.]
Figure 3-1 – Proposed framework overview; the system first retrieves ads from various sites,
then classifies and clusters these posts, from which rules can be applied or extracted. Once
the data is classified and separated into suspicious and non-suspicious designations, it can be
searched based on user reports submitted to the reporting system. This process loops, searching
all newly classified posts, until a match is found.
To illustrate how the proposed system works, let us use the example of a cell
phone as a case study. The system retrieves all posts from classified ad sites, which
the classifier then automatically attempts to classify based on keywords; therefore,
posts containing the cellphone manufacturer RIM (Research In Motion) will be
clustered together. It should be noted that many other clusters would be formed before
this, but for the purposes of this case study we will only focus on those that form after
this initial clustering. From this point sub-clusters would form around cellphone
characteristics, such as model, color, condition, or specifications. We would expect to see
clusters for characteristics such as "Blackberry", "Bold", "32 GB", etc.
Once all classifications have been done, these clusters are then analyzed against
the rule database, checking for suspicious word strings, abnormally low prices, or other
characteristics that would lead to an item being deemed suspicious. From this point,
suspicious posts are labeled and receive a higher priority in the searching process that
follows.
At this point any reported stolen items submitted to the reporting system, where
the active component of the searching starts, are compared with the database of
posts to find the best match. Users are asked to give as much information as possible
regarding the stolen item, the time, and the location during the reporting process. These
inputs are rigidly defined to increase the accuracy of the system, either by leveraging
external domain knowledge or by later extracting the domain knowledge from the
classification clusters. Input to the system could be pulled either from a direct
interface or from various law enforcement databases for stolen property such as
Trace, America's largest law enforcement database for reported stolen property. For this
case study we will assume someone reports that their Blackberry Bold 9900 was stolen
on April 10th, 2014 around 10pm from their residence; the item's condition is moderately
used with a notable scratch on the screen. The system would check for posts that had
been classified as suspicious and as "Blackberry Bold 9900" and that were posted after the
time of theft and in close proximity to the location of theft. Matching posts would be further
refined by checking for keywords that may indicate the condition of the item or
the identifying characteristic of the scratch mark. However, it should be noted
that this secondary process would not remove candidates from the matching process but
simply increase the confidence of those that match additional characteristics, and
inversely decrease the confidence of those that display conflicting characteristics.
Matches found by the system would be returned to the police for further investigation.
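The confidence adjustment just described, where matched characteristics raise a candidate's score and conflicting ones lower it without removing the candidate, might be sketched as follows. The trait/conflict vocabulary, the weights, and the substring matching are all illustrative assumptions, not the thesis's actual scoring.

```python
def match_confidence(report_traits, post_text, base=0.5, step=0.1):
    """Raise confidence for each reported characteristic found in the post,
    lower it for each conflicting characteristic; candidates are never
    removed outright. Weights and matching are illustrative assumptions."""
    conf = base
    text = post_text.lower()
    for trait, conflicts in report_traits.items():
        if trait in text:
            conf += step
        elif any(c in text for c in conflicts):
            conf -= step
    return max(0.0, min(1.0, conf))

# Reported traits mapped to wordings that would conflict with them
report = {"scratch": ["mint", "like new"], "black": ["white"]}
post = "Blackberry Bold 9900, black, small scratch on screen"
print(match_confidence(report, post))
```

Here the post mentions both the scratch and the color, so its confidence rises above the base; a post advertising "mint" condition would instead be demoted but kept as a candidate.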
Now that we have an understanding of how the proposed system works, we will
go into detail about each of the system's components in subsections, to understand how
they function and the design considerations that can alter how they function. We will begin
by looking at how the ad retrieval process was designed and how it functions.
3.1 Ad Retrieval
The automation of ad retrieval is key to allowing a huge number of ads to be
checked, while also maintaining up-to-date average market prices for items. This involves
parsing entire sites for all postings, in the case of sites dedicated to the target domain, or
parsing specific portions of sites, in the case of sites with multiple domains such as
Craigslist. Currently all sites to be parsed must be manually defined; however,
future improvements may allow for automated site discovery.
For dedicated sites, a simple regex matching the HTML anchor tags "<a href>"
and "</a>" allows all publicly accessible links within the site to be parsed. The links can
be restricted by using relative or absolute references to the current
site domain, removing the possibility of parsing outside the site domain. It should be
noted that parsing the entire site can result in many pages that do not contain relevant
information, such as the main page, indexing/search pages, terms of service, etc.
However, this does not pose a problem due to how the classification process occurs,
which is described in detail in section 3.2; for the purposes of this section, note that
these pages will not be classified, due to their lack of characteristics, and thus will not
impact the classification or rule extraction phases.
For sites that have multiple domains, such as Craigslist and Kijiji, a similar
approach can be used to restrict the links to a subdomain of the site, or the structure of
the page and the portions to be parsed can be specified manually. In some cases,
characteristics of post pages must be manually defined, such as pages that have
no further links, to prevent cyclical parsing. An example of this would be a page that
contains a user's ad but also contains a link to the website's home page, which would result
in cyclic parsing of the site if included. This can alternatively be solved by keeping an
array of parsed URLs for a given site during the retrieval process and checking the array
to see if a page has already been retrieved.
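The retrieval steps above, anchor-tag extraction, restriction to the site's own domain, and an already-parsed check to avoid cyclic parsing, can be sketched as follows; the regex and helper names are illustrative assumptions rather than the system's actual code.

```python
import re
from urllib.parse import urljoin, urlparse

HREF_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)

def extract_site_links(html, base_url, visited):
    """Collect anchor targets from a page, keep only same-domain links,
    and skip URLs already parsed (preventing cyclic crawling)."""
    domain = urlparse(base_url).netloc
    links = []
    for href in HREF_RE.findall(html):
        url = urljoin(base_url, href)        # resolve relative references
        if urlparse(url).netloc != domain:   # stay within the site domain
            continue
        if url in visited:                   # already-parsed check
            continue
        visited.add(url)
        links.append(url)
    return links

page = '<a href="/post/123">ad</a> <a href="http://other.com/x">external</a>'
seen = set()
print(extract_site_links(page, "http://example.org/", seen))
# → ['http://example.org/post/123']; a second call on the same page returns []
```

The shared `visited` set plays the role of the parsed-URL array described above: a link such as the site's home page is collected at most once, so following it again cannot restart the crawl.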
Posts more than a few months old are filtered prior to classification to ensure that
very old posts are not parsed during the classification and price extraction processes,
preventing them from skewing the classification patterns or market prices. Sites such as
Craigslist automatically restrict retrieval to posts less than six months old; for sites that
do not limit the age of posts, they are filtered manually during this process.
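The age filter might look like the following sketch; the six-month cutoff mirrors Craigslist's behavior described above, while the post data layout is an assumption.

```python
from datetime import datetime, timedelta

def filter_recent(posts, max_age_days=180, now=None):
    """Drop posts older than roughly six months so stale ads do not skew
    classification patterns or market prices. Post layout is assumed."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return [p for p in posts if p["posted"] >= cutoff]

reference = datetime(2014, 9, 1)
posts = [
    {"id": 1, "posted": datetime(2014, 8, 20)},   # recent: kept
    {"id": 2, "posted": datetime(2013, 12, 1)},   # ~9 months old: filtered
]
print([p["id"] for p in filter_recent(posts, now=reference)])  # → [1]
```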
Once these ads have been parsed, they are saved to either the local machine or a
database for reference, should the ad be removed from the site. The name of each post is a
combination of the post ID/URL with respect to the site and a hash of the post content.
This serves two purposes: first, allowing the system to check whether the page has already
been parsed; and second, allowing the system to determine whether the page has been
modified since its last download. Now that we understand how the ad retrieval process
works, we will look at how the classification system works.
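Before moving on, the post-naming scheme just described can be sketched as follows; the exact name format and the choice of SHA-256 are illustrative assumptions.

```python
import hashlib

def post_key(site, post_id, content):
    """Build a storage name from the post ID and a hash of the post
    content, so the system can tell both whether a page was already
    parsed and whether it changed since the last download."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"{site}_{post_id}_{digest}"

k1 = post_key("kijiji", "12345", "Blackberry Bold 9900, $150")
k2 = post_key("kijiji", "12345", "Blackberry Bold 9900, $120")  # edited price
print(k1 != k2)  # same post ID, different content hash: the page was modified
```

Matching the ID prefix detects a re-parse of the same post; a differing hash suffix for the same ID signals that the post was edited since the last download.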
3.2 Ad Classification
Given the unknown number of possible classifications, the system needs to
identify classification patterns automatically with as little initial domain knowledge as
possible. Limiting the amount of initial domain knowledge required keeps the system
flexible enough to recognize sub-trends while also requiring less human intervention to
update the domain knowledge. When looking at how this could be achieved, it was clear
that an AI approach would be needed, but many factors impacted how it should be
implemented. The following sub-sections look at each of these factors and how they
were addressed.
3.2.1 AI Model
When first looking at how such a system could be trained, there is the option of a
supervised system, which would require labeled training data with explicit examples for
each classification, or an unsupervised system, which requires unlabeled training
data. Given that we do not know the number of classifications that actually exist, that a
large amount of domain knowledge is required to properly classify online posts, and the
volume of unlabeled online posts on the internet compared to the cost of labeling them,
an unsupervised approach seems much more practical for this application. Although a
supervised system would be preferred for its faster training time, the cost of labeling
training data and the rate at which domain knowledge changes make a supervised
approach impractical.
When looking at how traditional clustering algorithms work, it is clear that
they largely operate only on data that can be numerically represented. In our case this
was not possible while maintaining the relations between the keywords. Since our
classifications were going to be keyword combinations, as we are dealing with natural
language as an input, translating the data into a numerical representation would not be
possible until a classification had been done: a keyword at position n would only be
related to its surrounding data n±y, where y is the range of the surrounding data, if the
classifications were known when translating the data into numerical form. Another way to
look at this is that maintaining the linguistic characteristics we want to identify while
translating the data into another form is very difficult, either requiring knowledge of these
linguistic characteristics in the form of domain knowledge, or requiring supervised
training data so the system could derive them automatically. Due to this limitation we
had to look more specifically at which AI model would allow us to
solve the classification problem without translating the data. We chose to take an
evolutionary computation approach, since our problem resembles the traveling salesman
problem: attempting to cover as many nodes as possible with a classification while not
overlapping other classifications.
Evolutionary computation algorithms traditionally function as global
optimizers, leveraging the Darwinian processes of cross-over and mutation to alter
existing encoded data, or chromosomes. The most basic approach is the genetic algorithm,
or evolutionary algorithm, which has a population of size p whose chromosomes are
encoded with random values. Cross-over is then performed on the population based on
each member's relative fitness for the given problem. This results in members
that are closer to an optimum having a higher chance of their encoded data appearing in
future populations. Mutation has a low chance of being performed on each resulting
member in each iteration; this allows members to break out of local optima and reach the
global optimum.
Considering these limitations, an unsupervised AI approach leveraging differential
evolution to generate the classification patterns was chosen. Using differential evolution
over the traditional genetic algorithm provides the benefit of restricting child
chromosomes to a fitness value at least as high as that of their parents. This provides a
more predictable convergence to optima while keeping high-fitness chromosomes in the
population. Modifications to the transition process, covered in section 3.2.5, relax the
transitions from exclusively higher fitness to conditionally lower fitness when another
metric is higher.
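For readers unfamiliar with differential evolution, the following minimal numeric sketch shows the greedy selection rule the text relies on: a trial (child) vector replaces its parent only when its fitness is at least as high. Note this is the textbook DE/rand/1/bin form on a toy numeric objective; the thesis's chromosomes are keyword arrays, as described in section 3.2.2, not numeric vectors.

```python
import random

def differential_evolution(fitness, dim, pop_size=20, F=0.8, CR=0.9, iters=200):
    """Textbook numeric DE/rand/1/bin, shown only to illustrate the greedy
    selection step: a trial vector replaces its parent only if its fitness
    is at least as high, so high-fitness members are never lost."""
    pop = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(iters):
        for i, parent in enumerate(pop):
            # Mutation: combine three other random population members
            a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
            # Crossover: mix the mutant with the parent, gene by gene
            trial = [a[d] + F * (b[d] - c[d]) if random.random() < CR else parent[d]
                     for d in range(dim)]
            if fitness(trial) >= fitness(parent):  # greedy selection
                pop[i] = trial
    return max(pop, key=fitness)

# Toy objective with its maximum at (1, 2)
best = differential_evolution(lambda x: -((x[0] - 1) ** 2 + (x[1] - 2) ** 2), dim=2)
print([round(v, 3) for v in best])  # converges close to [1.0, 2.0]
```

The greedy `fitness(trial) >= fitness(parent)` test is the property the thesis builds on; sections 3.2.3 and onward relax exactly this rule for keyword chromosomes.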
3.2.2 Chromosome Representation
One of the most important design considerations for any AI system is how the
data is represented, in our case how the chromosome should be encoded, such that it
can be maximized appropriately. When looking at how the chromosomes could be
represented logically, it made sense to represent them as an array of keywords,
especially given the difficulties presented earlier regarding the translation of natural
language data. We chose to converge to a keyword optimum before incrementing the
chromosome array length; this allows optima to be found for each array length rather than
encoding the array length in the chromosome. The reason for this is that we do not know
the optimal array length, and encoding it within the chromosome would result in shorter
array lengths being better, or "less wrong", since there would be fewer keywords to impact
the fitness of the chromosome; because of this issue, having the chromosome length
increase each iteration results in more accurate and consistent convergence. This process
has the byproduct of producing a keyword hierarchy for the classification. The
classification process takes a "one vs. all" approach, attempting to find a single pattern
that classifies the largest number of posts within the current training set. Each iteration
increases the chromosome length until the fitness of the pattern decreases from the
previous iteration's optimum. This ensures that the classification pattern is complete and
that all words capable of belonging to that pattern are reasonably exhausted.
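The length-increment loop described above might be sketched as follows; `optimize_at_length` stands in for the differential evolution run at a fixed chromosome length and is a hypothetical helper, as is the toy fitness table.

```python
def grow_pattern(optimize_at_length, max_len=10):
    """Grow the keyword chromosome one slot at a time: converge to the best
    pattern at each array length, stopping once adding a keyword lowers the
    optimum fitness. optimize_at_length(n) is assumed to return
    (pattern, fitness) for the best n-keyword pattern."""
    best_pattern, best_fit = optimize_at_length(1)
    for n in range(2, max_len + 1):
        pattern, fit = optimize_at_length(n)
        if fit < best_fit:  # fitness dropped: the previous length was complete
            break
        best_pattern, best_fit = pattern, fit
    return best_pattern, best_fit

# Toy stand-in whose fitness peaks at three keywords
def toy(n):
    fits = {1: 5.0, 2: 7.0, 3: 9.0, 4: 6.0}
    return [f"kw{i}" for i in range(n)], fits.get(n, 0.0)

print(grow_pattern(toy))  # → (['kw0', 'kw1', 'kw2'], 9.0)
```

Each accepted length corresponds to one level of the keyword hierarchy the text mentions, with the loop terminating when a fourth keyword can only lower the optimum fitness.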
Similar to the issues discussed in [15], where text summarization tends
towards words like "the" unless comparative weighting such as TF-IDF is used, we
discovered the issue of transitional or "noisy" words being prime candidates for
classification patterns. To address this issue we introduce the concepts of a training set
and a global set. The training set is the group of data from which we want to derive
classification patterns, while the global set serves as a uniformly distributed set of various
non-target domains to dampen this noise. This allows the word frequency in the training
set to be weighed against the global set to determine whether a word is truly unique to the
target domain. This was done for two reasons: first, to remove common words that would
appear in all posts; and second, because it allows a recursive structure to further refine
classification sets. An example of this would be the subset of posts that matched a
specific keyword pattern, say "Blackberry Bold 9900", being input as the training set
while all other posts containing the keyword "Blackberry Bold" are input as the global
set. This allows more specific subcategories to be recognized.
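The weighing of training-set word coverage against the global set might look like the following sketch; the add-one smoothing, whitespace tokenization, and ratio threshold are illustrative assumptions, not the thesis's actual weighting.

```python
def domain_keywords(training_posts, global_posts, min_ratio=2.5):
    """Score each training-set word by its coverage there relative to its
    coverage in the global set; words common everywhere ("for", "sale")
    are damped, words unique to the target domain stand out."""
    def coverage(word, posts):
        return sum(1 for p in posts if word in p.lower().split())
    vocab = {w for p in training_posts for w in p.lower().split()}
    scores = {}
    for w in vocab:
        train = coverage(w, training_posts) / len(training_posts)
        glob = (coverage(w, global_posts) + 1) / (len(global_posts) + 1)  # smoothed
        if train / glob >= min_ratio:
            scores[w] = train / glob
    return scores

training = ["blackberry bold 9900 for sale", "mint blackberry bold", "blackberry 9900 cheap"]
global_set = ["iphone 5 for sale", "couch for sale cheap", "used bike for sale"]
print(sorted(domain_keywords(training, global_set)))  # → ['9900', 'blackberry', 'bold']
```

Words such as "for" and "sale" appear in both sets and are damped below the threshold, while the domain-specific terms survive; feeding the surviving subset back in as a new training set gives the recursive refinement described above.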
3.2.3 Fitness Overview
The next largest design consideration is how the fitness of the chromosome
should be measured. The fitness calculation is absolutely crucial to the accuracy of the
system given that it controls how optimization occurs. In order to understand how fitness
will be calculated we must first discuss the concept of coverage and how we want the
fitness to behave within the system.
Coverage is defined as follows: the number of posts that contain a given word at
least once; for multi-word patterns, the intersection of these sets is taken.
Coverage was chosen over raw word frequency to avoid the amplified noise from
transition words. This concept of coverage is used heavily throughout the fitness
calculation.
The training and global sets help identify the linguistic patterns that are unique to
the target domain without any prior knowledge of that domain. Unfortunately, there
are often many linguistic patterns in the training set, and, given the nature of maximization
algorithms, they will naturally converge to a single optimum. Preventing local optima
from being discarded in favor of other classification optima is quite difficult, but is
required to discover the multiple classification patterns that exist in the training set.
Specifically, in genetic algorithms or differential evolution the algorithm will attempt to
maximize a chromosome's fitness value, so there must be a way of preventing lower-fitness
patterns from being removed from the population if they are potentially valuable.
Similar to how differential evolution improves on the genetic algorithm by only allowing
chromosome transitions when the fitness is higher, we introduce other restrictions to
achieve this. We limit the transitions that can occur between chromosome values to those
meeting the criteria below:

The fitness value of the new chromosome must be at least 90% of the current
chromosome's fitness. This allowed variation permits transitions to
non-increasing-fitness supersets and then later into more specific subsets.
Although fitness does account for the coverage size of a set, having this variation
allows a more consistent transition to the global optimum for a given
classification pattern.
The new chromosome must be a superset of the current set described by the
chromosome, or the ratio of training post coverage lost must be outweighed by
the ratio of fitness gained. This limits the number of transitions between
similar-fitness optima.
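A sketch of how the two acceptance criteria above might be checked; `cur_posts` and `new_posts` are the training-post sets each pattern covers, and the exact trade-off formula for the second criterion is an illustrative assumption.

```python
def accept_transition(cur_fit, new_fit, cur_posts, new_posts, slack=0.9):
    """Apply the two transition restrictions: new fitness must be at least
    90% of the current fitness, and a non-superset move must buy enough
    fitness to outweigh the coverage it gives up (formula assumed)."""
    if new_fit < slack * cur_fit:       # restriction 1: 90% fitness floor
        return False
    if new_posts >= cur_posts:          # restriction 2: supersets always pass
        return True
    lost = len(cur_posts - new_posts) / max(len(cur_posts), 1)
    gain = (new_fit - cur_fit) / max(cur_fit, 1e-9)
    return gain > lost                  # coverage loss must be outweighed

bold = {1, 2, 3}              # posts covered by "bold"
blackberry = {1, 2, 3, 4, 5}  # posts covered by "Blackberry" (a superset)
iphone = {7, 8}               # disjoint "IPhone" posts
print(accept_transition(10.0, 9.5, bold, blackberry))     # superset → True
print(accept_transition(10.0, 10.5, iphone, blackberry))  # disjoint, small gain → False
```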
These restrictions prevent lower-fitness patterns from converging to a single
global optimum even when there is a small fitness gain. An alternate approach, restricting
transitions based on a fitness-gain threshold, displayed poor results, slowing down
convergence while only protecting the n-best optima within the threshold's window of the
global optimum. Although the two restrictions above do not inherently prevent all
chromosomes from converging to a single optimum, they do assist in resolving multiple
clusters that have little to no intersection when multiple patterns exist. This property of not
explicitly restricting transitions outside of the current subset is required to allow
chromosomes initially seeded with noisy patterns to transition into subsets of actual
patterns. An example of how these restrictions would be applied can be seen in Figure
3-2: a transition from a chromosome with the word "bold" to the word "Blackberry"
would be allowed, since most of the training posts that contain the word "bold" also
contain the word "Blackberry". A transition from a chromosome with the word
"IPhone" to the word "Blackberry", on the other hand, would not be allowed unless the
difference in fitness values was extremely large. In this case the "IPhone" subset would
likely have insufficient representation in the training set and be interpreted as noise until a
classification pattern removes "Blackberry" and "Bold" from the training set.
Figure 3-2 – Visual representation of classification pattern sizes and their intersections.
Chromosomes with the word “bold” would be allowed to transition to the word “Blackberry” due to
their high degree of overlap.
In the event that a pattern has insufficient representation in the training set,
it will likely be interpreted as noise; an example of this can be seen in Figure 3-3.
This is resolved in later cycles of the classification algorithm when there is either
more data for the given set or the posts that belong to an already-discovered
classification exit the training set. Now that we have a basic understanding of how
chromosome fitness will be evaluated, we will go into detail about how it is calculated.
Figure 3-3 - Frequency analysis of words in the training post set; the true optima “Blackberry” and
“9900” are visible, while “Bold” and sub-trends such as “iPhone” are present but obscured by noise.
3.2.4 Fitness Function
Coverage can be more formally shown as the following equation:
Coverage = \sum_{i=1}^{PostSetSize} \bigcap_{j=1}^{ChromSize} \left( PostSet_i \cap Chrom_j \right) \qquad (1)
The fitness calculation is broken into 3 components listed below:
F_1 = \frac{TrainingCoverage_{Chrom}}{TrainingCoverage_{Chrom-1}}, \text{ where } 0 \le F_1 \le 1 \qquad (2)
The first factor, calculated with Equation (2), checks whether the new word adds value
relative to the previous n − 1 chromosome words over the training set. F_1 is bounded
within this range because the fraction can never exceed 1, and by the second iteration
the previous chromosome will have converged to a non-zero fitness value; this requires
that F_1 have a non-zero value during the first iteration, which in turn gives the
denominator a non-zero value in the second iteration. During the first iteration, when
there is no previous chromosome to compare with, initial experiments used the entire
training-set size, but this proved problematic as the chromosome size increased, since
the previous chromosome would match far fewer posts than that. This effectively assumed
that the previous pattern would match everything, which could only be true if a single
pattern existed in the training data. F_1 was later set to 1 for the first iteration.
This change was made for one main reason, which cascades into many other calculations:
it makes F_1 fitness values comparable between iterations, which in turn prevents
poor-fitness patterns in the next iteration from having higher relative fitness than
legitimate patterns in the previous iteration. This not only prevents legitimate patterns
from being removed after the first iteration, but also prevents patterns from being
artificially extended past their true length due to incomparable fitness values between
the first and second iterations. Although this has a slight inverse effect on the first
iteration, giving its chromosomes slightly higher fitness values, this is offset by the
fact that as the chromosome size increases F_2 is assumed to alleviate the difference in
fitness for highly overlapping sets, in which case the value of F_1 will remain
reasonably close to 1.
F_2 = \frac{ChromSize}{k \cdot AvgPostSize}, \text{ where } 0 \le F_2 \le 1 \qquad (3)
The second factor, calculated with Equation (3), weighs the length of the
chromosome; this offsets the fact that as the chromosome length increases, the subset
of matching training posts decreases even for legitimate patterns. k is a constant that
varies with the training set; it normalizes the impact that increased chromosome length
has on the fitness, so that chromosome length positively affects the fitness while the
pattern is not overextending past the true trend length. Manipulating k affects the
relative weight of F_1 and F_3: increasing k depreciates the value of chromosome length
and thus increases the relative weight of F_1 and F_3, while decreasing k increases the
value of chromosome length and thus decreases their relative weight. For our experiments
we left k at 1, which should also be the starting point for new training domains.
Although F_2 is normally bounded between 0 and 1, there is a case where it is unbounded:
if the average post size is very small and the classification pattern only exists within
the larger posts, then F_2 can exceed 1. In the event that the chromosome length ever
surpasses the average post length relative to k, the iteration should terminate early
due to the instability introduced by values greater than 1. In practice, however,
termination should occur naturally before this point, because all matching posts shorter
than the new chromosome length will no longer match, producing a significant decrease
in F_1 that the increase in F_3 should not be able to offset.
F_3 = \frac{TrainingCoverage_{Chrom} - GlobalCoverage_{Chrom} \cdot \frac{TrainingSetSize}{GlobalSetSize - TrainingSetSize}}{TrainingSetSize} \qquad (4)
The last factor, calculated with Equation (4), compares the number of training
posts matched against the number of global posts matched, normalized by the respective
size of each set; in the equation only the global coverage is explicitly normalized,
because this set is assumed to be much larger than the training set. F_3 is bounded
between −1 and 1; the lower bound occurs when the pattern is more predominant in the
global set than in the training set.
These factors are then combined using a harmonic mean, Equation (5), to
produce a single metric that balances them. This allows for variation in parts of the
fitness while heavily penalizing the fitness if any single factor has a low value.
Harmonic means are used heavily throughout NLP classification, and we extend their
application to the calculation of our fitness. The harmonic mean between two values can
be seen in Figure 3-4; whereas averaging the values would produce a flat plane between
0 and 1, the harmonic mean curves the plane so that the greatest result occurs when the
two values are very similar to each other.
Fitness = \frac{3 \cdot F_1 \cdot F_2 \cdot F_3}{F_1 + F_2 + |F_3|}, \text{ where } 0 \le Fitness \le 1 \qquad (5)
The absolute value of F_3 must be taken in the denominator, since its negative lower
bound would otherwise produce an asymptotic plane where F_1 + F_2 = −F_3. This allows
the harmonic mean calculation to incorporate F_3 without substantially changing the
shape of the function; Figure 3-4 shows the normal shape of a harmonic mean between two
elements spanning the range 0-1, while Figure 3-5 shows the effect of allowing F_3 to
take negative values while using its absolute value in the denominator.
Figure 3-4 – Harmonic mean of F1 and F2. Figure 3-5 – Harmonic mean of F2 and F3.
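As an illustration, the fitness computation above could be sketched as follows. This is our own minimal reading of Equations (1)-(5), assuming posts are represented as sets of words and that coverage counts the posts containing every chromosome word; the helper names and the set representation are ours, not the thesis's implementation.

```python
def coverage(posts, chrom):
    # A post matches when it contains every chromosome word (cf. Equation 1).
    return sum(1 for post in posts if all(word in post for word in chrom))

def fitness(chrom, training, global_posts, k=1.0):
    n_train, n_global = len(training), len(global_posts)
    cov_train = coverage(training, chrom)
    cov_global = coverage(global_posts, chrom)

    # F1 (Eq. 2): worth of the newest word relative to the previous n-1
    # words; defined as 1 on the first iteration so values stay comparable.
    if len(chrom) == 1:
        f1 = 1.0
    else:
        prev = coverage(training, chrom[:-1])
        f1 = cov_train / prev if prev else 0.0

    # F2 (Eq. 3): chromosome length normalized by the average post size.
    avg_post_size = sum(len(p) for p in training) / n_train
    f2 = len(chrom) / (k * avg_post_size)

    # F3 (Eq. 4): training coverage against global coverage, with the
    # global side scaled down to the training-set size.
    scaled_global = cov_global * n_train / (n_global - n_train)
    f3 = (cov_train - scaled_global) / n_train

    # Eq. 5: harmonic mean; |F3| keeps the denominator well defined.
    denom = f1 + f2 + abs(f3)
    return (3.0 * f1 * f2 * f3) / denom if denom else 0.0
```

A negative result here simply reflects a negative F_3, i.e. a pattern more predominant in the global set than in the training set.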
3.2.5 Population Transition
As previously mentioned, restrictions had to be put in place to prevent lower-fitness
patterns from converging to a single higher-fitness pattern. These conditions were
introduced in Section 3.2.3: first, the test chromosome must have at least 90% of the
member's fitness; second, the test chromosome must either be a superset of the current
chromosome, or the ratio of coverage lost must be outweighed by the ratio of fitness
gained.
The second condition is defined as follows:
\frac{Fitness_{Chrom_{test}}}{Fitness_{Chrom_{member}}} \ge \frac{Coverage_{Chrom_{member}}}{Coverage_{Chrom_{test}} \cap Coverage_{Chrom_{member}}} \qquad (6)
It should be noted that we only need to consider cases where coverage is lost: if
coverage is gained, i.e. the test chromosome is a superset, then the right-hand side
equals 1 and we are only looking at increasing or comparable fitness values. Conversely,
the right-hand side increases in value as the sets separate or narrow, in which case the
ratio of fitness gained must be compared against that of the coverage lost. A graph of
this condition can be seen in Figure 3-6, where the transitions were monitored for a
generation to see which transitions were being allowed, along with the range of
transitions occurring.
Figure 3-6 - Graph of passed vs. failed chromosome translations based on Equation (6). Red data
indicates failed translations; blue data indicates passed translations.
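A minimal sketch of this acceptance test, combining the 90% rule from Section 3.2.3 with the condition of Equation (6). The function and parameter names are ours; coverage values are assumed to be post counts, with `cov_overlap` the size of the intersection in Equation (6).

```python
def allow_transition(fit_test, fit_member, cov_member, cov_overlap,
                     is_superset):
    """Return True if a member chromosome may transition to a test one.

    cov_member is the member's coverage; cov_overlap is the number of
    posts covered by both chromosomes (the intersection in Equation 6).
    """
    # Condition 1: keep at least 90% of the member's fitness.
    if fit_test < 0.9 * fit_member:
        return False
    # Condition 2a: a superset never loses coverage, so it is allowed.
    if is_superset:
        return True
    # Condition 2b: fitness gained must outweigh coverage lost (Eq. 6).
    if cov_overlap == 0 or fit_member == 0:
        return False
    return fit_test / fit_member >= cov_member / cov_overlap
```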
3.2.6 Termination Condition
Due to the nature of the algorithm we cannot prove that it ever reaches the global
optimum; as such, a termination condition must be used to approximate the amount of
time required for the algorithm to potentially converge to the global optimum. Our goal
during the optimization is to find the best-fitness chromosome classification for the
training set while additionally finding, where possible, any other classification
patterns that may exist in the training set.
While most applications use either a fitness threshold or a number of generations
before the system is assumed to have converged, Figure 3-7 shows that the population
would often discover its best member relatively early on, with the other members of the
population then slowly converging to this pattern. Given this, we define the termination
condition to be when the average population fitness is greater than or equal to 75% of
the current best fitness of the population for 5 consecutive iterations. This percentage
allows us to be relatively certain that the best pattern has been found, while the
requirement of consecutive iterations ensures that the algorithm does not terminate
prematurely when the system is seeded with random data, allowing the population to
stabilize into various clusters before stopping; from that point on, lower-fitness
chromosomes are assumed to simply be converging to the current best chromosome. This
value would likely need to be tuned based on the training set's distance between
clusters, the population's size, and a variety of other factors for different domains.

Figure 3-7 – The average population fitness compared to the best member's fitness over 5 iterations.
Termination should occur after the 3rd iteration, since the fitness returned was lower than in the
previous iteration.
During our experiments we noticed that, depending on the value of the termination
percentage, the algorithm could fail to converge on some datasets or converge too early.
As a result we experimented with alternative termination conditions that had the same
end goal. These included fitting a polynomial curve to the population's average fitness
values and calculating the tangent at the current generation; if the tangent approached
zero for n consecutive generations, the population was assumed to have stabilized at its
optima. This approach displayed similar results while being more computationally
expensive. Results of varying the termination condition can be found in Section 3.5.5.
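The primary termination check described above could be sketched as follows; the representation of the fitness history as per-generation (average, best) pairs is our own assumption.

```python
def should_terminate(history, ratio=0.75, patience=5):
    """history: one (avg_fitness, best_fitness) pair per generation.

    Terminate once the average population fitness has been at least
    `ratio` of the best member's fitness for `patience` consecutive
    generations, indicating the population has stabilized.
    """
    if len(history) < patience:
        return False
    return all(avg >= ratio * best for avg, best in history[-patience:])
```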
3.2.7 Data Clustering
Once this classification has been done, we assume that we have collected most of
the classification patterns that exist within the training set, and we continue to
prepare the data for processing in the next stage. This involves organizing raw posts
into clusters so that the data can be evaluated against the rule database and processed
for rule extraction. Organizing these patterns in a tree structure allows for
hierarchical matching while also allowing partial classification patterns to be utilized
in the future. An example of this can be seen in Figure 3-8.
Figure 3-8 – An example of how data clustering creates a tree structure. Patterns such as “Blackberry
– Bold – 9900”, “Blackberry – Bold - 9800”, “Blackberry – Q”, and “Blackberry – Z” were merged to
form a tree structure due to their common parent “Blackberry”. This allows future patterns to be
appended to this same parent while also allowing for specific classifications or superset classifications
to be analyzed.
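The merging illustrated in Figure 3-8 could be sketched with a simple nested-dictionary tree; the representation is our own, not the thesis's data structure.

```python
def build_tree(patterns):
    """Merge ordered keyword patterns into a nested-dict tree, so that
    patterns sharing a prefix (e.g. "Blackberry") hang off one parent."""
    root = {}
    for pattern in patterns:
        node = root
        for word in pattern:
            node = node.setdefault(word, {})
    return root

tree = build_tree([
    ["Blackberry", "Bold", "9900"],
    ["Blackberry", "Bold", "9800"],
    ["Blackberry", "Q"],
    ["Blackberry", "Z"],
])
# All four patterns merge under the common parent "Blackberry",
# with "Bold" branching into the 9900 and 9800 leaves.
```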
This structure allows branches of the tree to be analyzed for specific patterns,
which may be unique to particular classifications. For example, analysis of a superset
such as “Bold” may indicate that these specific models have a unique characteristic not
displayed by other models, in which case a rule can be applied to just this portion of
the hierarchy. It may be that these models were only released in a specific area or are
exclusively locked to a specific network, and thus sales outside of this area or on
different networks may be suspicious.
3.3 Rules and Rule Database
In this section we will define and derive the rules that are initially provided to the
rule database, discuss how the rules can be implemented, and how the rule extraction
process functions.
The following assumptions are based on the tactics used by Portland, Oregon's
Burglary Taskforce Unit to identify reported stolen property online, and as such it is
assumed that the thief or low-level fence is operating accordingly. These tactics may
not apply to career criminals; however, the framework itself is flexible and can be
extended to include other characteristics confirmed to be indications of stolen
property. For simplicity we will continue to discuss our work under the assumption that
these characteristics are correct. The assumptions are as follows:
- The date and time of the posting will be relatively close to the time the item was stolen.
- The seller is in close proximity to where the item was stolen.
- The item is being sold for less than the average market price.
These three characteristics will hold for most stolen-property posts for the following
reasons: the thief will attempt to sell the item soon after stealing it, as we assume
the theft was opportunistic and not premeditated; the thief will steal property in
reasonably close proximity to where they reside due to its ease of accessibility; and,
to dispose of the item quickly, they will sell it for less than its current market
value. These identifiers are heavily utilized during the active searching process when
comparing a post to a reported stolen item. The date and time of the posting relative to
the time the item was stolen, and the seller's proximity to where the item was stolen,
are both relative attributes and rely on reported information; that is to say, we cannot
quantify or extract suspiciousness patterns based solely on the time or location of the
post selling the item.
We refer to these three identifiers as primary attributes due to the strength of their
assumptions.
Additional elements may be used to identify suspicious items but are generally
harder to automate. These characteristics are as follows:
- The post uses a stock photo, or a photo from another post, for the item being sold.
- The seller negotiates or displays the desire to sell the item away from their residence.
- The post contains a poor description of the item, and the seller does not appear to have much knowledge of it.
- The post indicates that the seller is overeager to sell the item.
- The post contains telephone or contact information for the seller that is spelt out or obfuscated rather than given in plain digits.
These identifiers, including a lower-than-average market price, are quite passive
characteristics which do not reference a reported stolen item. Because they are
independent of reported cases, we refer to them as secondary attributes; they cannot be
used to correlate a post to a reported case, but they can be used to narrow down the set
of posts that must be searched in the searching process. The reasoning behind these
attributes is discussed in more detail in Section 3.3.2.
3.3.1 Application of Primary Attributes
As previously mentioned, we must make a set of assumptions on how the thief will
operate based on the documentation that is available; validation of the accuracy of
these assumptions is left for future work. In this section we discuss how the three
primary attributes can be used to indicate whether a post is suspicious.
3.3.1.1 Date and Time
It is reasonable to assume that in most cases the theft will be opportunistic and not
premeditated; the thief may research the average selling price beforehand, but it is
unlikely they will create a post for the item before the theft occurs. Because of this
we can assume that the thief's post will be created after the time of the theft, likely
with very little delay. Although pre-selling the item online before it is stolen reduces
the amount of time the thief must hold onto the item, it carries additional risk, since
they must acquire the item in a much shorter timeframe, as the buyer assumes the seller
has the item on hand. Pre-selling an item online would likely only occur for very
identifiable and risky items, in which case it would eliminate the time the thief holds
onto the item. Due to the complexity and narrow application of this concept, however,
we ignore this case.
This attribute measures the difference between the time of theft indicated by a
reported case and the time the post was created. A variance of 12 hours should be
incorporated to accommodate inaccuracies in the reported time of theft; although the
error in the reported time should be much less than this, it serves as an upper bound
for the margin of error. This feature is weighted by an exponential decay, with a
half-life relative to the average time of sale. The average time of sale can either be
provided as initial domain knowledge or extracted from previous posts as the amount of
time a post was active. Although the latter method does not directly indicate that the
item was sold, it indicates either that the item sold or that the seller gave up on
using this medium to sell it; in either case, the same characteristics can be expected
when the thief attempts to sell the item.
Weighting this value with an exponential decay assumes that the risk of holding onto a
stolen item increases over time, and thus that the thief will attempt to sell the item
as soon as possible. This makes it very unlikely to see the item being sold after 1-2
times the average selling time, at which point it can be assumed the item has already
been sold. That is to say, even if a perfect match is found, if the post has been up for
an extended period of time then it is unlikely the item is still in the possession of
the seller, and as such it is unlikely to be recoverable at that point. Although this is
an unfortunate occurrence, the system should prioritize confident and realistic leads
that can result in item recovery.
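The decay weighting described above could be sketched as follows; the half-life parameterization and the treatment of the 12-hour slack are our own reading of the text.

```python
from datetime import datetime, timedelta

def time_suspicion(post_time, theft_time, avg_sale_hours, slack_hours=12):
    """Exponential-decay weight on the gap between the reported theft
    time and the post's creation time; the half-life is the average
    time of sale, and 12 hours of slack absorbs reporting error."""
    delta = (post_time - theft_time).total_seconds() / 3600.0
    if delta < -slack_hours:
        return 0.0  # posted well before the reported theft
    delta = max(delta, 0.0)
    return 0.5 ** (delta / avg_sale_hours)
```

With an average sale time of 48 hours, a post created two days after the theft receives half the weight of one created immediately.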
3.3.1.2 Proximity
It is assumed that the seller will not transport the item a far distance and,
conversely, will not travel a far distance to steal an item, such that the thief and the
victim exist within relatively close proximity. However, Portland's Burglary Taskforce
Unit stated that thieves would often attempt to sell the item away from their residence
or in adjacent cities; because of this we expect the location to be within the
surrounding area.

This attribute measures the difference in location, using only city granularity, as
more detailed locations could lead to inaccurate trends. The feature would be weighted
by either a linear or exponential decay, having high certainty within the same city and
lower certainty as we expand beyond it; alternatively, a Gaussian distribution may be
applied over the area surrounding the seller's location, since it is likely the item was
stolen from one of these locations, while beyond the immediately surrounding cities this
probability quickly decreases.
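The Gaussian variant could be sketched as follows; the spread value is a hypothetical choice, not taken from the thesis.

```python
import math

def proximity_suspicion(distance_km, sigma_km=25.0):
    """Gaussian weight over the distance between the post's city and the
    reported theft location; highest within the same city, decaying
    quickly beyond the immediately surrounding cities."""
    return math.exp(-(distance_km ** 2) / (2.0 * sigma_km ** 2))
```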
3.3.1.3 Price
It is assumed that the risk of holding onto an item for a prolonged period of time
is not worth the minor increase in price that could be gained, and thus it is preferable
to dispose of the item as quickly as possible. To capture this, the price attribute
measures the difference between the post's price and the average market price of the
item. The average market price and standard deviation can be extracted from clusters of
posts with the same classification. This provides an estimated window for the current
market price of the item, and posts priced outside of this range are suspicious. Only
suspicious prices that fall below the average need to be examined, since suspiciously
high prices would contradict our initial assumption; although this case may be
interesting, we leave it for future work, since it would not represent a large portion
of stolen items.
The suspiciousness of a post's price can be defined as 1 - (1/k)^{|\Delta p|}, where
\Delta p = |P_{Market} - P_{Post}| / P_{Market} is the variance in price from the market
average. This allows all posts to carry some suspiciousness, but only relative to their
difference from the market average; an example can be seen in Figure 3-9. k represents a
domain-specific constant designed to conform the suspiciousness of the posts to the
standard deviation from the market average.
Figure 3-9 – Price histogram of Blackberry Bold 9900 posts with their respective suspicion.
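A sketch of this score might look as follows, assuming the 1 - (1/k)^{|Δp|} form with a hypothetical k = 2 and scoring only below-market prices, per the discussion above.

```python
def price_suspicion(post_price, market_price, k=2.0):
    """Suspicion of a below-market price: 1 - (1/k)**|dp|, with dp the
    relative deviation from the market average. Prices at or above
    market score 0, per the assumption that thieves undersell."""
    if post_price >= market_price:
        return 0.0
    dp = abs(market_price - post_price) / market_price
    return 1.0 - (1.0 / k) ** dp
```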
As the market price changes over time, an acceptable timeframe in which posts
can contribute to the average must also be defined, so that the average is not inflated
by previously higher prices. Sites such as Craigslist restrict retrieval of posts more
than 6 months old, which is an acceptable timeframe for calculating a stable average
market price.

The standard market price, often referred to as the market average, can be
calculated in a variety of ways. Although simply averaging the prices of all posts under
a specific classification yields a value similar to what we would expect, it is often
substantially lower. As a result more advanced methods must be used; the simple average
does, however, give us an idea of where to begin searching for the true market price. As
we will discuss in Section 3.5.3, creating a histogram of the post prices allows us to
identify specific curvatures that indicate the true market price.
3.3.2 Application of Secondary Attributes
These attributes are often much harder to quantify and apply; however, they
provide a method of pre-filtering the posts before they are actively searched. They also
provide ways to infer suspicion, similar to a subconscious “gut instinct”, and as such
should be explored. Although these attributes are much weaker indicators than their
primary counterparts, and a single secondary attribute does not provide the same degree
of suspicion as a primary one, many secondary attributes in conjunction can strengthen
the suspiciousness of a post.
3.3.2.1 Stock Photos
This attribute attempts to account for the image that may be provided of an item.
One of the most predominant ways people attempt to identify their stolen item is through
uniquely identifying marks or characteristics; as such, we assume that the seller will
attempt to conceal these by either not displaying a photo of the item or displaying a
stock image instead.
To check if the image is unique, we take a hash of the seller's image (if provided)
and compare it against hashes of popular Google image results for that classification
and of other posts' images. This tells us whether the seller has used a common Google
image to represent their item or is reusing another post's image; either is a strong
indicator that, at the very least, the seller is being dishonest about the state of the
item. This concept could be extended to comparing features within the photos in an
attempt to find unique characteristics of the stolen item; however, that goes beyond the
scope of this system.
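The hash comparison could be sketched as follows. We use SHA-256 over the raw bytes for illustration, which only catches exact duplicates; the thesis does not specify the hash, and a perceptual hash would be needed to catch resized or recompressed copies.

```python
import hashlib

def image_digest(image_bytes):
    # Exact-duplicate check via a cryptographic hash of the raw bytes.
    return hashlib.sha256(image_bytes).hexdigest()

def is_reused_photo(post_image, known_digests):
    """known_digests: hashes of popular image-search results for this
    classification plus images already seen in other posts."""
    return image_digest(post_image) in known_digests
```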
3.3.2.2 Poor Description
This attribute attempts to quantify the seller's knowledge of the item they are
selling. We assume that the thief will not have much knowledge of the stolen item, which
will be true for most opportunistic thefts. In some domains, such as common electronics,
this may not hold; however, the attribute may be more useful in other domains.

To determine whether the seller has some knowledge of the item, we check whether the
seller is using a unique description of it. Similar to the photo hashing, we can compare
the seller's description to commonly marketed descriptions. However, because small
portions of the description, such as the contact information, may be altered, we cannot
use hashing; instead we must compare descriptions to find how much of the text overlaps
with other posts. For example, if the seller simply copies a technical specification
into their post, we would consider this suspicious. This may not always be accurate,
specifically in the case of electronic goods where sellers often copy descriptions;
duplicate descriptions in other domains, such as rental property listings, would be very
suspicious.
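The overlap comparison could be sketched with word n-grams; the choice of n and the asymmetric overlap ratio are our own illustrative choices.

```python
def description_overlap(desc_a, desc_b, n=5):
    """Fraction of desc_a's word n-grams that also occur in desc_b; a
    high value suggests a copied (e.g. manufacturer spec) description."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    grams_a, grams_b = ngrams(desc_a), ngrams(desc_b)
    if not grams_a:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a)
```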
3.3.2.3 Eager to Sell
This attribute attempts to quantify the seller's eagerness to sell an item. All posts
display some amount of eagerness, since everyone making a post has the desire to sell
their item. In comparison to the other secondary attributes, however, this attribute is
a strong identifier for potentially stolen items; since it is one of the main
characteristics we expect from a thief or fence, it should be present in all
stolen-property postings. Unfortunately this attribute is quite difficult to quantify,
as it cannot be directly measured like the other attributes; however, the seller saying
they are willing to reduce their price, or explaining why their price is lower than
others, can be used as an indication.
This attribute may be identified from short common phrases such as “want to sell
fast”, “want to sell as soon as possible”, “need cash quick”, etc. Even these few
examples show the diversity of implied eagerness to sell, which could be even more
subtly implied in the tone of the post. Using short 3-5 gram common strings we can
attempt to identify this attribute; furthermore, these strings could be compiled
automatically from suspicious posts. This concept will be explored further in the rule
extraction phase, as it is one of the most promising attributes that could be applied in
the classification of suspicious posts.
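The phrase matching could be sketched as follows; the phrase list extends the examples in the text with one hypothetical entry, and in the framework such lists could be compiled automatically during rule extraction.

```python
EAGER_PHRASES = (
    "want to sell fast",
    "want to sell as soon as possible",
    "need cash quick",
    "price is negotiable",  # hypothetical addition, not from the text
)

def eagerness_score(description, phrases=EAGER_PHRASES):
    """Count the eagerness phrases present in a post's description."""
    text = description.lower()
    return sum(1 for phrase in phrases if phrase in text)
```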
3.3.2.4 Contact Information
As mentioned by the Oregon’s Burglary Taskforce Unit many sellers may attempt
to obfuscate their contact information by spelling it out or adding format characters to
prevent search engines from effectively clustering their posts. This poses a problem since
this attribute becomes quite valuable when used to cross-reference the seller’s telephone
or contact information with other posts. Checking for inconsistencies such as different
seller names or telephone numbers for posts within the same timeframe is highly
suspicious. Although it is understandable that people move or change phone numbers
over time overlapping information should not be common, an example of why this is the
case would be the fact that telephone companies must hold onto telephone numbers for 6
months before it can be reissued; as such this would be an appropriate window of time to
check for conflicting information. Additionally this can be used to identify the seller and
infer suspiciousness between posts by the same seller, as it is unlikely that a thief will
only be selling a single stolen item. This concept also increases the suspiciousness of
their posts if multiple posts by the same seller have a small degree of suspiciousness.
This can be achieved by indexing the seller's full name (if provided) to their
contact information along with a date range of use; many overlapping ranges with
different contact information are highly suspicious. Conversely, indexing the seller's
contact information to their full or partial name along with a date range of use
indicates suspicion when the same number results in multiple names for a given date
range. In the first case the full name must be used, as indexing on only the first name
would collide with many different sellers with different contact information; in the
second case a partial name can be used, since we expect that a number will only map to
one or two people at a time. Given that local number registries must hold onto a number
for 6 months after it is released, there should be very few instances where the same
number maps to multiple legitimate sellers.
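The number-to-name conflict check could be sketched as follows; the tuple representation and integer day indices are our own simplification of the indexing described above.

```python
from collections import defaultdict

def conflicting_numbers(posts, window_days=180):
    """posts: (seller_name, phone_number, posting_day) tuples; flags
    numbers used under different full names within the 6-month number
    reissue window, which the text treats as highly suspicious."""
    by_phone = defaultdict(list)
    for name, phone, day in posts:
        by_phone[phone].append((name, day))
    flagged = set()
    for phone, uses in by_phone.items():
        for i, (name_a, day_a) in enumerate(uses):
            for name_b, day_b in uses[i + 1:]:
                if name_a != name_b and abs(day_a - day_b) <= window_days:
                    flagged.add(phone)
    return flagged
```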
Now that we have an understanding of the rules that can be used, the rule process
as a whole can be generalized as an iterative probabilistic model matching each of the
characteristics defined by the user input, giving us a list of matching postings ordered
from most probable to least probable and, in the case of passive detection, ranking
posts from most to least suspicious. Although we have attempted to incorporate many
rules into the rule set, it is far from complete; however, the reporting system can
serve as a hybrid of reinforcement learning and training set, based on confirmations by
the police as to whether a post matched that of the reported case. It should be noted
that we chose a database approach for the rules so that portions of the code do not need
to be updated as the rules change; we also offload the rule calculations to the rule
database. This allows for the external storage of rules and threshold values, making the
system somewhat more scalable should multiple instances of the framework be deployed.
3.3.3 Rule Extraction
The rule extraction process occurs after the posts have been classified into
clusters, processing specific clusters in an attempt to identify new patterns for each
cluster. This is intended to extract new patterns from suspicious clusters that can be
used to identify other aspects of suspicion; however, it can also be applied to
legitimate clusters to extract further metadata about them. One approach to determining
the seller's eagerness to sell was searching for n-gram phrases common amongst all
suspicious posts, excluding the keyword patterns that were used to cluster the posts in
the first place. A more relaxed approach of simply feeding the suspicious clusters back
into the classification system could also produce interesting results; removing the
n-gram requirement would allow catchy keywords or other aspects to be discovered.
3.4 Reporting Database
This system will be covered in greater detail in Chapter 4 however we will present
an overview of the system functions and requirements. The reporting database serves the
function of allowing the public or police authorities to submit the stolen reports into a
linguistically fixed database. This removes the issue of having to determine the context
and various other aspects of processing natural language before they can be entered into
the database. Users would be asked to select or fill in the information regarding their
stolen property, such as stolen item, make, model, color, etc. The input structure would
be rigidly controlled, leveraging domain knowledge of the target domain to prompt for
specific pieces of information such that there is no variance in the location of the
information and ensuring that no details are missed. The system’s input structure would be
imported from domain expertise for each domain but could later be extracted or updated
directly from the classification system’s tree structure.
This concept is similar to Shiri et al.’s work on a thesaurus-enhanced search
interface in [16], which leverages the domain information structure to iteratively present
information to the user such that it logically progresses to the user’s desired target. However
their use of thesauri to assist in the searching process is difficult for this application, since
domain-specific thesauri would be required but are very difficult to find. As such we
would either be required to manually create them based on domain-specific knowledge or
leverage machine learning to dynamically create and maintain them.
However since the system is using controlled fixed-field input from the reporting
system to compare against the natural language contained in the posts, there must be
some flexibility to deal with the different ways people can describe things. For example,
people may use abbreviations or inference based on surrounding context, such as “BB
Bold” or “selling IPhone”, without explicitly stating the manufacturer. This concept
extends to future linguistic abbreviations that are not yet known or have not yet
developed. To address this we can leverage the semantic
relations between nodes in the domain knowledge tree to infer missing pieces of
information. For example, a post carrying the classification “Bold 9900” displays a high
degree of overlap with the classification “Blackberry Bold 9900”, and as such we can
infer that the brand is “Blackberry”. From this inference we can also learn new linguistic
abbreviations as the data is processed and alias searches to include those keywords as
well. For example, if “BB” were detected as an abbreviation for “Blackberry”, then a
user searching for “Blackberry Bold 9900” would trigger searches for “Blackberry Bold
9900”, “BB Bold 9900”, “Bold 9900”, etc.
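A minimal sketch of how such alias-expanded searches might be generated; the alias table and the `expand_query` helper are hypothetical illustrations, not part of the thesis implementation:

```python
def expand_query(query_terms, aliases):
    """Generate alternative search queries by substituting learned aliases
    and by dropping the leading brand term."""
    variants = {tuple(query_terms)}
    for i, term in enumerate(query_terms):
        for alias in aliases.get(term.lower(), []):
            variants.add(tuple(query_terms[:i] + [alias] + query_terms[i + 1:]))
    # Also allow searching without the brand term entirely.
    variants.add(tuple(query_terms[1:]))
    return sorted(" ".join(v) for v in variants)

aliases = {"blackberry": ["BB"]}  # hypothetical learned alias table
print(expand_query(["Blackberry", "Bold", "9900"], aliases))
```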
This aliasing or semantic information can be extracted during the classification
process. Should two classification patterns display large amounts of overlap, say 80-90%,
then we can assume the two classifications are related. This holds true for a majority of
cases since upper portions of the classification tree will overlap for child nodes of the
same parent node, and thus their relation is the parental node. However we cannot
guarantee that the differing portions can be aliased or allowed to be traversed during the
comparison process until other attributes are examined. Given the following classification
pattern examples “Apple – IPhone – 5 – 64GB” and “Apple – IPhone – 5s – 64GB”,
although these two classification patterns share a large degree of overlap in their
corresponding keywords, we cannot alias “5” and “5s” because these are different
models. This can be confirmed by comparing the average market price and standard
deviation for each. However if we introduce the pattern “Apple – IPhone5 - 64GB”,
although this shares fewer similarities with the previous patterns, comparing its market
price and standard deviation with those of the posts selling the IPhone 5 shows
that the two classifications have a large overlap in price.
overlap within the parent nodes would be a strong indication that the two classifications
can be safely aliased. This allows the framework to handle various formatting methods
that users may use to indicate their item in posts, including potential spelling mistakes or
abbreviations. Additional measures could be introduced such as regular expressions
applied to each classification to determine if the differences are just formatting characters
such as spaces or dashes.
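The overlap-and-price test described above might look roughly like the following sketch; the 50% keyword-overlap threshold, the one-standard-deviation price test, and the `can_alias` helper are all assumptions for illustration:

```python
def can_alias(pattern_a, pattern_b, stats_a, stats_b, overlap_min=0.5):
    """Decide whether two classification patterns may be aliased.
    stats_* are (mean_price, std_dev) tuples for each cluster."""
    a = {p.lower() for p in pattern_a}
    b = {p.lower() for p in pattern_b}
    overlap = len(a & b) / max(len(a | b), 1)
    if overlap < overlap_min:
        return False
    # Require the price distributions to overlap within one std deviation.
    (mean_a, sd_a), (mean_b, sd_b) = stats_a, stats_b
    return abs(mean_a - mean_b) <= max(sd_a, sd_b)

# "IPhone 5" vs "IPhone 5s": high keyword overlap but distinct prices.
print(can_alias(["Apple", "IPhone", "5", "64GB"],
                ["Apple", "IPhone", "5s", "64GB"],
                (350, 40), (500, 40)))  # -> False
```

The price comparison is what prevents "5" and "5s" from being aliased despite their keyword overlap.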
3.5 Experiments
In this section we discuss a series of experiments performed to
measure and validate various components of the system. These experiments vary
from quantifiable tests, such as accurately classifying posts, to unquantifiable tests, such as
classifications used in rule extraction. We also describe the performance analysis of the
system and how various parameters can be tuned based on the desired results for a given
domain.
3.5.1 Experimental Environment
In our experimental environment a workstation running the proposed framework,
as shown in Figure 3-10, is used to retrieve ads from various online classifieds websites.
The proposed system is developed in C++ with Microsoft Visual Studio 2010 and run on a
workstation with an AMD FX-8350 3.0 GHz processor and 16 GB of RAM, utilizing the
Boost [17] library for regular expressions and lexical casting, and the GSL [18] library for
polynomial curve fitting. The primary source of ads was the Craigslist “mobile”
section [19]; an estimated 200,000 posts were analyzed between February and May
2013, collected from the Craigslist Canada and US sites. Initial experiments were
limited to a single site due to having to strip off irrelevant HTML code that could
potentially cause problems during the classification process.

Figure 3-10 - Diagram of the process the framework undergoes while determining suspicious posts.
3.5.2 Experiment – Online Post Classification Accuracy
This experiment tests the accuracy of classification pattern extraction, which
serves to later classify the posts; thus the accuracy of the extraction directly relates to the
accuracy of later classifications.
3.5.2.1 Single classification
In this experiment the intent was to extract the first keyword from a given set of
posts that would best describe the set of posts (i.e. the keywords that occurred in a
majority of the posts). We refer to this test as non-uniform because its distribution over
cellphone models is neither balanced nor natural; the training data only contains a single
model of cellphone. Looking at roughly 300 cellphone posts on Craigslist, we can see
from Figure 3-11 below that the keyword that occurred the most was “IPhone”, since most
of the differential evolution population converged to this point. Other spikes are
introduced by “noisy” transition words such as “With”, “Or”, and “To”, which are
substantially reduced by the inclusion of the global set but still indicate a significant
problem. The depth of the diagram indicates the generation with respect to differential
evolution, and from this we can see that, as expected, many major trends were not
discovered until much later in the convergence.

Figure 3-11 - Non-uniform distribution of training set posts; showing an optimum occurring for
keyword “IPhone”, with noise around the words “with”, “or”, and “to”.
Continuing the experiment with a different dataset containing Blackberry models,
we found that as the keyword requirements expanded the accuracy of the classification
patterns increased. As we can see below in Figure 3-12, after the 3rd iteration the highest
fitness keyword pattern was “Blackberry – 9900 – Bold” which is enough to identify the
posts. It should be noted that although this dataset should have only contained a single
classification pattern, the results returned by Craigslist contained multiple models.
Because of this we can see, based on the fitness values, that “Blackberry – Bold – 9900” is a
very strong trend while “Galaxy – S3” is a much weaker trend. Although “Galaxy – S3”
is a legitimate pattern, due to there being stronger classification patterns in the third
iteration, as indicated by the fitness values, this weaker trend is effectively ignored at this
time. In later iterations, when posts classified by “Blackberry – Bold – 9900” are not included
in the dataset, this trend would be more visible. A post can contain multiple models for a
variety of reasons; the most prominent is simply to increase the number of search results
that return the post. Slightly more legitimate reasons are when a post attempts to sell
multiple models or references a new model as the reason for selling the current cellphone.
Figure 3-12 - Highest fitness classification patterns for three iterations; in the first iterations we can
see a lot of noise but in subsequent iterations this noise is substantially reduced as the legitimate
“Blackberry” trend extends.
It should also be noted that keyword order currently does impact the fitness of the
classification; although this effect is often very minor, it is the result of the fitness function
weighting the fitness of the current iteration against the previous or base iteration. An
example of this is the fact that during the second iteration the pattern “Blackberry –
9900” had a very high fitness while “Bold - Blackberry” had a much lower fitness,
resulting in “Blackberry – 9900 – Bold” having a higher fitness than “Blackberry – Bold
– 9900”. Although this is irrelevant from a classification perspective, it prevents very low
fitness words from being appended to true keywords and producing comparable fitness values.
3.5.2.2 Multiple classifications
We also attempted to classify datasets that had multiple classification patterns
within them, in this case a uniform mixture of posts from “Blackberry” and “IPhone”. In
Figure 3-13 we can see that both patterns are present. Similar to the previous experiment
we can see that there still exists some noise but there are two clear major trends,
“Blackberry” and “IPhone”. However this result is different from the previous
experiment, as the fitness of the “IPhone” trend is comparable to “Blackberry”; although
the “IPhone” trend is weaker, it would not be interpreted as noise due to having
sufficient representation in the dataset.

Figure 3-13 - Uniform distribution of training set posts; showing optima occurring for both
“Blackberry” and “IPhone” keywords.
Posts that exist in multiple classifications may have equal membership to each,
and thus lower the confidence of the membership and subsequent “suspicious” criteria.
This likely explains why the strength of “Blackberry” and “IPhone” trends were not
equal, since some IPhone posts likely also contained the “Blackberry” keyword.
If we look more specifically at some of the posts within the dataset we see some
that contain multiple classification patterns, such as Figure 3-14, which contains the
patterns “Blackberry – Bold – 9900” and “BB – Bold – 9900”. From the system’s
perspective these are different classifications and would be treated as if the post contained
two completely different models. However if this information is actually an
alias, as it is in this example, we would expect to see large overlapping portions within the
classifications. Looking at “Blackberry – Bold – 9900” and “BB – Bold – 9900”, due to
the large overlap in the later parts of the keyword chain, “Blackberry” and “BB” can be
aliased. This aliasing information can be applied either directly on the posts via
pre-processing or afterward by mapping the two terms in an aliasing database used by the
reporting system.
3.5.3 Experiment – Price Extraction and Market Price
In this series of experiments we attempted to extract the price from clusters of
posts; prices were extracted using a regular expression searching for a monetary symbol
such as “$” followed by a series of numbers. Using the same dataset we manipulated the
gradient of the price histograms in attempts to extract information about the structure of
the data. In Figure 3-15 below we can see a price histogram for the classification
“Blackberry Bold 9900” which contains a few interesting characteristics: the average
falls around the value of $210, which very steeply falls off around $150. We can also see
that there are more posts with slightly lower than average prices than slightly above
average prices; this is expected due to the fact that undercutting other posts’ prices will
happen in this type of market. There also exists a peak at the lower end of the prices,
which is likely due to accessories or services being advertised for the product.
Figure 3-14 – Example of a post containing aliasing information for “Blackberry” and “BB” based on
extracted classification patterns.

Figure 3-15 – Price histogram of Blackberry Bold 9900; indicating the average price is around $210,
with more posts having a slightly lower price than that.
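The regular-expression price extraction can be sketched as follows. This is an illustrative Python version; the thesis uses Boost regular expressions in C++, and the exact pattern is an assumption:

```python
import re

# Assumed pattern: a "$" symbol directly preceding the amount.
PRICE_RE = re.compile(r"\$\s*(\d+(?:\.\d{2})?)")

def extract_prices(text):
    """Return every dollar amount found in a post, in order of appearance."""
    return [float(m) for m in PRICE_RE.findall(text)]

post = "Blackberry Bold 9900 for sale - $210 obo, paid $600 new"
print(extract_prices(post))  # -> [210.0, 600.0]
```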
Looking at the data again with a smaller gradient in Figure 3-16, we can see that
a similar trend exists: as outlined in black, there are services being advertised at the
low price range and there is a slightly slanted curve around the average from price
undercutting. However the sections highlighted in red indicate areas that are highly
suspicious, as they exceed the expected trends. The first is a very high peak around the $100
mark, which could indicate either higher end services or very low selling prices for
the item. This also occurs around the $150 price point.
In either case we see four major types of prices. The $1-70 range is expected to be
garbage data that has not been filtered properly into sub-classifications, advertising
peripherals or services such as batteries or repairs. The $70-140 range contains the
suspiciously lower priced posts. The $175-240 range is assumed to contain legitimate posts,
with the possibility of searching the lower bound with slightly relaxed suspicion. Finally,
the $275-400+ range is above the market average and does not need to be searched, since
it is unlikely that a stolen phone would be sold above market value.
Figure 3-16 – More granular price histogram of Blackberry Bold 9900; identifying suspicious
peaks at the $100 and $150 ranges.
Finally, looking at Figure 3-17 below with a slightly coarser gradient, we
can see that the average price is surrounded by very sharp declines which gradually
increase between $175 and $230, followed by another sharp decline. This
characteristic is interesting, as the edges of the average range are met with sharp
edges or bounds occurring at the $175 and $250 price points. Although these could simply
be disjoints in the data, they could also be clear indicators of the bounds of the average, and
from this, what data falls below it.
Excluding the upper and lower 20% of prices resulted in the price histogram in the
figure above, with an average price of $182.50. Although this average is slightly lower
than the true average of the phone, it is still close enough to the $200-210 peak to identify
that section of the graph. As previously stated, this was expected due to including the
“suspicious” prices in the average calculation. Although these experiments only
manipulated the gradient of the same data, each produced a unique result useful
in the market average price extraction process.
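The trimmed averaging used here can be sketched directly; the example prices are hypothetical:

```python
def trimmed_mean(prices, trim=0.2):
    """Average after discarding the upper and lower `trim` fraction of
    prices, reducing the pull of accessory posts and outliers."""
    ordered = sorted(prices)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

prices = [10, 100, 180, 190, 200, 210, 220, 240, 480, 1000]
print(round(trimmed_mean(prices), 2))  # -> 206.67
```

Dropping the two lowest and two highest prices here removes the accessory-level and outlier values before averaging.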
Figure 3-17 – The Blackberry Bold 9900 price histogram resembles the function |sin(x)/x|; also
displaying clear boundaries at the edges of suspicious pricings.
3.5.4 Experiment – Rule Extraction on Suspicious Clusters
Before beginning the experiment of testing the rule extraction process on
suspicious clusters, we verified that there were posts in the suspicious clusters that were
both suspicious and contained desirable patterns that we would want to extract. Below are
two examples of posts that indicate there are more complex patterns to identify. The first,
in Figure 3-18, demonstrates the person’s desire to sell the item quickly through the use of
“ASAP”. Additionally, the use of “cash pick up” could be an indication that the item
is stolen; alternatively, it could explain why the price is lower, as the seller is not
willing to travel at all to sell it. The second post, in Figure 3-19, demonstrates much
more complex patterns that we would want to identify: firstly, it uses a stock
image, and secondly, it requires meeting in person. Although the post provides extensive
information regarding the item, it does not provide any external method of contacting
the seller. Although this is largely speculation as to what these potential indicators could
mean, it is more important that the system be able to recognize these patterns.
Figure 3-18 – A post from a suspicious cluster that has a low price and indicates that only
“cash pick up” is acceptable, along with the indication that they want this transaction to
occur as soon as possible.

Figure 3-19 – Post from a suspicious cluster that uses a stock photo and requires that the
transaction occur in person.
This experiment was conducted on the suspicious cluster of “Blackberry – Bold –
9900” posts, which contained roughly 100 posts. Unfortunately, due to the small training
set size, no useful patterns were extracted from the set. For domains that lack sufficient
suspicious posts, a fundamental assumption would have to be made in order to increase
the training set size and continue the experiment: that the patterns among various branches,
or even across a domain, are consistent. If this assumption is true we would
still expect domain-specific trends to become indistinguishable from noise, but
broad domain trends to emerge. If this assumption is not true then it is likely
that no trends will emerge, requiring us to further analyze each domain
independently once larger suspicious clusters are obtained. Due to our lack of suspicious
data in other domains we attempted the same process on another domain.
This experiment was also conducted on the suspicious cluster of “IPhone – 4s”
posts which produced similar results to that of the suspicious cluster of “Blackberry –
Bold – 9900” when attempting to extract patterns from the cluster as a whole. However,
analyzing the cluster divided into the following price ranges: $110-135, $130-165, $165-190,
and $190-210, produced better results. Each of these price ranges contained 50-150
posts and reflects individual portions of the price distribution that are suspicious, as can
be seen below in Figure 3-20. It should be noted that the average price was roughly $200,
and because of that the $190-210 range acts as a control to compare the
other extracted trends against.
Figure 3-20 – Price histogram of IPhone 4s; indicating that the average price was around $200 with a
few suspicious peaks around the $125 and $150 ranges.
The results we were most interested in were those where the extracted patterns were
forced to follow an n-gram format, attempting to extract small strings unique to each data
set. It should be noted that the other price ranges were included in the global comparison
set for each experiment, so although the trends are weaker, they reflect the absolute
differences between these sets.
Price Range   Patterns
$110-135      CABLE -- ONLY
$130-165      SHAPE; CONDITION; PROTECTIVE -- CASE
$165-190      LOCKED; AMAZING -- CONDITION; PERFECTLY -- BUT; ASAP -- TO
$190-210      (no unifying patterns)
From the summary of the results in the table above, we can see that
although postings were pulled from cities all over North America, a common
keyword in all the price ranges was “Toronto”, which may indicate that this product is
limited to Canada. Ignoring patterns that reference this city or commonly listed prices
($120, $150, and $200), we can see that the $110-135 range contains patterns such as
“CABLE -- ONLY”, which indicates that components are missing and explains why
the post would have a lower price.
Looking at the $130-165 price range, we see many patterns that reference
“shape” and “condition”, which refer to the same thing, while patterns such as
“PROTECTIVE -- CASE” indirectly indicate that the item has not been
damaged while also attempting to add value to the item.
When looking at the $165-190 price range, we see many patterns referring to
“LOCKED”, which refers to the fact that the item cannot be used on a different
carrier and thus has limited usability and audience. We also see patterns such as
“AMAZING -- CONDITION” and “PERFECTLY -- BUT”, which may be exaggerating
the condition of the item. Interestingly, the only occurrence of the pattern “ASAP -- TO”
was in this price range; it is the only direct display of the seller’s desire to sell the
item quickly and is one of the more predominant patterns in this dataset.
Finally looking at the $190-210 price range, where the market average is located,
there are no unifying trends regarding the extracted patterns.
Although in this case these classifications cannot be directly translated
into rules in their current condition, they can be used in conjunction with NLP techniques
such as sentiment analysis to determine a post’s membership to a given price range. This
is useful as it allows for the ability to quantify some previously discussed attributes, such
as the poster’s eagerness to sell.
3.5.5 Experiment – Performance Analysis
In this experiment the intent was to analyze the performance of the classification
system by examining how its population changes over time. This
experiment involved running the classification algorithm 5000 times on a set of 4000
IPhone 5 posts in order to determine the point at which the system could assume a correct
classification for the set.
The results, which can be seen in Figure 3-21, show the percentage of iterations
that discover either the global optimum “IPhone” or the local
optimum “5”, both of which lead to the correct classification pattern. The best,
average, and worst lines refer to the margin between the best member’s fitness in a
population and the population average, scaled relative to the
fitness of the known optimum “IPhone”. From this graph we can conclude that running
further generations past 90 ultimately does not greatly improve single optima discovery
but simply transitions other members of the population into the discovered optimum.
However achieving this can be done in different manners; given that our termination
condition is composed of two components, we can either select a very low margin and
low consecutive generations or a very high margin and high consecutive generations. The
reason consecutive generation requirements are in place is to account for poorly seeded
data that results in a smaller margin than expected; an example of this can be seen in the
worst case, where the margin is 15% lower than it should be.
It should be noted that our selected termination condition was extremely
conservative; a 25% margin for 5 consecutive generations results in the correct
classification of even the worst case.
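A sketch of such a two-component termination condition; the direction of the margin comparison, the `should_terminate` helper, and the sample history are assumptions for illustration:

```python
def should_terminate(margins, margin_req=0.25, consecutive=5):
    """Stop once the best-vs-average fitness margin (relative to the
    known optimum) has met the requirement for N generations in a row."""
    streak = 0
    for m in margins:
        streak = streak + 1 if m >= margin_req else 0
        if streak >= consecutive:
            return True
    return False

# Hypothetical margin per generation as the best member pulls ahead.
history = [0.05, 0.10, 0.30, 0.35, 0.40, 0.45, 0.50]
print(should_terminate(history))  # -> True
```

Requiring the margin to hold for several consecutive generations guards against poorly seeded runs that briefly meet the margin by chance.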
Figure 3-21 – Performance analysis graph; showing the percentage of iterations that achieved at least
one member of the population to the global optimum “IPhone”; Best, Average, and Worst lines
indicate the margin between the best member in the population and the population average relative
to the fitness of “IPhone”.
3.6 Discussion
3.6.1 Ad retrieval
The amount of time spent in the ad retrieval stage is quite significant, and as the
number of input sources increases the performance of the system will degrade, as much
of its processing will be maintaining an up-to-date database. To address this issue we
looked into leveraging the HTTP status code 304 “Not Modified”, which would allow for
checking if a page has been modified (either search index or leaf pages/posts) without
re-downloading it. Although this has not been fully implemented, it would make sense to
simply maintain the connections with the input sites and periodically check for
modifications to the posts along with new posts. This is due to the fact that the posts
could be altered shortly after creation to correct something or add additional information
which would have been missed in the initial download.
Alternatively, the files could be re-downloaded and the page hash compared to that
stored locally; although this still consumes bandwidth and some processing resources, it
removes the need to fully reprocess unchanged pages.
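The hash-comparison alternative can be sketched as follows; the `page_changed` helper is hypothetical, and the thesis does not specify a hash function (SHA-256 is an assumption here):

```python
import hashlib

def page_changed(new_html, stored_hashes, url):
    """Compare a freshly downloaded page against its stored hash; only
    reprocess (and update the stored hash) when the content changed."""
    digest = hashlib.sha256(new_html.encode("utf-8")).hexdigest()
    if stored_hashes.get(url) == digest:
        return False  # unchanged: skip full reprocessing
    stored_hashes[url] = digest
    return True

hashes = {}
print(page_changed("<html>post v1</html>", hashes, "post/123"))  # True
print(page_changed("<html>post v1</html>", hashes, "post/123"))  # False
print(page_changed("<html>post v2</html>", hashes, "post/123"))  # True
```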
3.6.2 Price Extraction
One of the main drawbacks of the rule extraction process is that if no
additional trends emerge to support the suspiciousness, then the only suspicious attribute
being checked is the post’s price. This is due to the fact that other primary attributes
cannot be leveraged at this point, since they require a user-submitted report to compare the
time and location components against.
Because of this, newly released items will not conform to traditional price models and
there will not be enough data to extract suspicious characteristics. This becomes a
significant problem if the extracted features change faster than the time it takes to
compile an adequately sized training set from which the trend can be extracted. An example of
this is the price feature: when there were enough posts to classify a new type of
item, the extracted price would be incorrect due to having both the item’s launch prices and the
most recent prices included in the price calculation. This can somewhat be resolved
by excluding a percentage of the upper and lower price ranges; however, that only solves the
issue for posts containing higher prices. For example, when a newer model
of the current item is released we would expect to see a drop in the value of the
current model; if these lower prices are excluded from the price calculation then the average
is distorted above the item’s true value, resulting in a significantly higher rate of false
positives. This can only really be solved by providing more domain knowledge to the
system about expected or newly released models, as the system would not have enough
posts to properly classify the new model by the time the old model’s value dropped.
However it should be noted that the price component is modular and can be changed to
account for this in the future should a better technique or price model be developed.
Duplicate posts on multiple sites also introduce issues, biasing the market average
price towards the duplicated price (either higher or lower), since multiple posts with that
price are taken into account when calculating the market average. This can be
addressed to a limited extent by hashing the content of each post and checking between
sites, though this may introduce errors and delay.
Price inconsistencies within a post have also been noticed, introducing issues
since we must determine which price to use in the market average calculation. For our
work we chose the highest price contained in the post, after noticing that a large number
of posts containing inconsistencies would list a lower price in the title to entice
prospective buyers to click the post and then a slightly higher price within the
body. However this approach is not complete, since there were also cases of sellers
stating their original purchase price and then their asking price, which would result in an
incorrect price being extracted. More complex semantic analysis may produce better
results, weighting the distance between the prices and the approximated point of
product classification; an example of this can be seen in Figure 3-22. This weighting may
be inverted due to a majority of posts containing classification information in the title.

Figure 3-22 – An example of possible semantic analysis; determining the correct price by the distance
between the post’s classification and the prices listed within the post.
3.6.3 Rule Extraction
The rule extraction process assumes that there are keyword trends in the posts that
can be found using n-gram or keyword classification. This assumption could
be wrong, and the required rules may be much more involved, requiring tonal and
inferential NLP techniques. This ultimately comes back to the issue that there is no
controlled training and testing set against which this model can be compared. However,
if there are simple trends that can be harvested, then it is possible to generate an initial
training and testing set based on this framework for many domains.
Chapter 4
Reporting System
In this chapter we introduce the reporting system, which serves two major
purposes. The first is that it acts as an interface between harvested data from posts and
users’ reported stolen items, automating the searching process when users are looking for
stolen property. This allows the public or police authorities to submit reports of stolen
items into a linguistically fixed database, removing the need to determine sentence
structure, context, or other aspects of processing natural language before they can be
entered into the database. The second purpose, while slightly more abstract, is to allow the
leads generated through the initial suspiciousness classification to become actionable.
Although suspicious clusters may identify key sellers, the authorities may not always
have the resources to investigate these sellers without reported cases of theft. That is to
say, without a user reporting a case of theft to the police, and in turn the reporting system,
suspicious clusters cannot be actively investigated until their suspiciousness is
confirmed.
4.1 Design
Building on the designs of previous research in [16], the reporting process from
the user should involve traversing down a hierarchy of input parameters based on the
domain knowledge of the system. These relations are more formally denoted as Broad
Term (BT), Narrow Term (NT), and Related Term (RT), which indicate the relations
between various nodes within the domain knowledge tree. An example of how these
relations would be applied in this research can be seen in Figure 4-1; “Blackberry” and
“Apple” are both types of cellphones and thus have an NT relation with the node
“Cellphones”; inversely, the “Cellphones” node has a BT relation with the “Blackberry” and
“Apple” nodes. While these two-way relations are quite simple, there also exists a two-way
RT relation between the “Blackberry” and “Apple” nodes, since they are both NTs of
cellphones. This pattern of relations will often occur at each level of the tree but will also
relate to other distant nodes based on the domain knowledge the tree was built on. The
concept of related terms can also point to orphaned nodes, which effectively aliases these
nodes allowing for these related nodes to indicate an alternative context for the input. An
example of this could be the “Blackberry” node being aliased to the “BB” node, which is
sometimes used as an abbreviation. This related information could also be used when the
user is inputting reporting information; giving them the contextual clues that Apple is
indeed the correct manufacturer they want to select based on other related products they
offer. If all relations of two way, then the division of BT and NT directionally is no
longer required and can be represented by a single traversable link relating two nodes.
Thus, for the rest of this chapter, the labels BT and NT will indicate unidirectional relations, while all unlabeled links between nodes will be bidirectional; an example of this can be seen in Figure 4-2. This convention allows us to define traversal restrictions within the domain knowledge tree. For example, we would normally want to allow searching of other sub-branches of the tree to compensate for user input error, but this would not be desired when traversing up to “Cellphones”, since that changes to a different major classification within the domain. There may be rare cases where this transition is allowed, but it results in effectively ignoring all of the user's input.
While not all related terms will be aliases for other nodes, they can also indicate other categories that relate to the parent node in other parts of the tree. The best example of this is that manufacturers such as Apple don’t solely make cellphones, so there will exist multiple “Apple” nodes in subcategories such as Computers, Electronics (MP3 players), etc. In this case the related terms infer other products that the company makes without merging those portions of the tree and destroying the hierarchy. It is also very important how these relational links are traversed, particularly in the case of related terms when the
related node doesn’t exist within the same branch of the hierarchy. These relations may be best served to give the user context as to which node they are traversing to, such as giving a list of subcategories of products that the company offers in order to give them contextual clues.

Figure 4-1 – Fully annotated relations between nodes of the domain information tree.

Figure 4-2 – Limited annotated relations between nodes of the domain information tree.
Given that we now understand how we can use this information, we must discuss how we can retrieve user input as well as populate the nodes within the domain knowledge tree itself. One of the cornerstone concepts of this system is that we have complete control over the reporting element. This means that we can have strict control over how the user inputs information as well as the format of this information. Having the user input in a fixed format with only selectable options during the reporting process removes the requirement for natural language processing and cuts down the requirements for thesaurus or alias matching.
When considering what input information is needed from a user’s report, it can be divided into the following two categories: report information and product information. Report information is information that is required for all user reports regardless of domain, such as the date and time the item was stolen, the location, etc. Product information is domain-specific information, where the amount of input and the input formats are domain dependent, specified in the domain knowledge tree. This allows the report information to be formatted statically while the product information is formatted dynamically in relation to how the user is prompted to enter information.
Given that there is a variable amount of information that we want the user to input, and that we want them to input as much information as possible, we must structure the input format such that it is logically presented and not overwhelming to the user. We prompt the user for product information relevant to a specific node before allowing them to traverse down the tree. This also allows the domain information to be requested per node based on the user’s response, while maintaining control of the structure of the input data. In Figure 4-3 we can see how the user is prompted for more information based on the previously selected input, each step expanding the required information. User reported cases can be stored in one table containing both report information, such as date, time, and location, and product information; specifically, an index to the node within the domain knowledge tree that the report references.
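The single-table storage of reported cases described above could be sketched as follows; the field names and types are assumptions for illustration.

```python
# Sketch of a stored reported case: static report fields plus an index into
# the domain knowledge tree. Field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ReportedCase:
    report_id: int
    stolen_at: str   # date/time of the theft, e.g. ISO 8601
    location: str    # where the theft occurred
    node_id: int     # index of the referenced domain knowledge tree node

case = ReportedCase(report_id=1, stolen_at="2014-05-05T20:00",
                    location="Toronto", node_id=42)
```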
Now that we have an understanding of how the user will be prompted to submit their information, we need to address how the domain knowledge will be imported and how much of this domain information will be used to obtain user input. There are a few different ways in which this can be done: manually creating the hierarchy from
a domain expert, automatically parsing manufacturers’ websites to extract product information, or leveraging the tree from the previous classification process.

Figure 4-3 – Displaying the user input process as it traverses down the domain information tree (Root → Cell Phones → Apple → IPhone 4/4s/5/5c); on the left is an example of the user interface and on the right is the traversal within the domain information tree.

While it is
clearly not ideal to use a domain expert to create the hierarchy, due to the increased overhead and required human intervention, in some cases it may be required, and it would likely be beneficial for an expert to review and correct an automated approach. Based on this review, an automated approach could then be improved to address discovered issues, lessening the system's dependency on a domain expert. However, the automated approaches unfortunately form a dichotomy: parsing manufacturers’ websites will yield an extraordinary amount of information about the product, but much of this information will be unused or implicit to the product anyway. In contrast, leveraging the previous section’s classification hierarchy will extract the information that is used in a post on average, as this is how the classification process is derived, but our knowledge of the domain is then clearly incomplete, and additional information may be beneficial in some cases. These two approaches represent the two extremes of information depth, complete vs. minimal, neither of which is appropriate for the system. While completeness seems like the better approach, we must consider the user entering information into the system; if they are required to give detailed information about their product that is infeasible for a user to know, then this becomes a usability issue, especially when the requested information is inherent to the product anyway.
This usability issue could be addressed by making some input fields optional; however, some of the control over the inputs is lost with no real gain, as the fields people don’t know would be left blank anyway. This problem is compounded in either case when users attempt to give the system more information about their product than they actually know, resulting in users inputting false information that would incorrectly classify the products in the report. Some of these issues can be resolved by creating a hybrid of the two approaches: requiring user input for fields that are known to the classification system, while allowing optional input for any additional fields that are created from the manufacturer's specifications. This doesn’t address the issue of users entering misinformation by mistake, which is left for future work.
4.2 Domain Knowledge Structure
Domain knowledge about the reporting system can be stored in any manner; however, a database approach was chosen to allow logical and intelligent queries to directly access specific nodes. Having the design follow a tree structure, with a table for nodes and a table for links, provides the additional benefit of allowing recursive queries to be built that gather entire subsections of the tree, while also allowing post indexing and searching to be processed within the database. This is beneficial because the processing load can be offset to another server or cluster of servers, making the framework modular with respect to the classification and reporting system, and allowing distributed domain expert classifiers to map to a common reporting system.
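The nodes/links design and the recursive subtree query described in this section could be sketched with an in-memory SQLite database; the table and column names are assumptions for illustration.

```python
# Sketch of the nodes/links tree tables and a recursive query that gathers
# an entire subsection of the tree; schema names are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE links (parent INTEGER, child INTEGER);
INSERT INTO nodes VALUES (1,'Cell Phones'),(2,'Apple'),(3,'IPhone 5'),(4,'Blackberry');
INSERT INTO links VALUES (1,2),(2,3),(1,4);
""")

def subtree(con, root_id):
    """Recursively resolve all downstream nodes of root_id, keeping a
    level value so the location within the tree is maintained."""
    return con.execute("""
        WITH RECURSIVE sub(id, level) AS (
            SELECT ?, 0
            UNION ALL
            SELECT links.child, sub.level + 1
            FROM links JOIN sub ON links.parent = sub.id
        )
        SELECT nodes.name, sub.level FROM sub JOIN nodes ON nodes.id = sub.id
    """, (root_id,)).fetchall()

rows = subtree(con, 1)  # (name, level) pairs for the whole subsection
```

The level column here corresponds to the location information mentioned later for Figure 4-5: adding it to the recursive query preserves each node's depth beneath the selected root.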
The first database would store all the report metadata, specifically the date, time, and location of the theft. Each of these elements would require a unique identifier to order them consistently within the generated reporting page, along with a system name, a user-friendly name, input parameters/restrictions, loaded modules, and module arguments. Both the system name and the user-friendly name are required, since metadata about the attribute, such as UTC or GMT, can be stored within the system name; this allows time zone conversions from the poster's local time to the consistent time format used within the database. The input parameters field acts as the source of input sanitization, applied both client and server side, such that if freeform input is allowed it is restricted based on conditions specified in this field. Additionally, the module name field is in place for any 3rd-party module that may be required for interactive elements, such as a calendar for selecting the date rather than confusing the user about the date format, along with any module arguments that may be required. An example of this can be seen in Figure 4-4 but could easily be extended based on the system requirements.
Figure 4-4 – An example of the table structure and potential table input for report information attributes.
This may seem unnecessary, but similar concepts are required for all domain information to keep it robust and extendable. A recursive query can be used to resolve downstream branches of the tree, resulting in the following sub-nodes being resolved along with their respective attributes; this can be seen in Figure 4-5.
This would then be extended to retrieve the domain information regarding those specific attributes; for example, before transitioning down to more specific attributes related to “IPhone” or “Blackberry”, the user would be prompted to enter the generic smartphone attribute information before continuing. Although not displayed in Figure 4-5, information regarding the location within the tree can be maintained in the query by adding a level value along with the parent node information. A visual representation of the tree from which the previous tables were designed can be seen in Figure 4-6.
Figure 4-5 – This result displays the input requirements based on which node is selected; for
example the color attribute must be specified if the smartphone node is traversed.
Figure 4-6 – The domain knowledge tree that was used for Figure 4-5.
4.3 Searching Process & Indexing Classified Posts
Although we’ve previously classified the online posts, they are not classified or indexed to the same degree as reported cases; as such, we cannot directly compare a reported case to online posts. This is because the broader classification patterns derived from the online post clusters are used to classify online posts, while more precise domain knowledge is used to classify reported cases. We must either attempt to leverage the domain knowledge to further organize the online posts into the same structure as our reporting system's domain information, or we must match the data in the reporting system to the online posts. As discussed in the previous chapter, the classification system cannot directly use the domain knowledge tree due to its lack of flexibility when dealing with online text; as such, the extracted classification tree must be stored separately to maintain the integrity of the imported domain knowledge.
This can be achieved by indexing all classified posts to the respective node within the reporting system tree; or, if the tree structure differs significantly between the generated classification tree and the imported domain information's tree, by indexing the posts to the classification tree and keeping a mapping table between the two trees. Perfectly classifying online posts into the domain information tree structure may not be possible, or may result in most posts only being indexed to major nodes such as “IPhone 5”, which is no better than their previous classification. This is largely due to the lack of information present in online posts; many distinguishing features are most often not advertised. However, other attributes can be indexed based on report metadata, since that will be the largest distinguishing factor matching the reported case with posts. We will check items within the closest proximity and time to the theft, later checking other related nodes, such as other versions of IPhones, in the event that the item was misclassified.
This allows all sub-classification posts to also be compared, while optionally moving upstream a limited number of nodes to compensate for user misclassification. While completeness would dictate that we should traverse up to the root node of the domain,
doing so is both impractical and infeasible, as it would ignore both the user's submitted data and the classification process in order to compare the user's report to every post within the domain. That is to say, we must bound the trustworthiness of the user's input to an extent; we should be able to assume they got the manufacturer correct, and possibly the model as well. This can be done simply by lowering the system confidence in potential matches for each upward traversal, or by explicitly restricting upward traversals with NT unidirectionality for child nodes beneath key root nodes. An example of a formula that could be used for system confidence relative to traversals can be seen below, while a visual representation can be seen in Figure 4-7:
Confidence = 100 − 2^(n+3)   (7)

where n is the number of transitions currently performed and the maximum number of transitions is 3.
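A small executable sketch of Equation (7) as read here, Confidence = 100 - 2^(n+3); this reading of the exponent is an assumption, chosen because it keeps the confidence between 0 and 100 for the stated maximum of 3 transitions.

```python
# Sketch of the traversal-confidence formula, Equation (7); the exponent
# form 2**(n + 3) is an assumed reading of the thesis's formula.
MAX_TRANSITIONS = 3

def traversal_confidence(n):
    """Confidence (%) after n upward traversals; None once the cap is exceeded."""
    if n > MAX_TRANSITIONS:
        return None
    return 100 - 2 ** (n + 3)
```

Under this reading, n = 0..3 yields confidences of 92, 84, 68, and 36, a steepening penalty for each traversal away from the node the user selected.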
Figure 4-7 – A graph of Equation (7) displaying the confidence as the number of transitions increases.
Following the example above, let’s assume a user reports that a 16GB silver IPhone 5s was stolen in Toronto around 8pm on May 5th. Although the user would also be required to give additional specifications, for the purpose of this example we will ignore those inputs and assume they are not utilized in the search process. The system would attempt to match this report against posts classified into the same node in the tree (IPhone 5s Silver 16GB), checking that the date and time of the post are after the time of the theft (8pm May 5th) and that the post is selling the item in relatively close proximity to where the theft occurred (in the Greater Toronto Area). As previously discussed, the post's date and time should always be after the time of the theft unless the theft was premeditated, which will not be the case for a majority of thefts. The difference between the time of theft and the time of posting should also be minimal; as this time increases, the probability of the post being the correct match decreases. As for the selling location, it would not be impossible for the item to be sold in farther locations such as Kitchener or Hamilton, but as the distance increases the probability of a match decreases exponentially. If no matches are found within the surrounding area, or the confidence of these matches is very low (i.e., the distance or time variance is very large), the searching process repeats from the parent node; in this case it would repeat from the “IPhone 5s Silver” node. This process can search all other child nodes or can exclusively search the parent node, based on system settings; we can choose to inclusively search these nodes, or assume the information that resulted in a different classification, such as color, is correct and sufficiently excludes them from being candidate matches. This process would repeat until it reached the “IPhone” node, and would terminate after searching that node, because any further upward traversal would no longer reference any user input. Once the searching process ends, the results are returned to the authorities; a detailed discussion of what information can be returned to the user can be found at the end of this chapter. It should be noted that although the system settings determine the inclusivity of other child nodes after an upward traversal in the domain knowledge tree, the searching process should be inclusive for all nodes below the node to which the user's reported case was indexed.
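The search walk-through above can be sketched as follows; the data structures, the proximity helper, and the node names are illustrative assumptions, and the time and distance filters are reduced to simple predicates.

```python
# Sketch of the upward-traversing search: match posts at the indexed node,
# filter by time and proximity, and retry from the parent node when no
# acceptable candidates are found. All names and data are invented.

def search(reported_node, posts_by_node, parent_of, theft_time, theft_place, near):
    """Return (node, candidate posts), traversing upward on empty results."""
    node = reported_node
    while node is not None:
        candidates = [p for p in posts_by_node.get(node, [])
                      if p["time"] > theft_time and near(p["place"], theft_place)]
        if candidates:
            return node, candidates
        node = parent_of.get(node)  # repeat the search from the parent node
    return None, []

posts = {"IPhone 5s Silver 16GB": [],
         "IPhone 5s Silver": [{"time": "2014-05-06", "place": "Toronto"}]}
parents = {"IPhone 5s Silver 16GB": "IPhone 5s Silver",
           "IPhone 5s Silver": "IPhone 5s",
           "IPhone 5s": "IPhone",
           "IPhone": None}
node, found = search("IPhone 5s Silver 16GB", posts, parents,
                     "2014-05-05", "Toronto", near=lambda a, b: a == b)
```

A full implementation would additionally lower the match confidence on each upward step, as in Equation (7), and stop at the “IPhone” node rather than walking all the way to the root.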
4.4 Updating Domain Information
As new models are released for a given domain, the information in the reporting
system’s domain information tree will become outdated. There will naturally be a delay
between the time the item is released to the public and the time the classification
information is added to the reporting system.
First, let’s consider a scenario where the domain information is not updated as new models are released. If we look at an existing manufacturer releasing a new model, such as the Apple IPhone 6, when posts containing this classification are mapped to the domain information tree there is no matching child node beyond “IPhone”, and thus those posts would be indexed to the “IPhone” node. This causes a lot of structural information to be lost during the mapping process and results in a large number of posts being mapped to the manufacturer node or nodes directly below it. The issue becomes clearer if we look at manufacturers that don’t have a logical naming convention for their models, which would result in new models being mapped directly to the manufacturer node. If we consider new manufacturers entering the market, such as HTC, when posts containing this classification are mapped to the domain information tree there is no match at all within the “cellphone” sub-tree. This would result in posts being inappropriately mapped to the “cellphone” node, and, due to the searching limitation previously introduced (transitions up to the “cellphone” node are restricted), these posts become inaccessible to searches.
Looking at these scenarios from the user’s perspective, the user will be looking to select a manufacturer or model that doesn’t exist within the list. Even if we allow the user to manually define the manufacturer or model they wish to report, which sacrifices control over the user input, there still exists the problem that the system lacks domain information for the user's desired input. If we simply stop the user input at the point where the domain information ends, we will have very little information about the stolen item; however, we lack the specific domain information to continue accepting user input.
From this scenario we can see that lacking domain information causes issues, and it also introduces the following problems: how can we handle user input for items that don’t exist within the domain information tree, and how can the system determine when a domain knowledge update should be requested? These issues are related, but they are associated with different viewpoints; the first is a user-centric problem while the second is a system-centric problem. To deal with the first issue, the system should be able to accept reports for items it doesn’t have domain information about, as otherwise the usability of the system is drastically lowered by being unable to handle new items. The system must allow the user to submit their own information about an item that it has little to no knowledge about. As mentioned previously, allowing free-form input from users is not desired; however, the system does have metadata about where the item should be located.
For example, if a user is attempting to report a new model of IPhone, they will naturally traverse down the smartphone and Apple branches. Where the user stops is important, as it gives us an upper bound on the scope of parameters we want to look at. From this point we would want to collect the minimum amount of information from the user and request more information once the domain knowledge is updated, in a way handling the user's report as best we can with no actual domain information about the item. A minimalistic approach is more beneficial since it reduces the chance of user error while also reducing the amount of user input that must later be verified.
An example of how this would be triggered can be seen in Figure 4-8. The user would be presented with input fields similar to those handled previously, but these fields would be generated using the highest-probability subcategories of the parent node where they stopped. In this example we would compare all the sub-nodes and attributes of all the other models of IPhones in an attempt to find information potentially related to their item. This would not work if the user stops very high up in the tree, say at the “Apple” node, because that would imply a new product line with likely unique characteristics rather than a small variation such as a new model. An example of how the attributes would be generated can be seen in Figure 4-9. However, attribute prediction can be wrong unless older models' attributes are weighted less than newer ones; with equal weighting, or depending on the scale of the weighting, the color attribute is incorrectly predicted in the example above.
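The attribute prediction idea, including the recency weighting suggested above, could be sketched as follows; the sibling data and the weighting function are invented for illustration.

```python
# Sketch of attribute prediction from sibling models: aggregate attribute
# values across siblings, weighting newer models more heavily so stale
# values (e.g. discontinued colors) rank lower. All data is invented.
from collections import defaultdict

def predict_attributes(siblings, recency_weight):
    """siblings: list of (age_rank, attributes dict); lower rank = newer.
    Returns, per attribute, its values ordered by weighted score."""
    scores = defaultdict(lambda: defaultdict(float))
    for age_rank, attrs in siblings:
        w = recency_weight(age_rank)
        for attr, values in attrs.items():
            for v in values:
                scores[attr][v] += w
    return {attr: sorted(vals, key=vals.get, reverse=True)
            for attr, vals in scores.items()}

siblings = [
    (0, {"size": ["16GB", "32GB", "64GB"], "color": ["Silver", "Gold"]}),  # newest
    (1, {"size": ["16GB", "32GB", "64GB"], "color": ["Black", "White"]}),
    (2, {"size": ["16GB", "32GB"], "color": ["Black", "White"]}),
]
predicted = predict_attributes(siblings, recency_weight=lambda rank: 1.0 / (rank + 1))
```

With this weighting, the newest model's colors outrank the older ones, avoiding the incorrect color prediction that equal weighting would produce in the Figure 4-9 example.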
To address the issue from a system perspective, knowing when to request new domain information can be done a few ways. If the system receives a large number of reported cases for items that it doesn’t have domain knowledge for, it may be an indication to initiate a domain knowledge update. A slightly more pre-emptive approach would be to initiate an update as new classification trends are noticed under nodes within the domain knowledge tree (such as IPhone). Both triggers may be desired due to the delay in emerging classification patterns. Domain information can be updated using a similar approach to how the domain information was initially input into the system. While new branches of domain information can be handled exactly as they were during the initial domain information import, which would be the case for new product lines, related sub-branches of domain information can be handled in a slightly more intuitive manner. Similar to how the user was prompted with derived characteristics of other sub-branches, this information can be used to filter manufacturer specifications down to a reasonable level, such that they are in line with the amount of detail requested in other nodes.
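The two update triggers could be combined in a single check such as the following; the thresholds and counters are illustrative assumptions.

```python
# Sketch of the two domain-update triggers: a volume of reports for unknown
# items, and emerging classification trends under an existing node.
# Threshold values are illustrative assumptions.

def needs_domain_update(unknown_report_count, new_trend_counts,
                        report_threshold=25, trend_threshold=10):
    """Return True when either trigger fires.

    unknown_report_count: reported cases for items the tree has no knowledge of.
    new_trend_counts: {candidate node name: posts seen in the new trend}.
    """
    if unknown_report_count >= report_threshold:
        return True  # reactive trigger: many reports for unknown items
    # pre-emptive trigger: a new classification trend under a known node
    return any(c >= trend_threshold for c in new_trend_counts.values())
```

For example, needs_domain_update(3, {"IPhone 6": 12}) fires on the pre-emptive trend trigger even though only three unknown-item reports have arrived.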
Figure 4-8 – An example of how users would select the “Other” option should their desired category not exist within the domain knowledge tree.
Figure 4-9 – An example of how attribute prediction would function; predicting the model sizes
correctly but predicting the model colors incorrectly.
4.5 Periodic Re-checking of User Reported Cases
The final aspect of this system that must be discussed is how often users' reported cases should be re-evaluated against the classified posts. This can be done with either an active approach or a passive approach: by comparing newly classified posts as they enter the system, or by waiting an interval of time before generating a comparison against the current classified posts.
The largest benefit of an active approach is that it gives real-time results while only comparing the reported cases with new posts. The disadvantage is that newly classified posts entering the system may unfortunately be compared with domain knowledge that is incomplete at the time. This would result in inaccurate comparisons between posts and reported cases, and also introduce inconsistencies in the relative confidences between posts.
This issue is solved by a passive approach, since there is a delay between posts being newly classified and the reporting system running the comparison process. This, however, results in a delay in the comparison, which is undesirable for time-sensitive applications such as selling stolen goods. When considering a passive approach, determining an appropriate amount of time that newly classified posts should be held before running a comparison is both difficult and domain specific. Given that an active approach allows for real-time analysis, it makes more sense to use this approach and attempt to address the issue of inconsistent domain information during the comparison process. Although this approach provides real-time analysis, it comes at the expense of a slight increase in processing overhead and a much larger storage overhead; each reported case must maintain a list of posts that have already been matched, along with each match's confidence and the domain knowledge version or date of comparison. Storing the domain knowledge version or date of comparison addresses the issue of using different domain information to compare different posts, allowing older posts to be re-compared as the domain information changes, maintaining consistent comparisons to reported cases.
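The per-case bookkeeping described above, matched posts stored with their confidence and the domain knowledge version used, could be sketched as follows; the names and structure are assumptions.

```python
# Sketch of per-case match bookkeeping for the active approach: each reported
# case records matched posts with their confidence and the domain knowledge
# version used, so stale comparisons can be redone. Names are assumptions.

def record_match(case_matches, post_id, confidence, domain_version):
    case_matches[post_id] = {"confidence": confidence,
                             "domain_version": domain_version}

def stale_matches(case_matches, current_version):
    """Posts compared under an older domain knowledge version,
    queued to be re-compared for consistency."""
    return [pid for pid, m in case_matches.items()
            if m["domain_version"] < current_version]

matches = {}
record_match(matches, "post-17", 84, domain_version=3)
record_match(matches, "post-42", 61, domain_version=4)
```

When the domain knowledge advances to version 4, only "post-17" here would be re-compared, keeping all stored confidences consistent with the current domain information.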
Although the storage overhead is rather large, reducing it would increase the processing overhead, which is more valuable. This storage overhead does raise the question of when a reported case should be abandoned. Ideally, reported cases would find appropriate matches and exit the system, but for reported cases where no acceptable match can be found in a reasonable time, there must be a way for these cases to exit the system so that they don’t cause unnecessary overhead. As the confidence of a match depends on the difference between the time of the theft and the time the post was made, after a long enough period even a perfect match would have very low confidence; potentially so low that it is indistinguishable from noise or extremely poor comparisons. Although this timeframe may be domain dependent, it would be reasonable to assume that if a case has not yielded a reasonable match after 2-3 months, it is unlikely it will. Additionally, any inferred information about the relationship between the post and the theft degrades over time; for instance, proximity can no longer be guaranteed. At this point it can be concluded that the item was stolen for personal reasons, the item was sold through a different medium, or the item was sold online and the system was unable to identify it. Potential reasons the system may have been unable to identify the matching post include the item being sold on a site not referenced by the framework, the lack of domain information at the time of comparison, the lack of parameters that would have identified the match, or the authorities being unable to pursue the lead after the matching post was correctly identified.
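The expiry rule could be sketched as a simple age check; the 90-day cutoff reflects the 2-3 month figure above and would in practice be domain dependent.

```python
# Sketch of the case-expiry rule: abandon a reported case once enough time
# has passed that any match confidence would be indistinguishable from noise.
# The 90-day cutoff is an illustrative assumption.
from datetime import date, timedelta

def case_expired(theft_date, today, max_age=timedelta(days=90)):
    return today - theft_date > max_age
```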
4.6 Experiments
4.6.1 Experiment – Active Search using the Reporting Database
In this experiment we wanted to determine how strong a match could be made with reported cases of stolen property. As such, we must construct reported data and either explicit or derived domain knowledge about the topic; for this experiment we use explicit domain knowledge of various phone brands, models, and specifications. This information is used to compute the percentile match between the reported data and the set of posts, lowering the system's confidence should the post contain multiple classification patterns. This comparison is done by simple keyword matching to reduce complexity; however, further NLP techniques could be applied to improve the accuracy of the comparison. Additionally, a 100% match enforcement policy is used to reduce the returned results. For this experiment we simulate the reported data using the following stolen property:
Type: Cell Phone
Manufacturer: Research in Motion
Model Family: Blackberry Bold
Model: 9900
Carrier: Bell
Color: Black
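The simple keyword matching with the 100% match enforcement policy could be sketched as follows; the report encoding and the sample post text are invented for illustration.

```python
# Sketch of the experiment's keyword matching: score a post by the fraction
# of report keywords it contains, and enforce a 100% match before returning
# it. The report encoding and sample post are invented.

report = {"type": "cell phone", "manufacturer": "research in motion",
          "family": "blackberry bold", "model": "9900",
          "carrier": "bell", "color": "black"}

def keyword_match(post_text, report):
    """Fraction of report keywords found in the post text (1.0 = full match)."""
    text = post_text.lower()
    hits = sum(1 for value in report.values() if value in text)
    return hits / len(report)

post = ("Selling my black Blackberry Bold 9900 on Bell, "
        "made by Research in Motion, great cell phone!")
score = keyword_match(post, report)  # 1.0, so it passes the 100% policy
```

This naive substring matching mirrors the experiment's deliberately simple comparison; as noted above, NLP techniques could replace it to improve accuracy.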
The results shown in Figure 4-10 are prototype output from matching posts against the criteria above, referencing the local copies of the posts without ranking them by percentage match. On the left-hand side is the price extracted from each post, followed by the location of the stored post.
Figure 4-10 - Results of the search process; displaying a rudimentary match with the posts price on
the left followed by the local location of the matching post on the right.
Based on these returned results, a human would need to analyze the posts to determine whether there was a sufficient match; however, this list could be further reduced by incorporating other primary and secondary attributes. At this point the system would need to be refined like any AI system, by recording which results the human determined were correct matches and which were incorrect. It should be noted that even if all the posts match, there will still be some decision criteria that the system is unaware of, which is why a feedback loop is required to help identify them.
4.7 Discussion
Another issue that we discovered is to what extent information should be presented to the user submitting the reported case. Although law enforcement agencies strongly advise people not to attempt to recover their stolen property without the assistance of the police, occasionally people still attempt to forcefully recover their property. As such, what information the reporting system displays is important, so that it does not act as a conduit for people to track down their stolen items when we cannot guarantee these are the matching stolen items. Thus there must exist a balance between presenting the user with enough information that they can confirm a match for the police, while not allowing them to contact the seller or retrieve the original post. Obfuscating a post's content before displaying any results to the user would be difficult but is required. Displaying any unique images to the user would be ideal, since these can easily be used to offload the refinement process from the authorities to the users; however, searchable unique characteristics such as names, contact numbers, or identifiable spelling mistakes or strings must be either obfuscated or redacted prior to being displayed. This topic requires additional research beyond the scope of this thesis; as such, it is left for future work.
Chapter 5
Framework Portability and Applications in Other Domains
In this chapter we will discuss how this framework can be applied to other domains and what rules must be altered to achieve similar results in these domains. The two domains we will look at are rental property scams and metropass scams.

In order to understand how these domains relate to the previous work, and why extending the framework to them is important, we will look at a couple of examples. First, consider a person looking for a vacation rental property; they would begin by looking for rental properties in their vacation destination. They may use a reputable site and look at reputable areas within the city, but ultimately they must contact the person posting the advertisement, either online or by telephone. They may ask for additional information regarding the property and, should both parties be satisfied with the agreed rates, the owner would most often ask for a deposit to reserve the property for the given time. Up to this point the prospective renter has assumed the person they contacted is the legitimate owner; they may even have asked specific questions in an attempt to confirm this, but they will not be able to confirm it until visiting the property, which would only occur at a later date. This scenario is the essence of the scam: the prospective renter must give the contact a deposit before being able to authenticate them as the legitimate owner.
The second example is a potential metropass scam, where a user is looking for a metropass at a reduced rate. First we must consider how a legitimate seller would advertise their metropass: they would likely set a reduced price in comparison to the retail market price in an attempt to sell the item and recover some of their initial investment. They may be very vocal about their eagerness to sell the item due to its time-sensitive nature, since its value decreases over time and it has a fixed lifespan. Advertisements for such items would also be very non-descriptive, since much of the information regarding the item is simple or implied. From the buyer's perspective, they cannot authenticate the item until they purchase it and attempt to use it. It is reported that fake metropasses cost the Toronto Transit Commission (TTC) close to 2 million dollars in 2012 [20], which shows the growing problem of scams in our society.
Both of these examples share the fact that an initial investment must be made before the item or property can be authenticated. These two domains also have the interesting characteristic that no one is actively searching to report these posts, and reporting them is much more difficult and at the discretion of the site. While people may attempt to report rental property scams involving their own properties, it is rather cumbersome and most often involves simply using Google to search for their address; this must be done periodically and doesn’t guarantee that every post will be caught. Similarly, metropass scams often go unreported because the transit provider lacks the manpower to actively search for these scams, and fraudulent costs are most often offset in the base ticket price or absorbed by the consumer who bought the scam metropass should it not work. This makes these domains, and domains that are easy to scam in general, very lucrative targets with no fixed end point; that is to say, we must constantly search for newly matching posts in order to report them, and no case can ever be closed.
Because scams offer such high reward for very little risk, it is not surprising that
there are a large number of scams in any given domain; as such, fitting this framework to
detect scams is valuable from both a social and a criminological standpoint.
Monitoring suspicious post activity, especially if the advertising websites are involved,
can enable criminal profiling based on linguistic analysis, post modification, and
movement within the site.
One of the main objectives of this research was to develop a domain-independent
framework that could easily be translated into any domain with very little effort. This is
one of the main reasons we have attempted to avoid a heavy dependence on domain
knowledge. This allows us to apply the approach to many other domains, notably scams
in which the buyer or the manufacturer is the victim. Some of the assumptions used to
detect potentially stolen items are no longer valid; for example, the eagerness to sell may
no longer be present, because the risk is lower when the item is not stolen. Two domains
that we want to test are metropass scams and rental property scams. While metropass
scams follow the previous framework very closely with respect to the fixed description
of the item, rental property scams require far more categorization than the previous
framework provided. We will not be discussing a complete framework in this section, but
instead how the previously described framework would be modified to tackle the target
domains.
5.1 Metropass Scams
We will first discuss detecting metropass scams, since this domain more closely
follows the structure of stolen property detection. Although the discussion focuses
primarily on Toronto metropass scams, it is comparable to any public transit system or
other fixed-service-oriented system.
5.1.1 Ad retrieval
This subsection identifies which sites are good candidates for harvesting posts to
analyze. As in the previous framework, sites such as Craigslist, Kijiji, etc. are all
good sources for metropass scams, since their intended audience is other customers and
the market demand for the item is not large enough to merit a dedicated site. On
Craigslist, the subcategories of tickets and general were the most common classifications
for metropasses.
5.1.2 Primary and Secondary Attributes
This subsection identifies the attributes that can be used during the suspiciousness
classification process. Unfortunately, due to the nature of the problem, many primary
attributes are no longer valid because they are tied to the event of a theft; for example,
date/time and proximity are no longer relevant, leaving only the price attribute.
Additionally, many if not all of the secondary attributes no longer apply. Since the
condition of the item is irrelevant, photos are unlikely to be included, and if they are, we
can expect stock images in the majority of cases. The seller's eagerness to sell will
likely also be very unreliable given the nature of the domain: since the potential risk is
drastically reduced for the scammer, we would expect their eagerness to be on par with
or below that of legitimate posts, because legitimate sellers actually face a potential
financial loss if there is no sale within a period of time. Finally, duplicate contact
information on multiple posts would normally indicate a potential scam, because the item
remains on the market for a prolonged period of time; however, it is unlikely the
scammer will make duplicate posts, since they can simply "sell" the item to every person
that contacts them and the item will never change.
5.1.3 Experiment
In our experiment we looked at 40 posts for Toronto metropasses on Craigslist
and Kijiji, comparing the asking price to the controlled market price after adjusting for
the amount of time left on the pass. Price histograms of this ratio can be seen below
in Figure 5-1 and Figure 5-2.
[Figure omitted: price histogram; x-axis: ratio of asking price to market price (70% to 105% and above, in 5% intervals); y-axis: number of occurrences.]
Figure 5-1 - Metropass price histogram using 5% intervals.
[Figure omitted: price histogram; x-axis: ratio of asking price to market price (55% to 105%, in 2.5% intervals); y-axis: number of occurrences.]
Figure 5-2 - Metropass price histogram using 2.5% intervals and extended lower bound.
These results were unexpected and differ from other domains in that the pricing
is very close to the market price. Due to the lack of data we did not differentiate
regular metropasses from student metropasses, since their target market would be the
same regardless; had this differentiation been made, the prices would have been
even more heavily skewed towards the fixed market price. We assume that the selling
prices were very high either because sellers anticipated haggling or simply because
potential buyers may have no alternative. Since this is a transportation service, it can be
assumed that potential buyers must use the service regardless and must either buy weekly
passes or a new monthly pass in the middle of the month; as such, more aggressive
pricing models can be used. Given these results, it is clear that price detection cannot be
reliably used in all domains, and in this case derived suspicious keyword strings would
likely be required. Upon reviewing the outlying posts at the lower end, both appear
within the expected range of a traditional domain's market price distribution and
displayed no characteristics of suspicion.
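The price comparison used in this experiment can be sketched as follows. This is a minimal illustration rather than the thesis implementation; the pro-rating formula, the sample posts, and the binning details are all assumptions made for the example:

```python
from collections import Counter

def price_ratio(asking, market_monthly, days_left, days_total=30):
    # Asking price relative to the market price pro-rated by remaining time.
    return asking / (market_monthly * days_left / days_total)

def ratio_histogram(ratios, bin_width=0.05):
    # Count ratios into fixed-width bins keyed by each bin's lower bound.
    return Counter(round(int((r + 1e-9) / bin_width) * bin_width, 4) for r in ratios)

# Hypothetical posts: (asking price, monthly market price, days left on pass).
posts = [(95.0, 100.0, 30), (70.0, 100.0, 21), (98.0, 100.0, 30)]
hist = ratio_histogram([price_ratio(a, m, d) for a, m, d in posts])
```

Anomalies would then show up as bins well below 100% of the market price; in the metropass experiment almost all of the mass sat near the top bins instead.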
5.2 Rental Property Scams
We will now discuss detecting rental property scams. This domain differs more
from the previous framework and requires more initial domain knowledge.
5.2.1 Ad Retrieval
This subsection identifies which sites are good candidates for harvesting posts to
analyze. Unlike the previously discussed domains, which used generic classified ad
sites such as Craigslist or Kijiji, the property rental and vacation rental domains are a
large enough industry that dedicated sites exist. Sites such as homeaway.com and
vacationrentals.com can be used in addition to the previously mentioned vacation rental
subsections on classified ad sites.
5.2.2 Classification
This subsection discusses how classifications can be derived. For the
classification process to function properly, initial domain knowledge is needed to
manipulate the posts so that proper classifications can be derived. One of the most
important pieces of information needed is geographical context, such as which cities
belong to which provinces/states and which provinces/states belong to which countries.
This information is required due to contextual implications inherent in the posts; for
example, users may simply advertise their address with the implication that it is in the
GTA when posting on the Toronto subdomain of Craigslist. Although this does not seem
very important, it is required for building a hierarchical tree within the classification
process.
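As an illustration, the hierarchy and the implicit-context fallback might look like the sketch below; the gazetteer contents, the function name, and the default-city rule are assumptions for this example, not part of the implemented framework:

```python
# Hypothetical gazetteer: country -> province -> cities.
GEO = {
    "Canada": {
        "Ontario": ["Toronto", "Oshawa", "Mississauga"],
        "Quebec": ["Montreal", "Gatineau"],
    }
}

def resolve_location(text, default_city):
    """Return (country, province, city) for a post, using a city named in the
    text when one is recognized and otherwise falling back to the implicit
    context of the subdomain the post appeared on (e.g. Toronto)."""
    lowered = text.lower()
    for country, provinces in GEO.items():
        for province, cities in provinces.items():
            for city in cities:
                if city.lower() in lowered:
                    return (country, province, city)
    # Implicit context: no city was named, so assume the subdomain's city.
    for country, provinces in GEO.items():
        for province, cities in provinces.items():
            if default_city in cities:
                return (country, province, default_city)
    return (None, None, default_city)
```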
The second most important piece of information needed is the format in which
these locations are identified. While many other domains can get away with simple
keyword extraction, this domain requires interaction with domain information to know
how to select these keywords. This can be done either with an n-gram selection model
after finding the largest classification of location, or with more complex rule sets that
must select an item from each location category.
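A longest-match n-gram selector of the kind mentioned above might be sketched as follows; the gazetteer entries and the whitespace tokenization are illustrative assumptions:

```python
KNOWN_LOCATIONS = {"quebec city", "north york", "toronto"}  # hypothetical gazetteer

def ngrams(tokens, n):
    # All contiguous n-token phrases, in order of appearance.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_locations(text, max_n=3):
    # Prefer the longest matching n-gram so that "quebec city" wins over
    # a bare unigram match such as "quebec".
    tokens = text.lower().split()
    for n in range(max_n, 0, -1):
        hits = [g for g in ngrams(tokens, n) if g in KNOWN_LOCATIONS]
        if hits:
            return hits
    return []
```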
5.2.3 Clustered Data
Once location knowledge is part of the classification process, we can expect
classifications based on country, province/state, and city. However, this classification is
not precise enough to accurately compare prices, due to the large variance in property
value across different areas of a city. We would thus have very large clusters in which
prices are not directly comparable, owing to a skewed market average and a high
standard deviation.

To address this, much more granular clustering must be done before comparing
prices; however, this clustering does not need to be done in the classification process.
Additional domain knowledge can be imported to indicate which areas, streets, etc. are
adjacent to one another, providing a graph onto which average prices and predictive price
gradients can be overlaid.
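Such an adjacency graph could be represented as simply as the sketch below, where a post's price is compared against a baseline drawn from its own area and the areas adjacent to it; the neighbourhood names and rates are entirely hypothetical:

```python
# Hypothetical adjacency graph of areas within one city, with average rates.
ADJACENT = {
    "Downtown": ["Midtown", "Harbourfront"],
    "Midtown": ["Downtown", "Uptown"],
    "Harbourfront": ["Downtown"],
    "Uptown": ["Midtown"],
}
AVG_PRICE = {"Downtown": 180.0, "Midtown": 140.0, "Harbourfront": 200.0, "Uptown": 110.0}

def local_baseline(area):
    # Average nightly rate over an area and its adjacent areas, giving a
    # granular local baseline instead of a skewed city-wide mean.
    areas = [area] + ADJACENT.get(area, [])
    return sum(AVG_PRICE[a] for a in areas) / len(areas)
```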
5.2.4 Primary and Secondary Attributes
This subsection identifies the attributes that can be used during the suspiciousness
classification process. As in the previous domain, the date/time and location
attributes are no longer applicable; however, many secondary attributes are applicable,
such as stock photos, duplicated descriptions, and contact information.

Since photos give potential renters a substantial amount of information about a
property, it is quite uncommon for an advertisement not to include a photo of the
property. Scam posts must therefore take the images they provide from somewhere, and
if a duplicate image is found then we know something is wrong with one of the two posts
containing it. A similar situation exists for the property description: since it is much
easier for the scammer to simply copy a user's description than to create a unique one,
large duplications in descriptions would likely indicate that the description is copied. As
in the stolen property domain, a user's contact information appearing on a large number
of posts is an indication that there may be an issue. This also allows for transitive
suspiciousness between posts, should the other posts by the user contain duplicated
descriptions or stock images.
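The transitive linking described above can be sketched as grouping posts by any shared feature (contact information, a description fingerprint, or an image hash); the feature encoding here is a simplification assumed for illustration:

```python
from collections import defaultdict

def link_posts(posts):
    """Map each post id to the set of other posts it shares a feature with;
    suspicion raised on one post can then transfer to its linked posts."""
    by_feature = defaultdict(set)
    for post_id, features in posts.items():
        for feature in features:
            by_feature[feature].add(post_id)
    linked = defaultdict(set)
    for ids in by_feature.values():
        if len(ids) > 1:
            for post_id in ids:
                linked[post_id] |= ids - {post_id}
    return dict(linked)
```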
5.2.5 Experiment
In our experiment we looked at posts for vacation rentals that included pictures,
taken from the Craigslist sites for most major cities in Canada. This involved analyzing
roughly 6000 images from 800 posts, hashing the images and mapping them to their
respective posts; 1130 images were found to be included in multiple posts. Given these
numbers we can tell that most posts do not contain duplicate images; however, based on
Figure 5-3 we can see that the majority of image duplication exists between two posts,
while higher degrees of duplication are also present.
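The hashing step can be sketched as below. We assume exact byte-level hashing here (a perceptual hash would additionally catch resized or re-encoded copies); the post structure is illustrative:

```python
import hashlib
from collections import defaultdict

def duplicate_images(posts):
    """Given {post_id: [image bytes, ...]}, return each image digest that
    appears in more than one post, mapped to the sorted list of those posts;
    the list length is the degree of duplication."""
    seen = defaultdict(set)
    for post_id, images in posts.items():
        for img in images:
            seen[hashlib.md5(img).hexdigest()].add(post_id)
    return {h: sorted(ids) for h, ids in seen.items() if len(ids) > 1}
```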
Many of the examined cases of 2nd-degree duplication were simply from users
recreating their posts after a few weeks; this subset is where we would expect to find
cases of vacation rental scams, although due to the volume of cases we were not able to
examine all of them. Larger degrees of duplication most often came from companies
advertising their rental property across multiple Craigslist regions, most often for
locations remote from or external to the advertising location.
Although this experiment does not directly show conflicting listings using the same
images, it does show the vast difference in how users behave across domains.
Users in this domain frequently recreate their posts, many times on multiple sites and in
adjacent locations; as such, identifying cases of rental property scams would require
associating unique images with specific users, which was not done in this experiment.
[Figure omitted: pie chart; duplication across two posts accounts for the majority of cases (58.85%), with progressively smaller shares for higher degrees of duplication.]
Figure 5-3 - Percentage of image duplications on multiple posts; the legend indicates the degree of duplication.
5.3 Discussion
These domain extensions can be applied in a slightly different manner than the
previously covered topic of stolen property detection and recovery, largely due to their
more abstract nature. In the domain of stolen property detection and recovery the
emphasis is on reactive recovery of the item, with limited preventative measures also in
place, whereas in the scam domains there is much more value in proactive measures to
prevent the sale. This makes the target audiences for the two domains different: scam
prevention mechanisms are much more beneficial for websites, while item recovery
mechanisms are much more beneficial to the police. These are not mutually exclusive;
websites benefit from the prevention and recovery of stolen items, especially from a
public relations standpoint, as well as from a reduction in customer complaints when
stolen items or scams are discovered, while law enforcement agencies will have fewer
reported cases of scams with prevention mechanisms in place. The approach can thus
serve different goals: websites can apply it to warn their users of potential scams, while
the police can apply it to track scammers across domains.
Unfortunately, all domains are subject to the arms race between law enforcement
agencies and criminals. This is especially true in domains such as scams, where there is
very low initial risk and moderate to high payoff, which makes defeating the
countermeasures in place a high priority. From a price standpoint, which is currently a
major attribute in the detection process, there exist end points beyond which prices are
not detected because they fall within the normal range. If the seller increases the price to
avoid detection, the item's price becomes too close to that of a reputable provider, and
potential buyers would choose the reputable provider instead. If the seller decreases the
price to avoid detection, the item's price becomes so low that people recognize it as a
potential scam, or the scam is no longer profitable in the case of counterfeit goods.
Because of this limitation, it is much more likely that technical components will be
targeted for exploitation so that scams are not classified as suspicious. Examples would
include attempting to confuse the classification system into misclassifying the item,
obfuscating the decision parameters, etc.
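The detection window implied by these end points can be expressed as a simple band check; the thresholds below are purely hypothetical and would have to be fit per domain:

```python
def price_suspicious(ratio, lower=0.30, upper=0.80):
    # Only asking/market ratios inside the band are flagged: above `upper`
    # the scam must compete with reputable providers, and below `lower`
    # buyers assume a scam (or a counterfeit stops being profitable).
    return lower <= ratio < upper
```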
Chapter 6
Conclusion and Future Work
In this chapter, we give a brief summary of the work done and future research
directions for this topic.
6.1 Conclusion
This thesis presented a complete stolen property detection and recovery
framework. First, we proposed a method of automating the collection of classified ads
from popular third-party sites such as Craigslist, Kijiji, etc., and processing them into
dynamic classifications based on the trends that exist in the data. This process addressed
some of the issues that can occur when attempting to crawl the target sites, introducing
caching and indexing techniques to reduce the overhead of the process. Second, we
designed a clustering scheme that analyzes posts to decide whether they should be
flagged as suspicious, and that analyzes clusters of suspicious posts to extract new trends
that can identify suspicious posts with more confidence. This portion of the proposed
framework takes advantage of the amount and nature of text in an online medium,
accounting for spelling and grammatical errors while still effectively extracting
classification trends; it also identifies and addresses the fact that we know very little
about quantifiable characteristics that can be used to identify stolen items online. Finally,
we proposed a reporting framework that allows users to submit reports of stolen items,
which are searched against suspicious posts in an attempt to find their stolen property;
results with sufficient confidence would be forwarded to the police. This approach takes
advantage of the structure and information in the domain knowledge, allowing user input
to be stepwise, while also addressing scenarios where domain knowledge is missing.
Compared to other attempts to solve this issue, the proposed method is comprehensive,
requires less human intervention, and is modular with respect to the target domain.
Initial experiments show promising results for both single and multiple classification
pattern accuracy, successful identification of suspicious price clusters, and rule
extraction from suspicious clusters. We also tested the framework's portability to the
domain of scams, specifically metropass scams and vacation rental scams. The
framework's application in other domains identifies how it must be altered to fit a new
domain, and the results identify fundamental assumptions that no longer hold in the new
domain compared to the traditional selling of stolen goods. We show the benefits to both
websites and law enforcement agencies should they implement this framework:
increasing websites' reputability while reducing the number of customer complaints
when stolen items or scams are discovered, and reducing the load on law enforcement
agencies by having an automated system in place to find the most likely lead.
6.2 Future Work
One of the more prominent directions this research has revealed is the ability to
use an AI model to extract trends from suspicious posts. Expert systems used to detect
fraud or other crimes are often very domain specific and require extensive training data;
the current system also requires a lot of training data, but that data can be acquired
automatically from online sources. This allows the system to be more flexible with
respect to its target domain while still extracting both general theft trends and
domain-specific theft trends. This flexibility also allows trends to be specific to certain
levels of the criminal hierarchy, since the identifiable characteristics change from
low-level thieves to professional fences.
Another area of interest is the idea of using reactive market trend analysis to
request or retrieve updates to domain expertise. Considering the issues discussed
previously, market prices can drop very rapidly around the announcement or release of
new models. Such a drop serves as an indication to request or attempt to retrieve
information about the new models at that time, thus speeding up the classification
process by resolving the lack of domain knowledge about new trends.
Additionally, the concept of secondary attributes was mentioned but not fully
explored in this thesis due to the difficulty of implementing them; they tie heavily into
the rule extraction and the accuracy of the suspiciousness classification. Many of these
approaches do not require the full application of NLP and constitute some of the human
logic subconsciously used in deciding whether an item is suspicious. Detecting the
seller's eagerness to sell was one of the more complex attributes, but also one of the
more promising and universal attributes for all domains where stolen items are being
sold.
Finally, one of the goals of this work was to extend the classification of suspicious
and non-suspicious data sets in order to train other classification and detection
frameworks. This can be used to bootstrap expert systems for emerging domains, owing
to its domain portability.
References
[1] Wikipedia contributors, “Fence (criminal),” Wikipedia, May 9 2014. [Online]. Available: http://en.wikipedia.org/wiki/Fence_%28criminal%29 [Aug 15, 2014]
[2] J. Fuller, “How eFencing Works,” howstuffworks, Aug 2014. [Online]. Available: http://computer.howstuffworks.com/efencing.htm [Aug 15, 2014]
[3] “Tor,” Torproject, Aug. 2014. [Online]. Available: https://www.torproject.org/ [Aug 15, 2014]
[4] “Oregon man recovers stolen bike after sting operation,” Fox News, Aug. 2012. [Online]. Available: http://www.foxnews.com/us/2012/08/16/oregon-man-recovers-stolen-bike-after-sting-operation/ [Oct. 1, 2012]
[5] D. Moye. “Kenneth Schmidgall Tracks Down Stolen IPhone, Fights The Guy Who Has It (VIDEO),” Huffington Post, Aug. 2013. [Online]. Available: http://www.huffingtonpost.com/2013/01/08/kenneth-schmidgall-stolen-iphone_n_2433101.html [Feb. 1, 2014]
[6] Stolen911.com, Aug. 2014. [Online]. Available: http://stolen911.com/ [Oct. 1, 2014]
[7] J. Treadwell, “From the car boot to booting it up? eBay, online counterfeit crime and the transformation of the criminal marketplace,” in Criminology and Criminal Justice , Vol. 12(2), April 2012, pp. 175-191
[8] D. Lieberman and L. Effron, "Is Your Stolen Stuff on Craigslist? Here's What to Do", ABC News, Oct. 2011. [Online]. Available: http://abcnews.go.com/blogs/technology/2011/10/found-your-stolen-stuff-on-craigslist-tips-on-what-to-do-2/ [Oct. 1, 2012]
[9] “Stolen Blackberry Q10”, Stolen911.com, July. 2014. [Online]. Available: http://stolen911.com/category/1063/Stolen-Blackberry-Smartphones/listings/18658/Stolen-Blackberry-Q10.html [July 15, 2014]
[10] “LOST BLACKBERRY SMARTPHONE”, Stolen911.com, July. 2014. [Online]. Available: http://stolen911.com/category/1063/Stolen-Blackberry-Smartphones/listings/19251/LOST-BLACKBERRY-SMARTPHONE.html [July 15, 2014]
[11] E. Brill, “A simple rule-based part of speech tagger,” in Proceedings of the third conference on Applied natural language processing, Trento, Italy, 1992, pp. 152-155
[12] M. Marcus et al., “The Penn Treebank: Annotating Predicate Argument Structure,” In Proceedings of the workshop on Human Language Technology , Plainsboro, NJ, 1994, pp. 114-119
[13] J. Yi et al., “Sentiment analyzer: Extracting sentiment about a given topic using natural language processing techniques,” in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, 2003, pp. 427-434
[14] D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” in Lingvisticae Investigationes, Vol. 30(1), 2007, pp. 3-26
[15] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” in Information Processing and Management, Vol. 24(5), 1988, pp. 513-523
[16] A. Shiri et al., “Thesaurus-enhanced search interfaces,” Journal of Information Science, Vol. 28(2), 2002, pp. 111-112
[17] “Boost C++ Libraries,” Boost, Aug. 2014. [Online]. Available: http://www.boost.org/ [Feb. 2013]
[18] “GSL – GNU Scientific Library,” GNU, Aug. 2014. [Online]. Available: http://www.gnu.org/software/gsl [March 2013]
[19] Craigslist, Aug. 2014. [Online]. Available: http://toronto.en.craigslist.ca/moa/ [Feb. 1, 2013]
[20] C. Mills, “Fake Metropasses and tokens cost TTC close to $2M last year,” Toronto Star, Feb. 17, 2013. [Online]. Available: http://www.thestar.com/news/gta/2013/02/17/fake_metropasses_and_tokens_cost_ttc_close_to_2m_last_year.html [Aug. 15, 2014]