Final Report
Building a Framework for Recommendation Web
Service System with Amazon Product Data
CIS 698 : Independent Study
Presented to
Dr. Sunnie S. Chung
Cleveland State University
By
Sagar Dahiwala
CSU ID : 2689129
Acknowledgements
I would like to thank Dr. Sunnie S. Chung for being my advisor and guide. I
am grateful for her continuous support and the invaluable input she provided
throughout the development of the project. This work would not have been
possible without her support and encouragement.
I would like to express my sincere thanks to Dr. Haodong Wang, MCIS
Program Director, for allowing me to work on this topic under Dr. Sunnie S. Chung.
I would like to express my gratitude and appreciation to Mr. Wayne Largent,
Director, Mind Streams LLC, for giving me the flexibility and time to work on
my project during my summer internship.
I would like to thank my parents and all my dear friends, who have given me
continuous moral support and encouragement and helped me complete the task
successfully.
Abstract
Building an accurate recommendation system is a game changer in today's
fast-growing e-commerce industry. Much research is under way in this field to
develop broad systems that can serve the major industries.
Current recommendation systems depend on only a few specific attributes of
the customer, even though a large retailer holds a huge volume of customer data
and attributes. The major challenge is the accuracy of the online algorithm: how
efficient and accurate the system is directly affects the company's final sales.
We start from the traditional collaborative filtering approach, implemented
with cosine similarity, and handle the scaling issues that arise. We then enhance
the traditional algorithm by adding a weighted similarity approach.
Building a framework for a recommendation web service system with Amazon
product data is intended as a major improvement to the recommendation algorithm.
Table of Contents
Acknowledgements
Abstract
Table of Contents
Introduction
  Data format
  Graphical representation of item-to-item collaborative filtering
Implementation
  Dataset in flat file
  Database
  Traditional Recommendation Approach
    Algorithm
    Data manipulation
    Related Functions
    Final Coding
  Item Similarity Weighted Approach
    Algorithm
    Modification into basic algorithm
Algorithm Results
Current Limitation
Future Enhancement
Reference
1 Introduction
Most collaborative-filtering-based recommender systems build a neighborhood
of like-minded customers. The neighborhood formation scheme usually uses Pearson
correlation or cosine similarity as the measure of proximity.
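As an illustrative sketch (not taken from the project code), the two proximity measures named above can be computed for a pair of rating vectors as follows. The function names cosine and pearson and the sample vectors are assumptions made for this example only.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def pearson(x, y):
    """Pearson correlation, computed as the cosine of the mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical ratings given by four customers to two products
# (0 means the customer did not rate the product).
product_a = [5, 4, 0, 1]
product_b = [4, 5, 1, 0]
cosine_ab = cosine(product_a, product_b)
pearson_ab = pearson(product_a, product_b)
print(cosine_ab, pearson_ab)
```

Pearson correlation subtracts each vector's mean before comparing, so it is less sensitive than plain cosine similarity to vectors whose ratings are uniformly high or low.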
After finding the proximity neighborhood, the system produces two types of
recommendations:
1. A prediction of how much a customer C will like a product P.
2. A recommendation of a list of products for a customer C.
This algorithm has some limitations:
Sparsity: The algorithm relies on exact matches for neighborhood formation.
Scalability: The computation grows with both the number of customers and the
number of products.
Synonymy: Different product names can refer to the same product. Correlation-based
recommender systems are unable to discover this latent association and treat
those products as different.
While building the framework for the recommendation web system, we have taken
each of these limitations of existing systems into account. To start with, let's
discuss the dataset we use.
The dataset holds full information about Amazon Share the Love products. Total items: 548552
1.1 Data format
• Id: Product id (number 0, ..., 548551)
• ASIN: Amazon Standard Identification Number
• title: Name/title of the product
• group: Product group (Book, DVD, Video or Music)
• salesrank: Amazon Sales rank
• similar: ASINs of co-purchased products (people who buy X also buy Y)
• categories: Location in the product category hierarchy to which the product
belongs (category id in [])
• reviews: List of product review information: user id, rating, total number of votes
on the review, total number of helpfulness votes (how many people found the
review to be helpful)
Let's look at one parsed record in JSON form and its fields.
{
'asin': '0827229534',
'group': 'Book',
'title': 'Patterns of Preaching: A Sermon Sampler',
'reviews': [
{
'rating': '5',
'cutomer': 'A2JW67OY8U6HHK',
'votes': '10',
'helpful': '9'
},
{
'rating': '5',
'cutomer': 'A2VE83MZF98ITY',
'votes': '6',
'helpful': '5'
}
],
'salesrank': '396585',
'similar': [
'0804215715',
'156101074X',
'0687023955',
'0687074231',
'082721619X'
],
'id': '1',
'categories': [
[
'Books[283155]',
'Subjects[1000]',
'Religion & Spirituality[22]',
'Christianity[12290]',
'Clergy[12360]',
'Preaching[12368]'
],
[
'Books[283155]',
'Subjects[1000]',
'Religion & Spirituality[22]',
'Christianity[12290]',
'Clergy[12360]',
'Sermons[12370]'
]
]
}
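All numeric values in the parsed record are stored as strings, so they need converting before use. The sketch below shows one way to derive a few figures from such a record; the helper summarize and its output keys are hypothetical names chosen for this example, and the record is a trimmed copy of the one above.

```python
import re

# A category entry looks like "Preaching[12368]": a name followed by an id.
CATEGORY_RE = re.compile(r"^(.*)\[(\d+)\]$")


def summarize(record):
    """Return average rating, helpfulness ratio and leaf categories of a record."""
    reviews = record["reviews"]
    ratings = [int(r["rating"]) for r in reviews]
    votes = sum(int(r["votes"]) for r in reviews)
    helpful = sum(int(r["helpful"]) for r in reviews)
    leaves = set()
    for path in record["categories"]:
        match = CATEGORY_RE.match(path[-1])  # last entry is the leaf category
        if match:
            leaves.add((match.group(1), int(match.group(2))))
    return {
        "avg_rating": sum(ratings) / len(ratings) if ratings else 0.0,
        "helpful_ratio": helpful / votes if votes else 0.0,
        "leaf_categories": leaves,
    }


record = {
    "asin": "0827229534",
    "reviews": [
        {"rating": "5", "cutomer": "A2JW67OY8U6HHK", "votes": "10", "helpful": "9"},
        {"rating": "5", "cutomer": "A2VE83MZF98ITY", "votes": "6", "helpful": "5"},
    ],
    "categories": [
        ["Books[283155]", "Subjects[1000]", "Preaching[12368]"],
        ["Books[283155]", "Subjects[1000]", "Sermons[12370]"],
    ],
}
summary = summarize(record)
print(summary)
```

The key 'cutomer' is spelled as it appears in the dataset, so code that reads the records has to use the same spelling.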
1.2 Graphical representation of item-to-item collaborative filtering
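[Figure: three columns labelled ITEM, CUSTOMER, ITEM. Items 1 and 2, customers A and B, items 1, 2, 3 and 4; resulting item pairs 1-3, 1-4, 1-2 and 2-1.]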
2 Implementation
Before starting the actual algorithm implementation, we should understand the
dataset. The Amazon review metadata set is available as a flat file, taken
from the Stanford Large Network Dataset Collection. We first parse the single-file
dataset into separate files to make insertion into the database easier. Using the SQL
Import/Export tool, I imported all the data into separate tables. The major issue
during import was query execution errors; some character filtering was required
to resolve them.
2.1 Dataset in flat file
Id: 2
ASIN: 0738700797
title: Candlemas: Feast of Flames
group: Book
salesrank: 168596
similar: 5 0738700827 1567184960 1567182836 0738700525 0738700940
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based
Religions[12472]|Wicca[12484]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based
Religions[12472]|Witchcraft[12486]
reviews: total: 12 downloaded: 12 avg rating: 4.5
2001-12-16 cutomer: A11NCO6YTE4BTJ rating: 5 votes: 5 helpful: 4
2002-1-7 cutomer: A9CQ3PLRNIR83 rating: 4 votes: 5 helpful: 5
2002-1-24 cutomer: A13SG9ACZ9O5IM rating: 5 votes: 8 helpful: 8
2002-1-28 cutomer: A1BDAI6VEYMAZA rating: 5 votes: 4 helpful: 4
2002-2-6 cutomer: A2P6KAWXJ16234 rating: 4 votes: 16 helpful: 16
2002-2-14 cutomer: AMACWC3M7PQFR rating: 4 votes: 5 helpful: 5
2002-3-23 cutomer: A3GO7UV9XX14D8 rating: 4 votes: 6 helpful: 6
2002-5-23 cutomer: A1GIL64QK68WKL rating: 5 votes: 8 helpful: 8
2003-2-25 cutomer: AEOBOF2ONQJWV rating: 5 votes: 8 helpful: 5
2003-11-25 cutomer: A3IGHTES8ME05L rating: 5 votes: 5 helpful: 5
2004-2-11 cutomer: A1CP26N8RHYVVO rating: 1 votes: 13 helpful: 9
2005-2-7 cutomer: ANEIANH0WAT9D rating: 5 votes: 1 helpful: 1
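The parsing step described at the start of this section can be sketched as below. This is a minimal illustration under stated assumptions, not the parser used in the project: the function name parse_record is hypothetical, it handles one product block at a time, and it assumes each category path sits on a single line, as in the original file (the sample above wraps them only for page width).

```python
import re

# One review line: date, customer id (spelled "cutomer" in the file), rating,
# total votes and helpful votes.
REVIEW_RE = re.compile(
    r"(\d{4}-\d{1,2}-\d{1,2})\s+cutomer:\s+(\S+)\s+rating:\s+(\d+)"
    r"\s+votes:\s+(\d+)\s+helpful:\s+(\d+)"
)


def parse_record(text):
    """Parse one product block of the flat file into a dictionary."""
    record = {"similar": [], "categories": [], "reviews": []}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        review = REVIEW_RE.match(line)
        if review:
            _date, customer, rating, votes, helpful = review.groups()
            record["reviews"].append({
                "rating": rating, "cutomer": customer,
                "votes": votes, "helpful": helpful,
            })
        elif line.startswith("|"):
            # One category path per line, e.g. |Books[283155]|Subjects[1000]
            record["categories"].append(line.strip("|").split("|"))
        else:
            # Split on the first colon only, because titles may contain colons.
            key, _, value = line.partition(":")
            key, value = key.strip().lower(), value.strip()
            if key == "similar":
                # The first token is the count, the rest are ASINs.
                record["similar"] = value.split()[1:]
            elif key in ("id", "asin", "title", "group", "salesrank"):
                record[key] = value
            # "categories: N" and "reviews: total: ..." only announce the
            # lines that follow, so they are skipped here.
    return record


sample = """\
Id: 2
ASIN: 0738700797
title: Candlemas: Feast of Flames
group: Book
salesrank: 168596
similar: 5 0738700827 1567184960 1567182836 0738700525 0738700940
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]
reviews: total: 12 downloaded: 12 avg rating: 4.5
2001-12-16 cutomer: A11NCO6YTE4BTJ rating: 5 votes: 5 helpful: 4
2002-1-7 cutomer: A9CQ3PLRNIR83 rating: 4 votes: 5 helpful: 5
"""
parsed = parse_record(sample)
print(parsed["title"], len(parsed["reviews"]))
```

The resulting dictionary has the same shape as the parsed record shown in section 1.1, with every value kept as a string.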
2.2 Database
Figure 1 Customer
Figure 2 Product
Figure 3 Product Rating
Figure 4 Customer purchase history
2.3 Traditional Recommendation Approach
2.3.1 Algorithm
Find correlated products based on the products purchased by a given customer.
Input: customer_id = X
Output: product_list = pList
Step 1 : find each product purchased by the customer.
Y ← select product_id from product_purchase_history where
customer_id = X
Step 2 : find all the customers who also bought these products (Y).
Z ← select customer_id from product_purchase_history where product_id in
(Y)
Step 3 : select all the products bought by these customers (Z).
pList ← select * from product_purchase_history where customer_id in (Z)
The algorithm is straightforward. We create a view to export the data from SQL
and use the exported file as the input to the algorithm.
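The three steps above can be sketched in memory as follows. This is an illustration only: the function name related_products and the sample purchase history are assumptions, and purchases are represented as plain (customer_id, product_id) pairs rather than rows of the product_purchase_history table.

```python
def related_products(purchases, customer_id):
    """Return the products bought by customers who share a purchase with customer_id.

    purchases is an iterable of (customer_id, product_id) pairs.
    """
    purchases = list(purchases)
    # Step 1: products bought by the given customer (Y).
    own = {p for c, p in purchases if c == customer_id}
    # Step 2: customers who also bought one of those products (Z).
    neighbours = {c for c, p in purchases if p in own}
    # Step 3: every product bought by those customers (pList).
    return {p for c, p in purchases if c in neighbours}


# Hypothetical purchase history.
history = [("A", "1"), ("A", "2"), ("B", "2"), ("B", "3"), ("C", "4")]
p_list = related_products(history, "A")
print(sorted(p_list))
```

Because Step 2 also selects the given customer, the result still contains the products that customer already bought; a caller that wants only new products can subtract the Step 1 set.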
2.3.2 Data manipulation
2.3.2.1 SQL View
2.3.2.2 View Result as a flat file
Below is a snapshot of the file exported from the created view.
2.3.3 Related Functions
Let's discuss the functions used in this implementation.
def get_product_by_customer(records, related_customer):
returns all the products purchased by the customers in related_customer,
which is passed as the second parameter.
def get_cosine(vector1, vector2):
returns the cosine value between two product vectors.
def cosine_similarity(customer_product_purchase, related_product,
related_customer_detail):
generates the cosine matrix for the given customer.
2.3.4 Final Coding
import json
import math
import sys
from collections import Counter


def get_product_by_customer(records, related_customer):
    # Product id of every purchase made by the given customers. Duplicates
    # are kept on purpose: they are counted later to rank the products.
    return [row[1] for row in records if row[0] in related_customer]


def get_cosine(vector1, vector2):
    # Debug aid: keep the most recently compared pair of vectors on disk.
    with open("vector.txt", "w") as fpv:
        json.dump([vector1, vector2], fpv)
    v1 = 0.0
    v2 = 0.0
    numerator = 0.0
    for v1_i, v2_i in zip(vector1, vector2):
        v1_i = float(v1_i)
        v2_i = float(v2_i)
        numerator += v1_i * v2_i
        v1 += v1_i * v1_i
        v2 += v2_i * v2_i
    den = math.sqrt(v1) * math.sqrt(v2)
    if den != 0:
        return numerator / den
    return 0.0


def cosine_similarity(customer_product_purchase, related_product,
                      related_customer_detail):
    print("Find Cosine Similarity")
    # limit the process to the top 100 recommendation products
    related_product = Counter(related_product).most_common(100)
    # One vector slot per related customer, holding the rating that customer
    # gave the product (0 when the customer has not rated it).
    customers = sorted(set(row[0] for row in related_customer_detail))
    ratings = {}
    for row in related_customer_detail:
        ratings[(row[0], row[1])] = float(row[2])
    product_vector = {}
    for product, _count in related_product:
        product_vector[product] = [ratings.get((c, product), 0.0)
                                   for c in customers]
    # cosine similarity matrix (upper triangle only)
    cosine_similarity_matrix = {}
    for i in range(len(related_product)):
        product_id = related_product[i][0]
        cosine_similarity_matrix[product_id] = {}
        print("COSINE EVAL : {}".format(product_id))
        for j in range(i + 1, len(related_product)):
            cosine_product_id = related_product[j][0]
            cosine_value = get_cosine(product_vector[product_id],
                                      product_vector[cosine_product_id])
            cosine_similarity_matrix[product_id][cosine_product_id] = cosine_value
    return cosine_similarity_matrix


def main():
    print("Initiate Cosine Similarity")
    if len(sys.argv) < 2:
        return
    customer_id = sys.argv[1]
    records = []
    customer_product_purchase = []
    with open("view_product_purchase.txt") as product_purchase:
        next(product_purchase)  # skip the header row
        # get all the purchase detail
        for row in product_purchase:
            current_row = row.strip().split("\t")
            records.append(current_row)
            if customer_id == current_row[0]:
                customer_product_purchase.append(current_row[1])
    print("=== PRODUCT PURCHASE : TOTAL {} ===".format(len(customer_product_purchase)))
    if not customer_product_purchase:
        return
    # find all those customers who bought the same products
    purchased = set(customer_product_purchase)
    related_customer = set(row[0] for row in records if row[1] in purchased)
    print("=== RELATED CUSTOMER : TOTAL {} ===".format(len(related_customer)))
    # Keep every purchase row of the related customers, so that each
    # candidate product gets a populated rating vector.
    related_customer_detail = [row for row in records if row[0] in related_customer]
    # find all the products bought by the related customers
    related_product = get_product_by_customer(records, related_customer)
    print("=== RELATED PRODUCT : TOTAL {} ===".format(len(related_product)))
    cosine_matrix = cosine_similarity(customer_product_purchase,
                                      related_product, related_customer_detail)
    # The top related products for customer_product_purchase are read from
    # cosine_matrix in a later step; here the matrix is only stored.
    file_name = customer_id + ".txt"
    with open(file_name, "w") as fp:
        json.dump(cosine_matrix, fp)


if __name__ == "__main__":
    main()
2.4 Item Similarity Weighted Approach
2.4.1 Algorithm
Step 1 : find the list of products Iu rated by the given customer_id
Step 2 : find the list of all similar products J according to the traditional approach
Step 3 : calculate weight(i, j) using the cosine similarity value.
Step 4 : evaluate the weighted sum for each product j in J, using the equation below
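$$P_{u,j} = \frac{\sum_{i \in I_u} \mathrm{weight}(i, j) \, r_{u,i}}{\sum_{i \in I_u} \mathrm{weight}(i, j)}$$
where $r_{u,i}$ is the rating that customer u gave to product i.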
Step 5 : find the top weighted products using this value.
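As a worked sketch of Steps 3 to 5, the weighted sum can be computed as shown below. The function name predict_rating, the dictionary layout and the sample numbers are assumptions made for this illustration. The weight lookup checks both key orders because the cosine matrix built in section 2.3.4 stores each product pair only once.

```python
def predict_rating(customer_ratings, weights, candidate):
    """Similarity-weighted average of a customer's ratings for one candidate product.

    customer_ratings: {product_id: rating} for the products in I_u
    weights: {(product_i, product_j): cosine similarity}, one entry per pair
    """
    numerator = 0.0
    denominator = 0.0
    for i, rating in customer_ratings.items():
        # The pair may be stored as (i, candidate) or (candidate, i).
        w = weights.get((i, candidate), weights.get((candidate, i), 0.0))
        numerator += w * rating
        denominator += w
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical ratings and similarities.
ratings = {"p1": 5.0, "p2": 3.0}
weights = {("p1", "p3"): 0.8, ("p2", "p3"): 0.2, ("p4", "p1"): 0.5}
scores = {j: predict_rating(ratings, weights, j) for j in ("p3", "p4", "p5")}
# Step 5: rank the candidates by their weighted score.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

A candidate with no similarity to any rated product gets a score of 0 rather than a division by zero, which matches the guard used in the implementation that follows.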
2.4.2 Modification into basic algorithm
Snapshot of the updated algorithm implementation
import json
from operator import itemgetter

# Assumes these names are already defined earlier in the script:
# similar_product, customer_rated_product, customer_rated_product_detail,
# product_vector, weight(), connection and customer_id.
product_weight = {}
for j in similar_product:
    # restart the sums for every candidate product j
    numerator = 0.0
    denominator = 0.0
    for i in customer_rated_product:
        # need to find the weight between i and j
        w = weight(product_vector[i], product_vector[j[0]])
        r = customer_rated_product_detail[i][2]
        # calculate sum of j
        numerator += w * float(r)
        denominator += w
    if denominator > 0:
        product_weight[j] = numerator / denominator
    else:
        product_weight[j] = 0

top_products = sorted(product_weight.items(), key=itemgetter(1), reverse=True)

json_results = []
cursor = connection.cursor()
s_q_l_command = "SELECT p.*, ISNULL(d.total,0), ISNULL(d.downloaded,0), ISNULL(d.avg_rating,0) " \
                "from product p LEFT JOIN (select * from product_detail) d ON d.product_id=p.product_id " \
                "WHERE p.product_id = ?"
for recommended_product in top_products[:10]:
    # each key of product_weight is a (product_id, count) pair
    values = [recommended_product[0][0]]
    cursor.execute(s_q_l_command, values)
    results = cursor.fetchone()
    if results is None:
        print(recommended_product)
    else:
        # copy the row into a plain list before serializing it
        json_results.append(list(results))

file_name = customer_id + "_1_1.txt"
with open(file_name, "w") as fp:
    json.dump(json_results, fp)
3 Algorithm Results
For study purposes, I store all the intermediate output and the final recommended
products in flat files.
Test customer_id = A1JHHXJMLKSRW9, who rated the following products:
List of products recommended by the algorithm:
[
["0061007129", "Book", "Kane & Abel", "137", "137", "4.5"],
["0439136350", "Book", "Harry Potter and the Prisoner of Azkaban (Book 3)", "0", "0", "0"],
["0446525774", "Book", "Saving Faith", "0", "0", "0"],
["0807282324", "Book", "Harry Potter and the Prisoner of Azkaban (Book 3 Audio CD)", "0", "0", "0"],
["0385490992", "Book", "The Street Lawyer", "0", "0", "0"],
["0747545111", "Book", "Harry Potter and the Prisoner of Azkaban", "0", "0", "0"],
["0385470819", "Book", "A Time to Kill", "318", "318", "4.5"],
["0140247750", "Book", "The Grapes of Wrath : Text and Criticism; Revised Edition (Viking Critical Library)", "517", "517", "4.5"],
["0553502220", "Book", "A Time to Kill", "0", "0", "0"],
["B00005R23Y", "DVD", "The Patriot (Superbit Deluxe Collection)", "0", "0", "0"]
]
4 Current Limitation
The current recommendation system works only on the customer's purchases
and ratings. However, multiple factors affect which product a user buys
next. One example is the text classification of products into specific categories,
such as books, movies, drama, short films, and entertainment media. Another is
the basic correlation between two categories.
The system also lacks an initial filtering of the product and customer lists
based on the most popular or least common products, and it does not exclude
customers who purchased far more or far fewer products than others.
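One possible form of such a filter is sketched below. It is not part of the current system; the function name filter_by_activity and both thresholds are hypothetical and would need tuning against the real purchase history.

```python
from collections import Counter


def filter_by_activity(purchases, min_purchases=2, max_purchases=100):
    """Drop the purchase rows of customers who bought too few or too many products.

    purchases is an iterable of (customer_id, product_id) pairs.
    """
    purchases = list(purchases)
    counts = Counter(customer for customer, _product in purchases)
    return [
        (customer, product)
        for customer, product in purchases
        if min_purchases <= counts[customer] <= max_purchases
    ]


# Hypothetical purchase history: A bought 2 products, B bought 1, C bought 4.
history = [("A", "1"), ("A", "2"), ("B", "2"),
           ("C", "1"), ("C", "2"), ("C", "3"), ("C", "4")]
kept = filter_by_activity(history, min_purchases=2, max_purchases=3)
print(kept)
```

The same counting approach applies to products: counting rows per product_id instead of per customer_id would filter out the most and least common products.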
Thus, recommending products based on all of these factors remains a major
challenge for computer science professionals.
5 Future Enhancement
We plan to take the project to the next level of recommendation system, one
that includes personalized attributes such as the customer's location and the
time of purchase.
In the current scenario, location is a major attribute that can help
increase the accuracy of the algorithm. The dataset for a recommendation
system is always huge for a large enterprise, so initial filtering based on location
is very important for reducing the N x N matrix, where N is the number of products.
Once we can relate the time of purchase to specific products, we can predict
recurring purchases.
Each year, based on customers' day-to-day purchase history, we can analyze
and predict purchases for similar customers. Most customers purchase
household products online, so recommendations from the most popular categories
may make our system more accurate.
However, any single e-commerce system must implement multiple algorithms,
which fall into two major categories:
1. Online Recommendation
2. Offline Recommendation
6 Reference
1. Python documentation
https://docs.python.org/devguide/
2. Amazon data set
https://snap.stanford.edu/data/amazon-meta.html
3. Greg Linden, Brent Smith, and Jeremy York, "Amazon.com Recommendations:
Item-to-Item Collaborative Filtering", Amazon.com, 2003
4. Lei Deng, Xi'an, Jerry Gao, and Chandrasekar Vuppalapati, "Building a Big
Data Analytics Service Framework for Mobile Advertising and Marketing"
5. Jianfeng Hu, Bo Zhang, "Product Recommendation System", Stanford, 2012
http://snap.stanford.edu/class/cs224w-2012/projects/cs224w-044-final.v01.pdf