
Final Report Aug09 - Cleveland State University


Final Report

Building a Framework for Recommendation Web

Service System with Amazon Product Data

CIS 698 : Independent Study

Presented to

Dr. Sunnie S. Chung

Cleveland State University

By

Sagar Dahiwala

CSU ID : 2689129


Acknowledgements

I would like to thank Dr. Sunnie S. Chung for being my advisor and guide. I am grateful for her continuous support and the invaluable input she provided throughout the development of this project. This work would not have been possible without her support and encouragement.

I would like to express my sincere thanks to Dr. Haodong Wang, MCIS Program Director, for allowing me to work on this topic under Dr. Sunnie S. Chung.

I would like to express my gratitude and appreciation to Mr. Wayne Largent, Director, Mind Streams LLC, for giving me the flexibility and time to work on my project during my summer internship.

I would like to thank my parents and all my dear friends for their continuous moral support and encouragement, and for helping me complete the task successfully.


Abstract

Building an accurate recommendation system is a game changer in today’s fast-growing e-commerce industry. Much research is under way in this field to develop broad systems that can cover major industries.

Current recommendation systems depend only on a few specific customer-related attributes, even though large retailers hold huge volumes of customer data and attributes. The major challenge is the accuracy of the online algorithm: how efficient and accurate the system is directly affects the final sales of the company.

We start with the traditional collaborative filtering approach, implemented with cosine similarity, and address the scaling issues we encountered. We then enhance the traditional algorithm by adding a weighted similarity approach.

Building a framework for a recommendation web service system with Amazon product data is intended as a major change to the recommendation algorithm.


Table of Contents

Acknowledgements
Abstract
Table of Contents
Introduction
    Data format
    Graphical representation of item-to-item collaborative filtering
Implementation
    Dataset in flat file
    Database
    Traditional Recommendation Approach
        Algorithm
        Data manipulation
        Related Functions
        Final Coding
    Item Similarity Weighted Approach
        Algorithm
        Modification into basic algorithm
Algorithm Results
Current Limitation
Future Enhancement
Reference


1 Introduction

Most recommender systems based on collaborative filtering build a neighborhood of like-minded customers. The neighborhood formation scheme usually uses Pearson correlation or cosine similarity as the measure of proximity.

Once the neighborhood is found, the system produces two types of recommendations:

1. A prediction of how much a customer C will like a product P.

2. A recommendation of a list of products for a customer C.

This algorithm has some limitations:

Sparsity: The algorithm relies on exact matches for neighborhood formation.

Scalability: The computation grows with both the number of customers and the number of products.

Synonymy: Different product names can refer to the same product. Correlation-based recommender systems are unable to discover this latent association and treat those products as different.

While building the framework for the recommendation web system, we took each of these limitations of existing systems into account. To start with, let us discuss the dataset we use.

Dataset: full information about Amazon Share the Love products. Total items: 548552

1.1 Data format

• Id: Product id (number 0, ..., 548551)

• ASIN: Amazon Standard Identification Number

• title: Name/title of the product

• group: Product group (Book, DVD, Video or Music)

• salesrank: Amazon Sales rank

• similar: ASINs of co-purchased products (people who buy X also buy Y)

• categories: Location in the product category hierarchy to which the product belongs (category id in [])

• reviews: Product review information: user id, rating, total number of votes on the review, and total number of helpfulness votes (how many people found the review to be helpful)

Let us look at one parsed record in JSON form and its fields.

{
    'asin': '0827229534',
    'group': 'Book',
    'title': 'Patterns of Preaching: A Sermon Sampler',
    'reviews': [
        {
            'rating': '5',
            'cutomer': 'A2JW67OY8U6HHK',
            'votes': '10',
            'helpful': '9'
        },
        {
            'rating': '5',
            'cutomer': 'A2VE83MZF98ITY',
            'votes': '6',
            'helpful': '5'
        }
    ],
    'salesrank': '396585',
    'similar': [
        '0804215715',
        '156101074X',
        '0687023955',
        '0687074231',
        '082721619X'
    ],
    'id': '1',
    'categories': [
        [
            'Books[283155]',
            'Subjects[1000]',
            'Religion & Spirituality[22]',
            'Christianity[12290]',
            'Clergy[12360]',
            'Preaching[12368]'
        ],
        [
            'Books[283155]',
            'Subjects[1000]',
            'Religion & Spirituality[22]',
            'Christianity[12290]',
            'Clergy[12360]',
            'Sermons[12370]'
        ]
    ]
}
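As an illustration of how these fields can be consumed, the sketch below takes a cut-down copy of the record above and derives the average rating and the share of helpful votes. The helper name `summarize_reviews` is hypothetical and not part of the project code. Note that the dataset spells the reviewer key `cutomer` and stores every value as a string.

```python
def summarize_reviews(product):
    """Return (average rating, helpful-vote ratio) for one parsed product.

    Hypothetical helper for illustration; values in the record are strings,
    so they are converted before any arithmetic.
    """
    reviews = product.get("reviews", [])
    if not reviews:
        return 0.0, 0.0
    ratings = [int(r["rating"]) for r in reviews]
    votes = sum(int(r["votes"]) for r in reviews)
    helpful = sum(int(r["helpful"]) for r in reviews)
    average = sum(ratings) / float(len(ratings))
    ratio = helpful / float(votes) if votes else 0.0
    return average, ratio


# Cut-down copy of the record shown above.
product = {
    "asin": "0827229534",
    "group": "Book",
    "reviews": [
        {"rating": "5", "cutomer": "A2JW67OY8U6HHK", "votes": "10", "helpful": "9"},
        {"rating": "5", "cutomer": "A2VE83MZF98ITY", "votes": "6", "helpful": "5"},
    ],
}
average, ratio = summarize_reviews(product)
print(average, ratio)  # 5.0 0.875
```

Here 14 of the 16 votes were helpful, which gives the ratio 0.875.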

1.2 Graphical representation of item-to-item collaborative filtering

[Figure: item-to-item collaborative filtering. Columns: ITEM (1, 2), CUSTOMER (A, B), ITEM (1, 2, 3, 4). Item pairs shown: 1-3, 1-4, 1-2, 2-1.]


2 Implementation

Before starting the actual algorithm implementation, we should understand the dataset. The Amazon review metadata set is available as a flat file, taken from the Stanford Large Network Dataset Collection. We first parsed the single dataset file into separate files to make insertion into the database easier. Using the SQL import/export tool, I then imported all the data into separate tables. The major issue during import was query execution errors, which required filtering out some characters.

2.1 Dataset in flat file

Id: 2
ASIN: 0738700797
  title: Candlemas: Feast of Flames
  group: Book
  salesrank: 168596
  similar: 5 0738700827 1567184960 1567182836 0738700525 0738700940
  categories: 2
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]
  reviews: total: 12 downloaded: 12 avg rating: 4.5
    2001-12-16 cutomer: A11NCO6YTE4BTJ rating: 5 votes: 5 helpful: 4
    2002-1-7 cutomer: A9CQ3PLRNIR83 rating: 4 votes: 5 helpful: 5
    2002-1-24 cutomer: A13SG9ACZ9O5IM rating: 5 votes: 8 helpful: 8
    2002-1-28 cutomer: A1BDAI6VEYMAZA rating: 5 votes: 4 helpful: 4
    2002-2-6 cutomer: A2P6KAWXJ16234 rating: 4 votes: 16 helpful: 16
    2002-2-14 cutomer: AMACWC3M7PQFR rating: 4 votes: 5 helpful: 5
    2002-3-23 cutomer: A3GO7UV9XX14D8 rating: 4 votes: 6 helpful: 6
    2002-5-23 cutomer: A1GIL64QK68WKL rating: 5 votes: 8 helpful: 8
    2003-2-25 cutomer: AEOBOF2ONQJWV rating: 5 votes: 8 helpful: 5
    2003-11-25 cutomer: A3IGHTES8ME05L rating: 5 votes: 5 helpful: 5
    2004-2-11 cutomer: A1CP26N8RHYVVO rating: 1 votes: 13 helpful: 9
    2005-2-7 cutomer: ANEIANH0WAT9D rating: 5 votes: 1 helpful: 1
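A record in this layout could be turned into the parsed form shown in section 1.1 roughly as follows. This is a minimal Python 3 sketch, not the parser used in the project: the function name `parse_record` and the regular expression are assumptions, and the sketch handles one product block at a time.

```python
import re

# A review line looks like:
#   2001-12-16 cutomer: A11NCO6YTE4BTJ rating: 5 votes: 5 helpful: 4
REVIEW_RE = re.compile(
    r"\S+\s+cutomer:\s+(\S+)\s+rating:\s+(\d+)\s+votes:\s+(\d+)\s+helpful:\s+(\d+)"
)


def parse_record(text):
    """Parse one flat-file product block into a dict (hypothetical helper)."""
    product = {"similar": [], "categories": [], "reviews": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("|"):
            # Category path: |Books[283155]|Subjects[1000]|...
            product["categories"].append(line.strip("|").split("|"))
            continue
        match = REVIEW_RE.match(line)
        if match:
            customer, rating, votes, helpful = match.groups()
            product["reviews"].append({
                "rating": rating, "cutomer": customer,
                "votes": votes, "helpful": helpful,
            })
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "similar":
            # The first token is the count, the rest are ASINs.
            product["similar"] = value.split()[1:]
        elif key in ("id", "asin", "title", "group", "salesrank"):
            product[key] = value
    return product


sample = """
Id: 2
ASIN: 0738700797
  title: Candlemas: Feast of Flames
  group: Book
  salesrank: 168596
  similar: 5 0738700827 1567184960 1567182836 0738700525 0738700940
  categories: 2
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]
  reviews: total: 12 downloaded: 12 avg rating: 4.5
    2001-12-16 cutomer: A11NCO6YTE4BTJ rating: 5 votes: 5 helpful: 4
    2002-1-7 cutomer: A9CQ3PLRNIR83 rating: 4 votes: 5 helpful: 5
"""
product = parse_record(sample)
print(product["title"], len(product["similar"]), len(product["reviews"]))
```

The `categories:` and `reviews:` summary lines are skipped on purpose, since the counts can be recovered from the parsed lists.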

2.2 Database

Figure 1 Customer

Figure 2 Product

Figure 3 Product Rating

Figure 4 Customer purchase history


2.3 Traditional Recommendation Approach

2.3.1 Algorithm

Find correlated products based on the products purchased by a given customer.

Input: customer_id = X

Output: product_list = pList

Step 1: Find each product purchased by the customer.

Y ← select product_id from product_purchase_history where customer_id = X

Step 2: Find all the customers who also bought these products (Y).

Z ← select customer_id from product_purchase_history where product_id in (Y)

Step 3: Select all the products bought by these customers (Z).

pList ← select * from product_purchase_history where customer_id in (Z)

The algorithm is straightforward. We create a view to export the data from SQL and use the exported file as the input to the algorithm.
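The three steps can be sketched in memory as follows. This is an illustrative Python 3 sketch under the assumption that the purchase history is available as a list of (customer_id, product_id) pairs; the function name `related_products` and the sample data are made up for the example.

```python
def related_products(purchases, customer_id):
    """Return products bought by customers who share a purchase with customer_id.

    `purchases` is a list of (customer_id, product_id) pairs standing in for
    the product_purchase_history table (an assumption for this sketch).
    """
    # Step 1: products purchased by the given customer (Y).
    own = {p for c, p in purchases if c == customer_id}
    # Step 2: customers who also bought any of those products (Z).
    neighbours = {c for c, p in purchases if p in own}
    # Step 3: every product bought by those customers (pList).
    return sorted({p for c, p in purchases if c in neighbours})


purchases = [
    ("A", "p1"), ("A", "p2"),
    ("B", "p2"), ("B", "p3"),
    ("C", "p4"),
]
print(related_products(purchases, "A"))  # ['p1', 'p2', 'p3']
```

As in the SQL version, the customer's own purchases are part of the result, because the customer is also a member of Z.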


2.3.2 Data manipulation

2.3.2.1 SQL View

2.3.2.2 View Result as a flat file

Below is a snapshot of the flat file exported from the created view.
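Judging by how the final code indexes each row, the exported file appears to be tab-separated with a header line, holding the customer id, product id, and rating in that order. Under that assumption, a minimal Python 3 sketch for reading it could look like this; `read_purchases` and the header names in the sample are hypothetical.

```python
import csv
import io


def read_purchases(handle):
    """Yield (customer_id, product_id, rating) from a tab-separated export.

    The column order is an assumption inferred from the later code.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) >= 3:
            yield row[0], row[1], float(row[2])


# Stand-in for the exported file; the header names are made up for the example.
export = io.StringIO(
    "customer_id\tproduct_id\trating\n"
    "A11NCO6YTE4BTJ\t0738700797\t5\n"
    "A9CQ3PLRNIR83\t0738700797\t4\n"
)
rows = list(read_purchases(export))
print(rows[0])  # ('A11NCO6YTE4BTJ', '0738700797', 5.0)
```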


2.3.3 Related Functions

The following functions are used in this implementation.

def get_product_by_customer(records, related_customer):

Returns all the products purchased by the customers in the related_customer array passed as the second parameter.

def get_cosine(vector1, vector2):

Returns the cosine value between two product vectors.

def cosine_similarity(customer_product_purchase, related_product, related_customer_detail):

Generates the cosine matrix for the given customer.
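For reference, the cosine value between two rating vectors is their dot product divided by the product of their lengths. The following is a compact Python 3 sketch of that calculation with made-up ratings; it is an illustration, not the project function itself.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Ratings given to two products by the same three customers (made-up numbers);
# 0 means the customer did not rate that product.
product_x = [5, 0, 4]
product_y = [4, 0, 5]
print(round(cosine(product_x, product_y), 4))  # 0.9756
```

Here the dot product is 40 and both vectors have squared length 41, so the cosine is 40/41. A vector of all zeros yields 0.0 rather than a division error.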

2.3.4 Final Coding

import json
import math
import sys

from nltk import FreqDist


def get_product_by_customer(records, related_customer):
    """Return the product id of every purchase made by the related customers."""
    # Duplicates are kept on purpose: the repeat count ranks the products later.
    return [row[1] for row in records if row[0] in related_customer]


def get_cosine(vector1, vector2):
    """Return the cosine value between two equal-length product vectors."""
    # Store the pair currently being compared as intermediate output.
    with open("vector.txt", "w") as fpv:
        json.dump([vector1, vector2], fpv)
    numerator = 0.0
    v1 = 0.0
    v2 = 0.0
    for v1_i, v2_i in zip(vector1, vector2):
        v1_i = float(v1_i)
        v2_i = float(v2_i)
        numerator += v1_i * v2_i
        v1 += v1_i * v1_i
        v2 += v2_i * v2_i
    den = math.sqrt(v1) * math.sqrt(v2)
    if den != 0:
        return numerator / den
    return 0.0


def cosine_similarity(customer_product_purchase, related_product,
                      related_customer_detail):
    """Build the pairwise cosine matrix for the most frequent related products."""
    print("Find Cosine Similarity")
    # Limit the process to the top 100 recommendation products.
    fdist = FreqDist(related_product)
    related_product = fdist.most_common(100)

    # One vector slot per related customer, so two products share a non-zero
    # slot whenever the same customer rated both of them.
    customers = sorted(set(row[0] for row in related_customer_detail))
    rating = {}
    for row in related_customer_detail:
        rating[(row[0], row[1])] = float(row[2])
    product_vector = {}
    for product, _count in related_product:
        product_vector[product] = [rating.get((customer, product), 0.0)
                                   for customer in customers]

    # Cosine similarity matrix (each unordered pair is evaluated once).
    cosine_similarity_matrix = {}
    for i in range(len(related_product)):
        product_id = related_product[i][0]
        cosine_similarity_matrix[product_id] = {}
        print("COSINE EVAL : ", product_id)
        for j in range(i + 1, len(related_product)):
            cosine_product_id = related_product[j][0]
            cosine_value = get_cosine(product_vector[product_id],
                                      product_vector[cosine_product_id])
            cosine_similarity_matrix[product_id][cosine_product_id] = cosine_value
    return cosine_similarity_matrix


def main():
    print("Initiate Cosine Similarity")
    if len(sys.argv) < 2:
        return
    customer_id = sys.argv[1]
    records = []
    customer_product_purchase = []
    with open("view_product_purchase.txt") as product_purchase:
        next(product_purchase)  # skip the first line of the export
        # get all the purchase detail
        for row in product_purchase:
            current_row = row.strip().split("\t")
            records.append(current_row)
            if customer_id == current_row[0]:
                customer_product_purchase.append(current_row[1])
    print("=== PRODUCT PURCHASE : TOTAL {} ===".format(len(customer_product_purchase)))
    if not customer_product_purchase:
        return

    # find all those customers who bought the same products
    related_customer = set(row[0] for row in records
                           if row[1] in customer_product_purchase)
    print("=== RELATED CUSTOMER : TOTAL {} ===".format(len(related_customer)))

    # keep every purchase row of the related customers, so that the product
    # vectors cover all the products those customers rated
    related_customer_detail = [row for row in records
                               if row[0] in related_customer]

    # find all the products bought by the related customers
    related_product = get_product_by_customer(records, related_customer)
    print("=== RELATED PRODUCT : TOTAL {} ===".format(len(related_product)))

    cosine_matrix = cosine_similarity(customer_product_purchase,
                                      related_product, related_customer_detail)

    # check for top related product according to customer purchase:
    # customer_product_purchase gets its top list from cosine_matrix
    file_name = customer_id + ".txt"
    with open(file_name, "w") as fp:
        json.dump(cosine_matrix, fp)


if __name__ == "__main__":
    main()


2.4 Item Similarity Weighted Approach

2.4.1 Algorithm

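Step 1: Find the list of products I_u rated by the given customer_id.

Step 2: Find the list of all similar products J according to the traditional approach.

Step 3: Calculate weight(i, j) for each product i in I_u and each product j in J, using the cosine similarity value.

Step 4: Evaluate the weighted sum for each product j in J, using the equation below.

$$P_{u,j} = \frac{\sum_{i \in I_u} w_{i,j} \, r_{u,i}}{\sum_{i \in I_u} w_{i,j}}$$

where w_{i,j} = weight(i, j) and r_{u,i} is the rating that customer u gave to product i.

Step 5: Select the top-weighted products from the resulting scores.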
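The weighted sum in Step 4 and the ranking in Step 5 can be sketched as follows. This is an illustrative Python 3 example with made-up ratings and weights; the function name `predict` and the dictionary layout are assumptions chosen for the example.

```python
def predict(ratings, weights, candidate):
    """Weighted-sum score of `candidate` for one customer.

    ratings: {product_id: rating} for the products the customer rated.
    weights: {(rated_product, candidate_product): similarity weight}.
    """
    numerator = 0.0
    denominator = 0.0
    for product, rating in ratings.items():
        w = weights.get((product, candidate), 0.0)
        numerator += w * rating
        denominator += w
    return numerator / denominator if denominator > 0 else 0.0


# Made-up ratings and weights for illustration.
ratings = {"i1": 5.0, "i2": 3.0}
weights = {("i1", "j1"): 0.9, ("i2", "j1"): 0.1,
           ("i1", "j2"): 0.2, ("i2", "j2"): 0.8}
scores = {j: predict(ratings, weights, j) for j in ("j1", "j2")}
top = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(top[0][0])  # j1
```

Product j1 is most similar to the highly rated i1, so its score (about 4.8) ranks above that of j2 (about 3.4).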

2.4.2 Modification into basic algorithm

Snapshot of the updated algorithm implementation:

from operator import itemgetter

product_weight = {}
for j in similar_product:
    # weighted sum of the customer's ratings for candidate product j
    numerator = 0.0
    denominator = 0.0
    for i in customer_rated_product:
        # need to find the weight between i and j
        w = weight(product_vector[i], product_vector[j[0]])
        r = customer_rated_product_detail[i][2]
        # calculate sum of j
        numerator += w * float(r)
        denominator += w
    if denominator > 0:
        product_weight[j[0]] = numerator / denominator
    else:
        product_weight[j[0]] = 0

top_products = sorted(product_weight.items(), key=itemgetter(1), reverse=True)

json_results = []
cursor = connection.cursor()
sql_command = "SELECT p.*, ISNULL(d.total,0), ISNULL(d.downloaded,0), ISNULL(d.avg_rating,0) " \
              "FROM product p LEFT JOIN (SELECT * FROM product_detail) d ON d.product_id = p.product_id " \
              "WHERE p.product_id = ?"
for recommended_product in top_products[:10]:
    values = [recommended_product[0]]
    cursor.execute(sql_command, values)
    results = cursor.fetchone()
    if results is None:
        # no matching product row: show the unmatched recommendation
        print(recommended_product)
    else:
        # convert the database row to a plain list so it can be dumped as JSON
        json_results.append(list(results))

file_name = customer_id + "_1_1.txt"
with open(file_name, "w") as fp:
    json.dump(json_results, fp)


3 Algorithm Results

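For study purposes, all intermediate output and the final list of recommended products are stored in flat files.

Testing with Customer_id = A1JHHXJMLKSRW9, who rated the following products:

[Figure: products rated by the test customer]

List of products recommended by the algorithm. Each entry holds the product columns followed by the total, downloaded, and avg_rating values returned by the query in section 2.4.2: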

[
["0061007129", "Book", "Kane & Abel", "137", "137", "4.5"],
["0439136350", "Book", "Harry Potter and the Prisoner of Azkaban (Book 3)", "0", "0", "0"],
["0446525774", "Book", "Saving Faith", "0", "0", "0"],
["0807282324", "Book", "Harry Potter and the Prisoner of Azkaban (Book 3 Audio CD)", "0", "0", "0"],
["0385490992", "Book", "The Street Lawyer", "0", "0", "0"],
["0747545111", "Book", "Harry Potter and the Prisoner of Azkaban", "0", "0", "0"],
["0385470819", "Book", "A Time to Kill", "318", "318", "4.5"],
["0140247750", "Book", "The Grapes of Wrath : Text and Criticism; Revised Edition (Viking Critical Library)", "517", "517", "4.5"],
["0553502220", "Book", "A Time to Kill", "0", "0", "0"],
["B00005R23Y", "DVD", "The Patriot (Superbit Deluxe Collection)", "0", "0", "0"]
]


4 Current Limitation

The current recommendation system works only on customer purchases and ratings. However, multiple other factors affect which product a user buys next, for example the text classification of a product into specific categories such as books, movies, drama, short films, and other entertainment media, or the basic correlation between two categories.

The system also lacks an initial filtering of the product and customer lists, for example based on the most popular or least common products, or by excluding customers who purchased far more or far fewer products than others.

Thus, recommending products based on all of these factors remains a major challenge for computer science professionals.


5 Future Enhancement

We plan to enhance the project into a next-level recommendation system that includes personalized attributes such as the location of the customer and the time of the purchase.

In the current scenario, we consider location a major attribute that can help increase the accuracy of the algorithm. The dataset for a recommendation system is always huge for a large enterprise, so initial filtering based on location is very important to reduce the N x N matrix, where N is the number of products. Once we can successfully relate the time of purchase to particular products, we can predict recurring purchases.

Each year, based on customers’ day-to-day purchase history, we can analyze and predict purchases for similar customers. Many customers purchase household products online, so recommendations from the most popular categories may make our system more accurate.

However, any single e-commerce system must implement multiple algorithms, which fall into two major categories:

1. Online Recommendation

2. Offline Recommendation


6 Reference

1. Python documentation. https://docs.python.org/devguide/

2. Amazon data set. https://snap.stanford.edu/data/amazon-meta.html

3. Greg Linden, Brent Smith, and Jeremy York, “Amazon.com Recommendations: Item-to-Item Collaborative Filtering”, Amazon.com, 2003.

4. Lei Deng, Xi’an, Jerry Gao, and Chandrasekar Vuppalapati, “Building a Big Data Analytics Service Framework for Mobile Advertising and Marketing”.

5. Jianfeng Hu and Bo Zhang, “Product Recommendation System”, Stanford, 2012. http://snap.stanford.edu/class/cs224w-2012/projects/cs224w-044-final.v01.pdf