15
Benefits of InterSite Pre- Processing and Clustering Methods in E-Commerce Domain Sergiu Chelcea, Alzennyr Da Silva, Yves Lechevallier, Doru Tanasa, Brigitte Trousse AxIS Research Team INRIA Sophia Antipolis and Rocquencourt

Motivations

Embed Size (px)

DESCRIPTION

Benefits of InterSite Pre-Processing and Clustering Methods in E-Commerce Domain Sergiu Chelcea, Alzennyr Da Silva, Yves Lechevallier, Doru Tanasa, Brigitte Trousse AxIS Research Team INRIA Sophia Antipolis and Rocquencourt. Motivations. To show on the clickstream dataset proposed - PowerPoint PPT Presentation

Citation preview

Benefits of InterSite Pre-Processing and Clustering Methods in E-Commerce Domain

Sergiu Chelcea, Alzennyr Da Silva, Yves Lechevallier, Doru Tanasa, Brigitte Trousse

AxIS Research TeamINRIA Sophia Antipolis and Rocquencourt

Motivations

To show on the clickstream dataset proposed for ECML/PKDD 2005 Discovery challenge

the benefits of our InterSite pre-processing method proposed by Tanasa in his PhD Thesis (2005)

And

the benefits of a new crossed clustering method developed by Lechevallier&Verde and published in (2003, 2004) on Web logs

2 main viewpoints: User and web site charge

Plan 1. Intersite Data Pre-Processing - introduction of user’s intersite visit

« Group of SessionIDs » - first statistical Intersite analysis

2. Crossed Clustering Approach - confusion table with classes of time periods and classes of product types - analysis on the most used shop: shop 4

3. Conclusions

Table 1. Format of page requests

ShopID Date IP address SessionID Page Referrer

11 1074585663 213.151.91.186 939dad92c4…84208dca /

11 1074585670 213.151.91.186 87ee02ddcff…7655bb9e /ct/?c=148 http://www.shop2.cz

Table 2. Number of requests per shop

ShopID Site name (shop) #Requests

10 www.shop1.cz 509,688

11 www.shop2.cz 400,045

12 www.shop3.cz 645,724

14 www.shop4.cz 1,290,870

15 www.shop5.cz 308,367

16 www.shop6.cz 298,030

17 www.shop7.cz 164,447

Data pre-processing

Initial data:

Data pre-processing

Tanasa & Trousse (IEEE Intelligent Systems 2004)Tanasa ‘s Thesis (2005)

Table 3. Transformed log lines

Datetime IP SessionID URL Referrer

2004-01-20 09:01:03 213.151.91.186 939dad92c4…84208dca http://www.shop2.cz/ -

2004-01-20 09:01:10 213.151.91.186 87ee02ddcff…7655bb9e http://www.shop2.cz/ct/?c=148 http://www.shop2.cz/

Data pre-processing

• Data Structuration SessionID a single visit on each shop Towards the notion of user’s intersite visit: we group such SessionIDs that belongs to a single user (same IP) into a « Group of SessionIDs ». We compare the Referer with the URLs previously accessed (in a reasonable time window)

522, ,410 SessionIDs into 397,629 Groups, equivalent to a 23.88% reduction;

• Data fusion, data cleaning

Relational DB modelData summarisation

0

1000

2000

3000

4000

5000

6000

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

Hour

Vis

its

Monday

Tuesday

Wednesday

Thursday

Friday

Saturday

Sunday

0

50

100

150

200

250

300

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

HourG

roup

s

Fig. 1. Visits per days and hours: (a) globally, (b) multi-shop

Data pre-processing

• Low number of new visits on Saturdays and Sundays during the lunch time• The high number of new visits on Tuesdays and Wednesdays• Same results a) and b)

Crossed Clustering Aproach for Time Periods/Product Analysis

Data: Selection of ls pages in shop 4 (the most used)

Method developed by Yves Lechevallier & Rosanna Verde (2003,2004)

0

200 000

400 000

600 000

800 000

1 000 000

1 200 000

1 400 000

10 11 12 14 15 16 17

Shop

Acc

ess

/ct /ls /dt /znacka /akce others

Crossed Clustering Aproach for Time Periods/Product Analysis

Relational BD model : We add easily a crossed table

Line: an individual (weekday, one hour) 7 days X 24 hours = 168 individuals

Column: a multi-categorical variable representing the number of products requested

by users into the specific time slice

Method developed by Yves Lechevallier & Rosanna Verde (2003,2004)

Crossed Clustering Aproach for Time Periods/Product Analysis

Table 4. Quantity of products requested by weekday x hour and registered on shop 4

Weekday x Hour Product (number of requests)

Monday_0Built-in electric hobs (10),Built-in dish washers 60cm (64),Corner single sinks (50), ...

Monday_1

Free standing combi refrigerators (44),Corner single sinks (50), Built-in hoods (60), ...

… …

Sunday_22Built-in microwave ovens (27),Built-in dish washers 45cm (38),Built-in dish washers 60cm (85), ...

Sunday_23Built-in freezers (56),Kitchen taps with shower (45), Garbage disposers (32), ...

Crossed Clustering Aproach for Time Period/Product Analysis

Table 5. Confusion table

Product_1 Product _2 Product _3 Product _4 Product _5 Total

Period_ 1 2847 5084 3284 2265 2471 15951

Period_ 2 11305 31492 12951 1895 9610 67253

Period _3 33107 55652 36699 5345 20370 151173

Period _4 22682 46322 30200 5165 27659 132028

Period _5 9576 20477 19721 2339 7551 59664

Period _6 1783 3515 2549 392 11240 19479

Period _7 15019 14297 8608 1397 6014 45335

Total 96319 176839 114012 18798 84915 490883

57,7%

Crossed Clustering Aproach for Time Period/Product Analysis

Example of one surprising result:

the class Product 5 is defined by one type of products « Free standing combi refrigerators »

consulted predominantly on Fridays from 17:00 to 20:00 (class period 6)

57,7% of such a product type requested on this period

Conclusions

1. Intersite Data Pre-Processing - structuration into user’s intersite visits

« Group of SessionIDs » - first statistical Intersite analysis

- anomalies and recommandations for the dataset

2. Crossed Clustering Approach - first application of such a method on time periods of Web logs

and in e-commerce domain - promising results

Data pre-processing

Inconsistency problems:

- table kategorie: found repeated entries and different entries with same ID

- for some page types (dt, df) the given parameter represented actually a specific product, not the given product description (from products table).

- extra parameters equivalent to the give ones for some page types:i.e. for ct page type, id is equivalent to the given c parameter

- missing values (descriptions) in tables: 3 values in product table and 64 in category table

- multiple site SessionIDs: 13 cross-server visits had same SessionID on the visited sites (up to 4 sites); SessionID should change on each new site;

- multiple IP SessionIDs: 3690 visits (SessionIDs) were done from more than one IP (anonymization proxies ?).