Web Personalization Based on Static Information and Dynamic User Behavior

Center for E-Business Technology, Seoul National University

Seoul, Korea

Nam, Kwang Hyun

Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

Massimiliano Albanese, Antonio Picariello, Carlo and Lucio Sansone

Department of Information and Communication Systems, University of Naples (Napoli) “Federico II”, Naples, Italy

WIDM '04


Copyright 2008 by CEBT

Contents

Introduction

Various personalization schemes

System architecture

The web usage mining strategy

Pattern analysis, clustering, and classification

The re-classification algorithm

Web personalization

Experimental results

Conclusions

Evaluation

IDS Lab Seminar (User Log Analysis) - 2


Introduction

Three types of data have to be managed on a web site

Content - whatever is in a web page

Structure - organization of the content

Log (usage) data - usage patterns of web sites

The application of data mining techniques depends on data types

Web content mining, web structure mining, web usage mining

In this paper, web usage mining is considered.

Web personalization is described as the process of customizing the content and the structure of web sites in order to provide users with information


Various personalization schemes

Yan et al

A methodology for the automatic classification of web users according to their access patterns, using cluster analysis on the web logs

Perkowitz and Etzioni

First to describe adaptive web sites as sites that semi-automatically improve their organization by learning from visitor access patterns; the algorithm is based on a clustering methodology

Srivastava et al

A prototype system (WebSIFT) which performs intelligent cleansing and preprocessing for identifying users


Grandi

Time-related issue

Introduces an exhaustive annotated bibliography on temporal and evolution aspects in the World Wide Web

The novelty of this paper’s approach

Proposes a clustering process made up of two phases

1st phase

– Pattern analysis and classification are performed by means of an unsupervised clustering algorithm, using the registration information

2nd phase

– To overcome the inaccuracy of the registration information, re-classification is repeated iteratively until a suitable convergence is reached


System architecture


User profiling

The process of gathering information specific to each visitor

Log analysis / web usage mining

The process of analyzing the information stored in web server logs

– To extract statistical information and discover interesting usage patterns

– To cluster the users into groups according to their behavior

– To discover potential correlations between web pages and user groups

Content management

The process of classifying the content of a web site into semantic categories to make information retrieval and presentation easier

Web site publishing

A publishing mechanism that is used to present the content stored in the web server


The web usage mining strategy

1st phase

Users are assigned to a tentative class by means of an unsupervised clustering algorithm.

Uses the static information provided by the users

Determines the number of classes the users can belong to

2nd phase

A novel re-classification algorithm is iteratively repeated until a suitable convergence in attributing each user to a class is reached

Re-classification is used to overcome the inaccuracy of the static (registration) information.

Accomplished by the log analysis and content management modules

– On the basis of the dynamic navigational behavior of the users


Pattern analysis, clustering, and classification

The task of the user profiling module is to associate each user with the class that best describes her behavior

To accomplish this, users have to be mapped into a feature space.

Issue - the definition of the classes the users can belong to

An unsupervised clustering procedure can be used for partitioning the feature space into a certain number of clusters

In order to choose the optimal number of clusters, a possible approach is to optimize a cluster validity index, such as the C index


C index

$C = \frac{S - S_{\min}}{S_{\max} - S_{\min}}$, with $C \in [0,1]$

$S$ : the sum of distances over all pairs of patterns from the same cluster

$l$ : the number of those pairs

$S_{\min}$ : the sum of the $l$ smallest distances out of all pairs of patterns

$S_{\max}$ : the sum of the $l$ largest distances out of all pairs of patterns

A small value of the C index indicates a good clustering.

– ∵ Pairs of patterns with a small distance are in the same cluster.

The denominator is used for normalization.

Also used for comparing different clustering techniques

1) AutoClass C

- A fuzzy clustering algorithm based on Bayesian theory

2) The Rival Penalized Competitive Learning (RPCL) algorithm

- Used for training a competitive neural network
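As a rough illustration of the definitions on this slide, the C index can be computed as (S − S_min) / (S_max − S_min). The sketch below is not taken from the paper: the function name `c_index`, the use of Euclidean distance, and the toy points are assumptions made only for the example.

```python
from itertools import combinations
import math


def c_index(points, labels):
    """C index of a clustering; lower values indicate a better clustering.

    Assumption: distances are Euclidean (the slides do not name a metric).
    """
    pairs = list(combinations(range(len(points)), 2))
    dists = [math.dist(points[i], points[j]) for i, j in pairs]

    # S: sum of distances over all pairs that share a cluster; l: number of such pairs.
    within = [d for (i, j), d in zip(pairs, dists) if labels[i] == labels[j]]
    l = len(within)
    if l == 0:
        return 0.0
    s = sum(within)

    ordered = sorted(dists)
    s_min = sum(ordered[:l])   # sum of the l smallest distances over all pairs
    s_max = sum(ordered[-l:])  # sum of the l largest distances over all pairs
    if s_max == s_min:
        return 0.0             # degenerate case: all distances are equal
    return (s - s_min) / (s_max - s_min)


# Two well-separated groups of two points each (toy data).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = c_index(points, [0, 0, 1, 1])  # clusters follow the groups
bad = c_index(points, [0, 1, 0, 1])   # clusters cut across the groups
```

With the toy data, the labelling that follows the two groups gets the smallest possible value, while the labelling that cuts across them gets a value close to 1. This is the sense in which the index can rank the output of two clustering techniques on the same data.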


The re-classification algorithm

Basic idea

1) Classify the web site resources based on the type of users that have accessed them

2) Re-classify the users based on both the class (previously assigned) and the resources (accessed since the last re-classification)

Content management module

Selects all the significant words appearing in the different resources of the web site.

Associates each of them with a specific content category.


Log analysis module

Registers all the activities of the users.

In order to use this information for re-classifying users, we need to attribute each content category to a specific user class

The process of attributing each category to a class can be accomplished by

– Considering the first classification performed by the chosen clustering algorithm

– Counting, for each resource type, the number of times the users of a given class ask for something belonging to a specific content category

Each content category can be attributed to a class with a probability that is proportional to the number of user requests.

The whole process converges if, after a suitable number of re-classifications, the number of re-classified users tends to zero.
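The loop described on these slides can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm. Each content category is attributed to the classes with probabilities proportional to request counts, as the slide says. The scoring rule (sum of those probabilities over a user's requests, plus a `prior_weight` for the previously assigned class), the function names, and the toy log are hypothetical.

```python
from collections import Counter, defaultdict


def category_class_probs(user_class, requests):
    """P(class | category), proportional to the requests made by users of each class."""
    counts = defaultdict(Counter)
    for user, category in requests:
        counts[category][user_class[user]] += 1
    return {
        cat: {cls: n / sum(c.values()) for cls, n in c.items()}
        for cat, c in counts.items()
    }


def reclassify_once(user_class, requests, prior_weight=1.0):
    """One re-classification pass (assumed scoring rule, see lead-in)."""
    probs = category_class_probs(user_class, requests)
    # Start every user with a bonus for the class assigned previously.
    scores = {u: Counter({c: prior_weight}) for u, c in user_class.items()}
    for user, category in requests:
        for cls, p in probs[category].items():
            scores[user][cls] += p
    new_class = {}
    for user, sc in scores.items():
        best = max(sc.values())
        old = user_class[user]
        # Keep the previous class on ties so users do not oscillate.
        new_class[user] = old if sc[old] == best else max(sc, key=sc.get)
    return new_class


def reclassify_until_stable(user_class, requests, max_rounds=20):
    """Repeat until no user changes class; returns the classes and the K history."""
    history = []
    for _ in range(max_rounds):
        new_class = reclassify_once(user_class, requests)
        changed = sum(new_class[u] != user_class[u] for u in user_class)
        history.append(changed / len(user_class))  # fraction of re-classified users
        user_class = new_class
        if changed == 0:
            break
    return user_class, history


# Toy log: user "x" registered as class 0 but browses like the class-1 users.
initial = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1, "x": 0}
log = [(u, "sport") for u in "abc" for _ in range(3)]
log += [(u, "music") for u in "defx" for _ in range(3)]
final, history = reclassify_until_stable(initial, log)
```

In this toy run, user "x" moves to class 1 in the first pass and nobody moves in the second, so the fraction of re-classified users drops to zero and the loop stops.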


Web personalization

Given the class of a user, the web site publishing module shows the contents related to the categories attributed to that class on her personalized home page when she logs in to the web site.

Example

All the news and stories containing keywords related to those categories are presented, as well as the results of the latest queries involving the same keywords.
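A minimal sketch of the selection step in the example above, assuming each class has a list of attributed categories and each category a list of keywords. The function name `personalized_items`, the data layout, and the sample items are hypothetical and only illustrate the idea of keyword-based filtering.

```python
def personalized_items(user_class, class_categories, category_keywords, items):
    """Return the items whose text contains a keyword of the user's class categories."""
    keywords = set()
    for category in class_categories.get(user_class, []):
        keywords.update(k.lower() for k in category_keywords.get(category, []))
    # Simple whitespace tokenization; a real site would use proper text indexing.
    return [text for text in items if keywords & set(text.lower().split())]


# Hypothetical categories, keywords, and news items.
class_categories = {"A": ["music"], "B": ["sport"]}
category_keywords = {"music": ["concert", "jazz"], "sport": ["football"]}
news = [
    "Jazz night in Napoli",
    "Football derby on Sunday",
    "Open-air concert tonight",
]
home_page_a = personalized_items("A", class_categories, category_keywords, news)
```

Here a class "A" user would see the two music-related items and not the football one.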


Experimental results

Uses SOAP (a lightweight, XML-based protocol for the exchange of information in a decentralized, distributed environment)

A commercial web site, ‘pariare.com’, which provides information about entertainment in Napoli.

2682 users who had already registered on the web site.

The features used to initially classify the users are:

– Age

– Sex

– Category of place users prefer to go to

– Number of times per week users go out

– Preferred day of the week to go out

– The Pariapoli parameter (a measure of the degree of interest towards the virtual community of Pariare)

– Type of entertainment users are looking for
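Before clustering, each registered user has to be mapped into a feature space. The sketch below shows one possible encoding of the seven registration features as a numeric vector. The value vocabularies, the scaling choices, and all names (`Registration`, `to_feature_vector`, `PLACES`, and so on) are assumptions, since the slides do not list the actual values used on the site.

```python
from dataclasses import dataclass

# Hypothetical vocabularies; the slides do not give the real ones.
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
PLACES = ["pub", "disco", "cinema", "restaurant"]
ENTERTAINMENT = ["music", "theatre", "sport"]


@dataclass
class Registration:
    age: int
    sex: str               # "f" or "m"
    place: str             # preferred category of place
    outings_per_week: int
    preferred_day: str
    pariapoli: float       # assumed to be already scaled to [0, 1]
    entertainment: str


def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 vector over its vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def to_feature_vector(reg):
    """Map a registration record to a numeric vector (assumed scaling)."""
    return (
        [
            min(reg.age, 100) / 100.0,
            1.0 if reg.sex == "f" else 0.0,
            min(reg.outings_per_week, 7) / 7.0,
            reg.pariapoli,
        ]
        + one_hot(reg.place, PLACES)
        + one_hot(reg.preferred_day, DAYS)
        + one_hot(reg.entertainment, ENTERTAINMENT)
    )


vector = to_feature_vector(Registration(25, "f", "disco", 7, "sat", 0.5, "music"))
```

One-hot encoding keeps categorical features such as the preferred day from being treated as ordered numbers, which matters for distance-based clustering.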


The optimal number of clusters (figure not reproduced)


K : the overall percentage of re-classified users

(i,j) : the percentage of i-class users who have been re-classified as j-class

(i,i) : the percentage of i-class users who have not been re-classified

K = 3.14%

K = 0.62%

K = 0.16%
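The quantities K and (i,j) defined on this slide can be computed from the class labels before and after one re-classification step. The sketch below only illustrates those definitions; the function name `reclassification_table` and the toy labels are assumptions, not data from the experiment.

```python
from collections import Counter


def reclassification_table(before, after):
    """Return ((i, j) percentages as nested dicts, K) for one re-classification step."""
    classes = sorted(set(before) | set(after))
    totals = Counter(before)            # number of users per original class
    moves = Counter(zip(before, after)) # number of users going from class i to class j
    table = {
        i: {
            j: 100.0 * moves[(i, j)] / totals[i] if totals[i] else 0.0
            for j in classes
        }
        for i in classes
    }
    # K: overall percentage of users whose class changed.
    k = 100.0 * sum(b != a for b, a in zip(before, after)) / len(before)
    return table, k


# Toy labels: one of four class-0 users moves to class 1.
before = [0, 0, 0, 0, 1, 1, 1, 1]
after = [0, 0, 0, 1, 1, 1, 1, 1]
table, k = reclassification_table(before, after)
```

Each row of the table sums to 100, and the diagonal entries (i,i) give the share of users who kept their class.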


The convergence is even faster

∵ The value of the C index is lower for the RPCL clustering than for the AutoClass C clustering.

K = 1.78%

K = 0.9%

K = 0%


AutoClass C and RPCL (figures not reproduced)


Conclusions

Introduces a novel web usage mining strategy for web personalization

The novelty of this strategy lies in the fact that web site users are clustered through a two-phase process that takes into account both static information and dynamic behavior

The experimental results have confirmed the effectiveness of the suggested approach.

It leads to a stable classification regardless of the initial classification.


Evaluation

Pros

Interesting subject

Introduces various personalization schemes

Offers an architecture and modules

Shows detailed experiments and results

Cons

Few examples

– Most of the content is just explanation.

Many sentences are too long.

– This makes the content hard to read and understand.
