Web_Usage_Mining.ppt

Web Usage Mining A case study of the GoMercer.com website

Martin Zhao

Mar 16, 2007

Topics

• What is data mining?

• The data mining process

• Web usage mining: basic concepts

• The robust fuzzy relational clustering algorithm

• An application to the GoMercer.com web logs

• Q & A

What is Data Mining? – definition

• A concise definition
  • Finding hidden information from large datasets

• A slightly longer version
  • Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules

• Differences from accessing info in a database
  • The query is not well formed or precisely stated
  • The data needs to be pre-processed before mining
  • The output is new knowledge, which may not be a subset of the database

What is Data Mining? – a historical perspective

• Data mining is a relatively new field of study.

• The 1st International Conference on Knowledge Discovery and Data Mining (KDD) was held in 1995

• But its roots can be traced back to five areas:
  • Statistics: Bayes theorem (1700s), regression (1900s), classification (1960s), k-means clustering (1970s)
  • Artificial Intelligence: neural networks (1940s), genetic algorithms (1970s), decision tree algorithms (1980s)
  • Algorithms
  • Information Retrieval: similarity measures (1960s), clustering (1960s), SMART IR systems (1970s)
  • Databases: batch reports (1960s), relational data models (1970s), data warehousing & OLAP (1990s)

Why Data Mining?

• The growth of data is the most important factor propelling the growth of data mining
  • In 2003, Wal-Mart captured 20 million transactions per day in a 10-terabyte database (1 TB = 10^6 MB)
  • In 1950, the largest companies had only several dozen megabytes
  • The total amount of data produced in 2002 was estimated at 5 exabytes (1 EB = 10^6 TB)
  • 40% of this was produced in the US

• When we have more data, we expect more sophisticated information from it

Business Intelligence – from data to knowledge

• Data: factual information; may be incomplete; stored in huge amounts

• Information: relevant data; well formatted; for a targeted audience

• Knowledge: models, patterns, and rules; can be used in prediction

• Intelligence: using knowledge in decision making

Basic Data Mining Tasks

• Classification (map data into predefined groups)

• Regression (map a data item to a real-valued prediction variable)

• Prediction (similar to classification, but deals with a future state)

• Clustering (similar to classification, but the groups are defined by the data)

• Association rules (identify associations among data)

• Sequence discovery (determine sequential patterns in data)

The Data Mining Process – the steps

• Develop an understanding of the purpose

• Obtain the dataset to be used

• Explore, clean, and preprocess the data

• Reduce the data, if necessary

• Determine the data mining tasks

• Choose the data mining techniques to be used

• Use algorithms to perform the task

• Interpret the results

• Deploy the model

Phases in the DM Process – CRISP-DM

Web Data Mining

• Web mining: the use of data mining techniques to automatically discover and extract useful and novel information from web docs and services

• Web mining can be categorized as
  • Content mining: extracts models from web content, such as text, images, video, and semi-structured (HTML or XML) or structured documents (digital libraries)
  • Structure mining: aims at finding the underlying topology and organization of web resources
  • Usage mining: discovers usage patterns from web server log files, user queries, and registration data

User Clustering and Profiling – goals

• Major application areas for web usage mining
  • Personalization
  • System improvement
  • Site modification
  • Business intelligence
  • Usage characterization

User Clustering and Profiling – process

• Data cleaning
  • omitting entries about individual objects on a page (such as .gif or .jpg image files)

• (User and) session identification
  • including identifying distinct pages, IPs, and agents
  • a session is a sequence of page views accessed through a certain IP using a certain agent within a certain amount of time (set as 45 minutes)

• Clustering and profiling
  • Define similarity between page views
  • Categorize user sessions into clusters based on similarity of the pages visited

Web Log File Entries

• Web log files keep track of the following data
  • Date and time (e.g., 2006-10-01@00:01:01)
  • Client IP address (e.g., 70.168.242.49)
  • Server IP address (e.g., 192.168.1.52 or www.GoMercer.com)
  • URI stem (web page or a specific file requested, e.g., /choose-mercer/apply-online.aspx)
  • User agent (browser used by the user, e.g., Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322))
  • Referrer (the previous page visited)
  • Cookie
  • Etc.
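The fields listed above can be pulled out of a log line with a short parser. The following is a minimal sketch, not the parser used in the study: it assumes the logs follow the W3C extended log format (a `#Fields:` directive followed by space-separated values), and the sample line is assembled from the example values on this slide. The function name `parse_w3c_log` is hypothetical.

```python
from datetime import datetime

def parse_w3c_log(lines):
    """Yield one dict per log entry, keyed by the names in the #Fields: directive.

    Assumption: W3C extended log format, with 'date' and 'time' among the fields.
    """
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line; only #Fields: matters for parsing.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed entries
        entry = dict(zip(fields, values))
        entry["timestamp"] = datetime.strptime(
            entry["date"] + " " + entry["time"], "%Y-%m-%d %H:%M:%S"
        )
        yield entry

# Hypothetical sample built from the example values on the slide.
SAMPLE_LOG = [
    "#Fields: date time c-ip s-ip cs-uri-stem cs(User-Agent)",
    "2006-10-01 00:01:01 70.168.242.49 192.168.1.52 "
    "/choose-mercer/apply-online.aspx "
    "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)",
]

entries = list(parse_w3c_log(SAMPLE_LOG))
print(entries[0]["c-ip"], entries[0]["cs-uri-stem"], entries[0]["timestamp"])
```

Reading field names from the directive, rather than hard-coding column positions, keeps the sketch usable if the server logs a different set of fields.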

Data Model

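[Figure: UML-style data model relating UserSession, Web Page, WebBrowser, IPAddress, and UserCluster; multiplicity labels include 1, *, and 5..*, and the session constraint reads "Within 45 minutes"]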

Session Identification

1. Use original web server log files as input

2. Parse log entries to omit individual objects (such as images), and
   a. Keep track of unique client IPs, URIs of interest, and user agents
   b. Keep track of date/time and identifiers for IP, URI, and agent for each entry of interest

3. For each entry of interest
   a. add the URI to an existing session with the same {IP, agent} identifiers and within 45 minutes
   b. create a new session with the URI

4. Persist the session information to a file (or DB)
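The steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code used for GoMercer.com: entries are assumed to arrive sorted by time as `(timestamp, ip, agent, uri)` tuples, the 45-minute window is measured from the first page view of a session (the slides do not say whether it runs from the first or the last request), and the names `identify_sessions`, `SESSION_WINDOW`, and `SKIPPED_EXTENSIONS` are hypothetical. Persisting the result (step 4) is left out.

```python
from datetime import datetime, timedelta

SESSION_WINDOW = timedelta(minutes=45)   # window from the slides
SKIPPED_EXTENSIONS = (".gif", ".jpg")    # individual objects to omit

def identify_sessions(entries):
    """Group (timestamp, ip, agent, uri) entries, sorted by time, into sessions."""
    sessions = []
    latest = {}  # (ip, agent) -> index of that pair's most recent session
    for timestamp, ip, agent, uri in entries:
        if uri.lower().endswith(SKIPPED_EXTENSIONS):
            continue  # step 2: drop images and similar objects
        key = (ip, agent)
        idx = latest.get(key)
        if idx is not None and timestamp - sessions[idx]["start"] <= SESSION_WINDOW:
            sessions[idx]["uris"].append(uri)  # step 3a: extend the session
        else:
            latest[key] = len(sessions)        # step 3b: start a new session
            sessions.append(
                {"ip": ip, "agent": agent, "start": timestamp, "uris": [uri]}
            )
    return sessions

# Hypothetical entries for illustration.
t0 = datetime(2006, 10, 1, 0, 1, 1)
entries = [
    (t0, "70.168.242.49", "Mozilla/4.0", "/default.aspx"),
    (t0 + timedelta(minutes=2), "70.168.242.49", "Mozilla/4.0", "/images/logo.gif"),
    (t0 + timedelta(minutes=5), "70.168.242.49", "Mozilla/4.0",
     "/choose-mercer/apply-online.aspx"),
    (t0 + timedelta(minutes=10), "10.0.0.7", "Mozilla/4.0", "/default.aspx"),
    (t0 + timedelta(minutes=50), "70.168.242.49", "Mozilla/4.0",
     "/mercer-411/contact.aspx"),
]

sessions = identify_sessions(entries)
for s in sessions:
    print(s["ip"], s["start"], s["uris"])
```

Here the request made 50 minutes after the first one falls outside the window, so it opens a second session for the same {IP, agent} pair.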

Sample Session Information

Clustering – a one-dimensional example

• Let’s try to group this set of test scores into letter grades

• Classification: map data into pre-defined groups

• Clustering: just specify the number of groups; the groups themselves are defined by the data

• Maximize the inter-cluster distance and minimize the intra-cluster distance

[Figure: test scores plotted on an axis from 50 to 95 (counts 0 to 2); inter-cluster distances (gap used here): 8, 6, 6; intra-cluster distances: 3, 4, 2.13, 3.33]
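One simple way to see the idea is to cut a sorted list of scores at its largest gaps. The sketch below is only an illustration of gap-based grouping in one dimension: the scores are hypothetical (the actual scores from the slide are not recoverable), and `cluster_by_gaps` is a made-up helper name.

```python
def cluster_by_gaps(values, k):
    """Split 1-D values into k groups by cutting at the k-1 largest gaps."""
    ordered = sorted(values)
    # (gap size, index of the left neighbor) for each adjacent pair
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    cut_after = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])
    clusters, start = [], 0
    for i in cut_after:
        clusters.append(ordered[start:i + 1])
        start = i + 1
    clusters.append(ordered[start:])
    return clusters

# Hypothetical test scores, to be grouped into four letter grades.
scores = [52, 55, 57, 65, 67, 70, 71, 77, 79, 82, 88, 90, 93, 95]
clusters = cluster_by_gaps(scores, 4)
print(clusters)
```

Only the number of groups is specified; where the grade boundaries fall is decided by the data, which is the difference from classification noted above.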

Page and Session (Dis-)Similarity

• The “syntactic” similarity between (the URLs of) the ith and jth pages is defined as the smaller of 1 and the ratio of the overlap of the two to the larger of the two lengths:
  $S_u(i, j) = \min\left(1, \frac{|p_i \cap p_j|}{\max(1, \max(|p_i|, |p_j|))}\right)$

• For instance, the similarity score for /mercer-411/contact.aspx and /mercer-411/ask-a-student.aspx is 1/2, whereas the score for /mercer-411/contact.aspx and assets/flash/location.xml is 0

• Dissimilarity is defined as $(1 - S_u(i, j))^2$

• Dissimilarity between two clusters is then calculated by summing up pair-wise dissimilarity scores
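The page similarity and dissimilarity defined above can be checked with a short sketch. It assumes that the length of a URL is its number of path segments and that the overlap is the number of leading segments two URLs share; that reading reproduces the 1/2 and 0 scores quoted in the example. The function names are hypothetical.

```python
def path_segments(url):
    """Split a URL path into its non-empty segments."""
    return [segment for segment in url.split("/") if segment]

def url_similarity(url_i, url_j):
    """min(1, overlap / max(1, longer length)), with lengths counted in segments."""
    p_i, p_j = path_segments(url_i), path_segments(url_j)
    overlap = 0
    for a, b in zip(p_i, p_j):
        if a != b:
            break
        overlap += 1  # count shared leading segments only
    # max(1, ...) avoids dividing by zero when both paths are empty
    return min(1.0, overlap / max(1, max(len(p_i), len(p_j))))

def url_dissimilarity(url_i, url_j):
    """Squared complement of the similarity."""
    return (1.0 - url_similarity(url_i, url_j)) ** 2

print(url_similarity("/mercer-411/contact.aspx", "/mercer-411/ask-a-student.aspx"))
print(url_similarity("/mercer-411/contact.aspx", "assets/flash/location.xml"))
print(url_dissimilarity("/mercer-411/contact.aspx", "/mercer-411/ask-a-student.aspx"))
```

Squaring the complement keeps pages in the same directory fairly close (0.25 here) while unrelated pages sit at the maximum dissimilarity of 1.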

Page Similarity – an example

• For instance, the similarity score for /mercer-411/contact.aspx and /mercer-411/ask-a-student.aspx is 1/2, whereas the score for /mercer-411/contact.aspx and assets/flash/location.xml is 0

/
  /mercer-411
    /contact.aspx
    /ask-a-student.aspx
    …
  /assets
    /flash
      /location.xml
      /CLA_1.flv
      …
  …

Medoid and Membership

• Each cluster is represented by a medoid, which is a centrally located session in the cluster

• The affiliation of a session to a cluster is represented as a membership score, or the similarity to the corresponding medoid
  • A session is not considered to exclusively belong to a single cluster
  • The affiliation is determined by the highest membership score in a given iteration

Relational Clustering Algorithm

1. Use identified sessions as input

2. Specify number of clusters, C, and maximum number of iterations, M, to be used

3. Choose an initial medoid for each cluster i in [1, C]

4. Compute membership uij for each session j in [1, N] with regard to each cluster i (using the similarity measure)

5. Store the old medoids

6. Compute the new medoids to minimize overall intra-cluster distances

7. Repeat steps 4 through 6 until the medoids do not change or the maximum number of iterations M is reached
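The loop in steps 3 through 7 can be sketched over a precomputed session-to-session dissimilarity matrix. This is a simplified illustration, not the robust variant named in the talk: the membership formula is the usual fuzzy c-medoids form with fuzzifier `m`, which the slides do not spell out, and the matrix `D`, the toy positions, and all function names are assumptions made for the example.

```python
def memberships(D, medoids, m=2.0):
    """u[i][j]: membership of session j in cluster i (fuzzy c-medoids form)."""
    n = len(D)
    u = [[0.0] * n for _ in medoids]
    for j in range(n):
        at_medoid = [i for i, v in enumerate(medoids) if D[v][j] == 0]
        if at_medoid:
            u[at_medoid[0]][j] = 1.0  # a medoid fully belongs to its own cluster
            continue
        weights = [(1.0 / D[v][j]) ** (1.0 / (m - 1.0)) for v in medoids]
        total = sum(weights)
        for i, w in enumerate(weights):
            u[i][j] = w / total
    return u

def update_medoids(D, u, m=2.0):
    """For each cluster, pick the session minimizing the weighted distance sum."""
    n = len(D)
    new_medoids = []
    for row in u:
        costs = [sum((row[j] ** m) * D[k][j] for j in range(n)) for k in range(n)]
        new_medoids.append(min(range(n), key=costs.__getitem__))
    return new_medoids

def relational_clustering(D, initial_medoids, max_iter=20, m=2.0):
    n = len(D)
    medoids = list(initial_medoids)            # step 3
    for _ in range(max_iter):
        u = memberships(D, medoids, m)         # step 4
        old = medoids                          # step 5
        medoids = update_medoids(D, u, m)      # step 6
        if medoids == old:                     # step 7
            break
    u = memberships(D, medoids, m)
    # Affiliation: the cluster with the highest membership score.
    labels = [max(range(len(medoids)), key=lambda i: u[i][j]) for j in range(n)]
    return medoids, labels

# Toy dissimilarities: six hypothetical sessions placed on a line.
positions = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in positions] for a in positions]
medoids, labels = relational_clustering(D, initial_medoids=[0, 5])
print(medoids, labels)
```

Because the algorithm only ever reads `D`, it works with any pairwise dissimilarity, including one built from URL similarities between sessions; no vector representation of a session is needed.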

Application to GoMercer.com

Meeting w/ Rob Saxon

Obtain & read web log files

Preliminary study using CSC data

Parsing data for sessions

Clustering w/ FCMdd

Data analysis & visualization

Ongoing

Results – summary of log files

• 148 files (one per day from 09/29/06 to 02/23/07), totaling about 2.5 GB

• File sizes for Oct 2006 and Feb 2007 as shown

• Session counts in the same periods present similar patterns

[Figures: daily charts for Oct 1-31, 2006 (1-sun through 31-tue) and Feb 1-23, 2007 (07-2-1-thu through 07-2-23-fri); y-axis ranges 0 to 60000 and 0 to 600]

Results – frequencies by URI type

• User client programs (or browsers used)

• Main page

• ASP scripts
  • Breakdown for /accepted, /choose-mercer, and /mercer-411

• Flash videos
  • Individual videos
  • Combined by topic

[Figures: frequency charts by URI type; y-axis ranges in the source: 1 to 10000 (log scale), 0 to 1400, 0 to 1800, 0 to 3500, and 0 to 6000]

[Flash videos, individual: Academic+Overview_1, Residence+Life+1, Athletics+1, Campus+Life+1, CLA+1, Religious_Life_1, Greek+Life+1, Education+1, Engineering+1, Business+1, mercer-admissions-, Rec+and+Activites+1, Residence+Life+2, Music+1, Greek+Life+2, Campus+Life+2, CLA+2, Athletics+2, Engineering+2, Rec+and+Activites_1, Religious_Life_2, Business+2, Music+2, Education+2]

[Flash videos, combined by topic: Residence_Life, Academic_Overview, Campus_Life, CLA, Athletics, Greek_Life, Religious_Life, Engineering, Education, Business, Rec_and_Activites, mercer-admissions, Music]

[ASP scripts, grouped by /accepted, /choose-mercer, and /mercer-411: /accepted/enrollment-checklist-spring.aspx, /accepted/financial-aid-information.aspx, /accepted/new-student-housing.aspx, /choose-mercer/apply-online.aspx, /choose-mercer/checklist-for-, /choose-mercer/financial-, /choose-mercer/international-, /choose-mercer/transfer-, /default.aspx, /mercer-recruitment-video.aspx, /mercer-411/all-degrees.aspx, /mercer-411/ask-a-student.aspx, /mercer-411/directions.aspx, /mercer-411/more-mercer.aspx]

[Site sections: /, /why-mercer, /mercer-life, /choose-mercer, /accepted, /mercer-411]

Results – user cluster and profiles

279 128 156 278 267 145 305 399 190 320 268 279 158 251 225

162 147 166 263 112 150 345 206 233 281 291 151 186 229

Questions and Discussions

References

• Data mining for business intelligence, by Shmueli et al., Wiley-Interscience, 2007

• Data mining, by Dunham, Prentice Hall, 2003

• Web mining: applications and techniques, Scime (ed.), Idea Group, 2005

• What is data mining? by Squier (www.dama-ncr.org/Library/2001.11.14-Laura%20Squier.ppt)

• Automatic web user profiling and personalization using robust fuzzy relational clustering, by Nasraoui et al., 1999

• Web usage mining: discovery and application of interesting patterns from web data, by Cooley, PhD thesis, Univ. of Minnesota, 2000
