WEB USAGE MINING USING APRIORI ALGORITHM UUM LEARNING CARE

WEB USAGE MINING USING APRIORI ALGORITHM: UUM LEARNING CARE

PORTAL CASEAzizul Azhar bin Ramli

Information System DepartmentFaculty of Information Technology and Multimedia,

Kolej Universiti Teknologi Tun Hussein Onn (KUiTTHO)86400 Parit Raja, Batu Pahat, Johor D/T

Tel: +607-4538000 ext. 8056, Fax: +607-4532119Email: [email protected]

Abstract

The enormous content of information on the World Wide Web makes it obvious

candidate for data mining research. Application of data mining techniques to the World

Wide Web referred as Web mining where this term has been used in three distinct ways;

Web Content Mining, Web Structure Mining and Web Usage Mining. E Learning is one of

the Web based application where it will facing with large amount of data. In order to

produce the university E Learning (UUM Educare) portal usage patterns and user

behaviors, this paper implements the high level process of Web Usage Mining using

basic Association Rules algorithm call Apriori Algorithm. Web Usage Mining consists of

three main phases, namely Data Preprocessing, Pattern Discovering and Pattern

Analysis. Server log files become a set of raw data where it’s must go through with all

the Web Usage Mining phases to producing the final results. Here, Web Usage Mining,

approach has been combining with the basic Association Rules, Apriori Algorithm to

optimize the content of the university E Learning portal. Finally, this paper will present

an overview of results analysis and Web administrator can use the findings for the

suitable valuable actions.

1

KEY WORDS: server log file, data mining, Web mining, Web Usage

Mining, Association Rules, Apriori algorithm.

1.0 PROJECT OVERVIEW

Data mining is a technique used to deduce useful and

relevant information to guide professional decisions and other

scientific research (Chen, Han and Yu, 1996). It is a cost-

effective way of analyzing large amounts of data, especially when

a human could not analyze such datasets.

Massification of the use the internet has made automatic

knowledge extraction from Web log files a necessity. Information

provided are interested in techniques that could learn Web users’

information needs and preferences. This can improve the

effectiveness of their Web sites by adapting the information

structure of the sites to the users’ behavior.

Recently, the advent of data mining techniques for

discovering usage pattern from Web data (Web Usage Mining)

indicates that these techniques can be a viable alternative to

traditional decision making tools (Srivastava et al., 2000). Web

Usage Mining is the process of applying data mining techniques to

the discovery of usage patterns from Web data and is targeted

2

towards applications (Srivastava et al., 2000). Web Usage Mining

mines the secondary data (Web server access logs, browser logs,

user profiles, registration data, user sessions or transactions,

cookies, user queries, mouse clicks and any other data as the

result of interaction with the Web) derived from the interactions

of the users during certain period of Web sessions.

This paper explores the use of Web Usage Mining techniques

to analyze Web log records collected from E-Learning portal (UUM

Educare). Using commercial data Web mining tools (WebLog Expert Lite

3.5 and Sawmill 7) and ARunner 1.0 (prototype of GUI Christian Borgelt

Apriori tool by Shamrie Sainin, FTM, UUM), it have identified

several Web access pattern by applying well known data mining

techniques (Apriori Algorithm) to the access logs of this

educational portal. This includes descriptive statistic and

Association Rules for the portal including support and confidence

to represent the Web usage and user behavior for UUM Educare. The

results and findings for this experimental analysis can be use by

the Web administration and may be upper level in the UUM

community in order to plan the upgrading and enhancement to the

portal presentation.

Objective and Scopes of Project

3

Generally, the main objective of this project is to perform

Web Usage Mining process, specifically:

i. To preprocess UUM Educare server logs files from the

university E-Learning Web servers for determining and

discovering the user access pattern.

ii. To apply the basic Association Rules – Apriori algorithm

for implementation of Web Usage Mining process to

producing usage pattern by determine the most user

interest based on the options that being provided by the

university E-Learning portal.

iii. To analyze the outputs usage patterns and user behaviors

for UUM Educare from the Web Usage Mining implementation

process.

The scopes of this project are:

i. Research Organization: University E-Learning portal (UUM

Educare)

ii. Focus of Project: Extract the server log file from

university E-Learning server on certain of weeks within a

semester, preprocessing the set of raw data, select the

data that contribute for pattern analysis, implement the

pattern mining using basic Association Rule, Apriori

algorithm, in order to produce the final results (rules).

4

2.0 LITERATURE REVIEW

Data Mining, Web Mining and Web Usage Mining

Data mining (DM) is a step from Knowledge Discovery in

Database (KDD) process, which is defined as a “nontrivial process

of identifying valid, novel, potentially useful and ultimately

understandable pattern in data” (Fayyad et al., 1996). The term

pattern here refers some abstract representation of a subset data

of the data, that is, an expression in some language describing a

data subset or a data subset or a model applicable to that

subset.

Data mining efforts associated with the Web, called Web

mining, can be broadly categorized into three areas of interest

based on which part of the Web to mine; Web Content mining, Web

Structure mining, and Web Usage Mining (Kosala and Blockeel,

2000). In Web mining, data can be collected at the server-side,

client-side, proxy servers or a consolidated Web/business

database (Srivastava et al., 2000). The information provided by the

data sources described above can be used to construct several

5

data abstractions, namely users, page-views, click-streams and

server sessions.

Web Usage Mining is defined as the process of applying data

mining techniques to the discovery of usage patterns from Web

logs data which to identify Web user’s behavior (Srivastava et al.,

2000). Web Usage Mining is the type of Web mining activity that

involves an automatic discovery of user access patterns from one

or more Web servers.

As shown in Fig. 1, three main tasks are performed in Web

Usage Mining; Preprocessing, Pattern Discovery and Pattern Analysis. Fig.1

represents a brief description about the main task of Web Usage

Mining process.

6

Association Rules and Apriori Algorithm

The problem of deriving Association Rules from data was

first formulated in (Agrawal, Imielinski and Swami, 1993) and is

called the “market-basket problem”. The problem is that we are

given a set of items and a large collection of transactions which

are sets (baskets) of items. The task is to find relationships

between the containments of various items within those baskets.

Apart from the supermarket scenario there are many other

examples where Association Rules have been used, for example

users’ visits of WWW pages which the structure and its content

can be optimized. Xue et al., (2001) have used re-ranking method

and generalized Association Rules to extract access patterns of

the Web sites pattern usage. Mannila et al., (1999) use page

accesses from a Web server log as events for discovering frequent

episodes. Chen et al., (1996) introduce the concept of using the

maximal forward references in order to break down user sessions into

transactions for the mining of traversal patterns. Batista and

Silva, (2001) perform mining process for online newspaper Web

access logs by using Apriori algorithm.

7

The task in Association Rules mining involves finding all

rules that satisfy user defined constraints on minimum support

and confidence with respect to a given dataset. Most commonly

used Association Rule discovery algorithm that utilizes the

frequent itemset strategy is exemplified by the Apriori algorithm

(Agrawal et al., 1993).

Apriori was the first scalable algorithm designed for

association-rule mining algorithm. Apriori is an improvement over

the AIS and SETM algorithms (Agrawal and Srikant, 1994). The

Apriori algorithm searches for large itemsets during its initial

database pass and uses its result as the seed for discovering

other large datasets during subsequent passes. Rules having a

support level above the minimum are called large or frequent

itemsets and those below are called small itemsets (Chen et al.,

1996). The algorithm is based on the large itemset property which

states: Any subset of a large itemset is large and if an itemset is not large and then

none of its supersets are large (Agrawal and Srikant, 1994).

Web Usage Mining in Educational Field

In a Web-based learning environment, where both the tutors

and learners are separated spatially and physically, student

modeling is one of the biggest challenges. Traditional student

modeling techniques are inapplicable in these systems when tutors

8

are overwhelmed by the huge volumes of sequential data generated

as learners browse through the Web pages (Agrawal and Srikant,

1995). Web mining techniques, including clustering and

Association Rules mining can be applied to extract hidden and

interesting knowledge to facilitate instructional planning and

student diagnosis. Web mining in education is not new. It has

been applied to mine aggregate paths for learners engaged in a

distance education environment (Ha, Bae and Park, 2000); relevant

words to students based on text mining from their browsed

documents (Ochi et al., 1998); e-articles for students based on

key-word-driven text mining (Tang et al., 2000), and to analyze

learners’ learning behaviors (Zaiane and Luo, 2001). The previous

research proposed the beyond usage mining to consider the content

of the pages that have been visited. In the E-learning system,

both learners’ browsing behaviors and course content are

important to derive learners’ learning levels, intentions, goals,

interests or abilities. Incorporating course content can aid in

an understanding of learners’ browsing habits. In particular,

understanding the learners’ browsing behaviors can facilitate,

the course contain personalization.

The existing system called Artificial intelligence in

Education (AIED), employs a knowledge base, a student model and

9

instructional plans. For a Web based AIED system, Web mining

becomes part of student modeling. The system can relate its mined

knowledge of page contents and student navigation patterns to

students’ level of understanding to decide upon appropriate

feedback to them (Tang et al., 2001).

3.0 PROJECT METHODOLOGY AND IMPLIMENTATION

The Web Usage Mining process proposed by (Srivastava et al.,

2000) becomes a major guide line upon project implementation.

Fig.3 shows the general flow of the project methodology.

Server Log File

The server log file dated from 19 February 2004 until 13

March 2004 has been selected for further analysis. The server log

files are retrieved from the UUM Educare server, www.e-

web.uum.edu.my . The total amount of the server log file between

10

http://www.e-web.uum.edu.my/

http://www.e-web.uum.edu.my/

that duration is about 650 MB and the large amount of data

becomes the most challenging problem to handle during the Data

Preprocessing phase. The server log file consists of nine

attributes in the single line of record as shown in Fig 4.

In the data selection phase, log files started on 19

February 2004 until the end of semester (13 March 2004) have been

selected. The selection of the server log files must be done

carefully because of the UUM Educare is part of the GroupWeb

facilities where beside the Educare (My Desktop) as an E Learning,

there are several facilities such as My Portfolio and Resources.

Because of the UUM Educare facilities that provided by the

GroupWeb is mixed with another facilities or options, the server

log file also includes the mix of log file for every transaction

between the facilities in the GroupWeb portal.

Data Preprocessing

Data Preprocessing phase is one of the most challenging

phase in this study. The major task in this phase are includes

11

handling missing values, identifying outliers, smooth out noisy

data and correct inconsistent data (Han and Kamber, 2001). Data

Preprocessing consists of all the actions taken before the actual

Pattern Analysis phase process starts.

The Data Preprocessing phase is being done by using

available software in the market. On early stage of this phase,

Macro tool in Microsoft Access have been selected to assist the

preprocessing tasks and for the following data preprocessing

task, filter tool in Microsoft Excel becomes the selected tool.

The selection of this period is because of the universities

academic calendar shows the selected dates are nearly to the end

of second semester of 2003/2004 session where 13 March 2004 the

last day for final examination. Fig. 5 shows the data after

preprocessing phase is done.

Pattern Analysis

During the Pattern Analysis phase, the descriptive method is

being used analyze the data such as general summary of the Web

12

usage and customer behaviors. This general summary includes the

most active users using the portal either from Malaysia or other

country. If the users came from Malaysia, it’s also shows the

locality of the users either accessing the UUM Educare portal

from the UUM Local Area Network (LAN) or outside of UUM campus.

The analysis also tries to find out the top visitors for

each facility or option that being provided by the UUM Educare

portal. There are several facilities or option that being placed

in UUM Educare portal such as dms, profile, resources, announcement,

assessment, calendar, pnotes, assignment and forum. The dms as one of the

options in UUM Educare can be analyzed to know the most requested

documents in UUM Educare portal. Beside the dms option analysis,

the sever log files also trace the information of documents that

was downloaded.

Pattern Discovery – Association Rules

Given a server log files that represent UUM Educare portal

activities, the main purpose of Association Rules is to generate

all Association Rules that have support and confidence greater

than the user specified minimum support (called min_sup) and

minimum confidence (called min_conf) respectively. An algorithm

for finding all Association Rules, henceforth, referred to as the

Apriori algorithm (Agrawal and Srikant, 1994).

13

The selected of Apriori algorithm is because of the

performance where it able to run the mining process in short

period. Currently, Apriori algorithm is commonly used for

generating the Association Rules for Web Usage Mining and this

experimental study focus on exploratory of Web Usage Mining in

university E-Learning portal (UUM Educare).

Results

As stated above, this study will focus on Web Usage Mining

of UUM Educare portal. The results of this study are divided into

two sections where the first section will discuss about the

general descriptions of the access pattern and users behaviors of

UUM Educare portal (descriptive statistic). Another section will

display the supports and confidences of the different level in

UUM Educare portal. All the results will display using certain

chart for such as pie and bar chart to make it easier understand.

4.0 FINDINGS AND RESULTS

The Web Usage Mining for Universiti Utara Malaysia E

Learning (UUM Educare) portal where the main URL, www.e-

web.uum.edu.my are divided into two main stages or section. Each

stages having their own phases with certain sub activities. The

first stages are including log data retrieving from the UUM

14

Educare server where the Data Selection and Data Preprocessing

phases are directly involves. The second stages are the mining

stages where its will involving Pattern Discovery by applying

Association Rules and Pattern Analysis phases in order to

discovered the UUM Educare portal usage pattern.

General Pattern Analysis Results (access pattern and users behaviors –

descriptive statistic)

The UUM Educare portal has several options (dms, resources,

announcement, assessment, calendar and forum) that can be chooses by the

users. Based on the Universal Resource Locator (URL) stem, the

users only accessed the UUM Educare options in host of www.e-

web.uum.edu.my without navigating other sub options provided by

UUM Educare portal are shows on the figure below.

15

Association Rules Results (support and confidence of the different options)

Figure below shows the results that represent the support

and confidence for each option that being provided at UUM Educare

portal where the main host, www.e-web.uum.edu.my and continue

with main URL path for E Learning portal. There are 14 options

being provided on UUM Educare and for this analysis, the total

transactions for the options path are 10 578 transactions are

selected.

Based on the Fig.6 above, it can be conclude that, /main

path will be the most requested page and it followed by the /dms

path where the documents downloading are can done here. /main

path is the top level for UUM Educare and it display the basic

information about the portfolio/subject includes subject area,

discipline, owner and subject description. With /main and other

options path, user also can select other options that provided by

UUM Educare portal.

16

Fig. 7 shows that support and confidence for each

directories and the /dms option path with 36.45% percentage of

support and confidence where it is a highest percentage for

support and also for the confidence level. It’s followed by /main

and /assessment option path where the support and confidence is

15.02% and 13.05%.

Association Rules induction is the extraction of rules in

the form of X => Y (if X then Y) quantified with a confidence

(proportion of occurrences that verifies Y among occurrences that

verifies X) and a support (proportion of occurrences that

verifies X and Y among all occurrences).

A next result shows the Association Rules, including support

and confidence by applying Apriori algorithm for identifying the

patterns, defining a threshold of 15% for the minimum support and

a threshold of 70% for the minimum confidence. Fig. 8 shows the

pages that related to each other where the most frequent options

that being selected during the certain options is selected.

17

Based on the figure above, it can conclude that the rule

with higher support (22.0%) means, “if in a session the user

selected the /announcement and /main options path, user also

selected the /dms option path”; the rule with higher confidence

(99.1%) says that “if in a session, the user selected the

/announcement and /dms option path, user also selected the /main

option path”. Next figure (Fig. 9) represent a graphical chart

for the 6 most accepted rules for options relationship.

Fig. 10 below show the Association Hyperedges for UUM

Educare that represents the portal pages those orderly archived.

A threshold of 10% for the minimum support and a threshold of 75%

for the minimum confidence are being used. It shows that support

of 11.7% is the high percentage of transactions that contain all

18

items appearing in the hyperadge, that is in the /assignment

/announcement /main with the percentages of confidence is 78.1%.

The confidence of 92.2% with 10.6% of support is on

/assignment /announcement /assessment /main is represent the

highest of average confidence of all rules that can be formed

using the items in the hyperedge with all items appearing in the

rule (average confidence of the rules including /main <-

/assessment /announcement /assignment, /assessment <- /main

/announcement /assignment, /announcement <- /main /assessment

/assignment and /assignment <- /main /assessment /announcement.

The following figure (Fig. 11) shows 6 most accepted

Association Rules Hyperedge for UUM Educare server log files

during selected period of time.

19

5.0 PROJECT SIGNFICANCE

Generally, this project will produce the useful finding for

analyzing the Web usage pattern for UUM ELearning, www.e-

web.uum.edu.my and more specific:

i. This study will become the first step for the analyzing

university E-Learning portal by applying Web usage mining

approach with basic Association Rules – Apriori algorithm.

ii. The outcomes from this study can be used by the Web

administrator in order to plan necessary improvement,

20

enhancement and valuable actions to the university E-

Learning portal.

iii. The implementation of Web usage mining process for

university E-Learning portal may becomes the guide line for

the system development purposes.

6.0 CONCLUSION AND RECOMMENDATION

Web Usage Mining is an active field for research and it will

generate new hopes in internet based business. Web Usage Mining

applications are being used in some famous Websites and this

paper totally focuses on education field. This paper presents a

brief introduction of Web mining technique, apart of the data

mining technologies and also the implementation of the Web Usage

Mining in E-Learning portal. Server log files of UUM E-Learning

(UUM Educare) in server host, www.e-web.uum.edu.my have been

selected for this project. In order to perform the Web Usage

Mining, the methodology that being introduce by (Srivastava et al.,

2000) becomes major guide where it includes three main phases;

Data Preprocessing, Pattern Analysis and Pattern Discovery. All the particular

phases were done carefully to produce quality results. Data

Processing phase for the Web Usage Mining is a challenging task

and basic Association Rules algorithm, Apriori algorithm was

21

selected as a technique to produce the support and confidence of

the different levels in Web usage mining of UUM Eduacare portal.

The selection of the Apriori algorithm for performing Web

usage mining on UUM Educare portal is because of Apriori algorithm

is a common data mining technique for association based analysis.

By applying this algorithm to the Web log file, the relationship

between the accessed pages can be mined. The Web usage patterns

and user behavior on UUM Educare portal also can analyze by using

this algorithm where the descriptive statistic approach can’t

perform this analysis. The results and findings for this analysis

are more reliable but less of accuracy because of the Apriori

algorithm properties where the same selected itemsets are always

counted. The results or findings from this experimental analysis

are surely useful for Web administrator in order to improve Web

services and performance through the improvement of Web sites,

including their contents, structure, presentation and delivery.

The valuable actions may contain of performing the Web pages

value added modification.

As a recommendation, in order to enhance and continue this

project, the suggested methodology can be implemented for system

development purposes. The system may perform and implement the

Web usage mining phase including data selection, data

22

preprocessing, pattern discovery and analysis. Apriori algorithm

may be a part of the pattern discovery sub function.

Beside that, there are certain important recommendation may

propose for improving the results during project implementation.

As discussed in chapter V, Findings and Results, there is placed

there the Association Rules with percentages support and

confidence for 10 most requested options in UUM Educare portal

and also represent the Association Rules for /dms option. The

figures for each support and confidence in particular results are

same because the step calculation for Apriori algorithm. According

to Boon Lay et., al (1999), the percentages of support for

Association Rules can be improve by finding all sets items

(attribute=value) that have transactions support above minimum

support. Itemsets with minimum support are called large itemsets

and the large itemsets are used to generate the desired rules.

Lastly, for future work, the another method for analyzing

sparse data can be used in the study of E-Learning Web log

access, use of different similarity Association Rules and

conclude about the most suitable alternatives for knowledge

extraction from Web log data.

REFERENCES

23

Abd. Wahab, M. H, Siraj, F and Yusoff, N. (2004). Log Mining Using

Generalize Association Rules. In Proceedings of Master Final

Project 2004 Presentation, UUM, Malaysia.

Agrawal, R., Imielinski, T. and Swami, A. (1993). Mining Association

Rules between Sets of Items in Large Databases. In Proceedings of the

International ACM SIGMOD Conference, Washington DC, USA,

pages 207–216.

Agrawal, R. and Srikant, R. (1994). Fast Algorithm for Mining Association

Rules. Proc. of the 20th VLDB Conference. Pp 487-499.

Agrawal, R., and Srikant, R. (1995). Mining Sequential Patterns. In

Proc. of the Eleventh International Conference on Data

Engineering (ICDE), Taiwan. Pp 3-14.

Batista, P and Silva, M (2001). Prospeccao dos Dados de Acesso a um

Servidor de Noticias na Web, 2nd Coferencia sobre Redes de

Computadores, Evora, Portogal.

Boon Lay, C, Khalid, M and Yusof, R. (!999). Intelligent Database

by Neural Network and Data Mining. In Proc. of Artificial

24

Intelligent Applications in Industry, Kuala Lumpur. Pp 201-

219.

Borgelt, C. (2004). Apriori: Finding Association Rules/Hyperedges with the

Apriori Algorithm. School of Computer Science, University of

Magdeburg.

Chen, M.-S., Jan, J., Yu, P.S. (1996). Data Mining: An Overview from a

Database Perspective. IEEE Transactions on Knowledge and Data

Engineering, (8:6). Pp 866.883.

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, R. (1996). The KDD

Process for Extracting Useful Knowledge from Volumes of Data.

Communications of the ACM, (39:11). Pp 27-34.

Han, J., Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan-

Kaufmann Academic Press, San Francisco.

Jiang, Q. (2003). Web Usage Mining: Process and Application. Presentation

for CSE 8331.

25

Kosala, R., Blockeel, H. (2000). Web Mining Research: A Survey. ACM

SIGKDD (Special Interest Group on Knowledge Discovery and

Data Mining) Explorations. June, (2:1). Pp 1-10.

Mannila, H., Toivonen, H. and Verkamo, A. I. (1994). Efficient

Algorithms for Discovering Association Rules. In AAAIWorkshop on

Knowledge and Discovery in Databases, Seattle, Washington,

USA, Pp 181–192.

Srivasta, J., Cooley, R., Deshpande, M., and Tan P. N. (2000).

Web Usage Mining: Discovery and Application of Web Usage Pattern from Web

Data. Department of Computer Science and Engineering,

University of Minnesota.

Tang, C.; Lau, R.W.H.; Li, Q.; Yin, H.; Li, T.; and Kilis, D.

(2000). Personalized Courseware Construction Based on Web Data Mining.

In Proc. of the First International Conference on Web

Information Systems Engineering (WISE 2000) vol.2, Pp. 204-

211.

26

Tang, Y. T. and McCalla, G. (2001). Student modeling for a Web based

Learning Environment: a Data Mining Approach. Department of Computer

Science, University of Saskatchewan, Canada.

Xue, G. R., Zeng, H. J., Ma, W. Y and Lu, C. J. (2002). Log Mining

to Improve the Performance of the Methods from statistic, Neural Nets, Machine

Learning and Experts System. Morgan Kaufman.

Zaiane, O. and Luo, J. (2001). Towards Evaluating Learners’ Behavior in a

Web-based Distance Learning Environment. In Proc. of IEEE

International Conference on Advanced Learning Technologies,

Pp 357-360, Madison, WI.

27

Documents

WEB USAGE MINING USING APRIORI ALGORITHM UUM LEARNING CARE