
WEB MINING

1. Starter
2. A Taxonomy of Web Mining
   2.1 Web Content Mining
       2.1.1 Web Crawler
       2.1.2 Harvest System
       2.1.3 Virtual Web View
       2.1.4 Personalization
3. Web Structure Mining
   3.1 Page Rank
       3.1.1 Important Pages
4. Web Usage Mining
5. The WEBMINER system
   5.1 Browsing Behavior Models
   5.2 Developer's Model
6. Preprocessing
7. Data Structures
8. Finding Unusual Itemsets
   8.1 The DICE Engine
   8.2 Books and Authors
   8.3 What is a pattern?
   8.4 Data Occurrences
   8.5 Finding Data Occurrences Given Data
   8.6 Building Patterns from Data Occurrences
   8.7 Finding Occurrences Given Patterns


SUSHIL KULKARNI

[email protected]

1. Starter

On the World Wide Web there are billions of documents spread over millions of different web servers. The data on the web is called web data and is classified as follows:

(1) Content of Web pages. These pages have the following structures:

(a) Intra-page structure, which includes the HTML or XML code for the page.
(b) Inter-page structure, which includes the actual linkage of different web pages.

(2) Usage data that describe how web pages are accessed by visitors.

(3) Usage profiles that give the characteristics of visitors, including demographics, psychographics, and technographics.

Demographics are tangible attributes such as home address, income, purchasing responsibility, or recreational equipment ownership.

Psychographics are personality types that might be revealed in a psychological survey, such as highly protective feelings toward children (commonly called "gatekeeper moms"), impulse-buying tendencies, early technology interest, and so on.

Technographics are attributes of the visitor's system, such as operating system, browser, domain, and modem speed.

With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools in order to find, extract, filter, and evaluate the desired information and resources. In addition, with the transformation of the Web into the primary tool for electronic commerce, it is imperative for organizations and companies, who have invested millions in Internet and Intranet technologies, to track and analyze user access patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge both across the Internet and in particular Web localities.

Web mining can be broadly defined as:

The discovery and analysis of useful information from the World Wide Web.

This definition covers the automatic search and retrieval of information and resources available from millions of sites and on-line databases, which is called Web content mining. The discovery and analysis of user access patterns from one or more Web servers or on-line services is called Web usage mining. The structure of the Web organization is modeled using Web structure mining.

In this chapter, we provide an overview of tools, techniques, and problems associated with the three dimensions above.


There are several important issues, unique to the Web paradigm, that come into play if sophisticated types of analyses are to be done on server-side data collections. They include:

o The necessity of integrating various data sources such as server access logs, referrer logs, and user registration or profile information
o Resolving difficulties in the identification of users due to missing unique key attributes in collected data
o The importance of identifying user sessions or transactions from usage data, site topologies, and models of user behavior.

2. A Taxonomy of Web Mining

In this section we present a taxonomy of Web mining along its three primary dimensions, namely Web content mining, Web structure mining and Web usage mining. This taxonomy is depicted in the following figure.

[Figure: Taxonomy of Web Mining. Web Mining branches into Web Content Mining (Web Page Content Mining and Search Result Mining), Web Structure Mining, and Web Usage Mining (General Access Pattern Tracking and Customized Usage Mining).]

2.1 Web Content Mining

The World Wide Web contains information sources that are heterogeneous and unstructured, such as hypertext and extensible markup documents. This makes it difficult to locate Web-based information automatically, as well as to organize and manage it.

Web content mining is helpful for retrieving pages, locating and ranking relevant web pages, browsing through relevant and related web pages, and extracting and gathering the information from the web pages.

Traditional search and indexing tools of the Internet and the World Wide Web, such as Lycos, Alta Vista, WebCrawler, ALIWEB, MetaCrawler, and others, provide some comfort to users, but they do not generally provide structural information nor categorize, filter, or interpret documents. Here we describe a few tools that are commonly used.

2.1.1 Web Crawler

Let us first concentrate on searching for Web pages. The significant problems in searching for a particular web page are as follows:

(a) Scale: The Web grows at a faster rate than machines and disks.

(b) Variety: The Web pages are not always documents.

(c) Duplicates: There are various Web pages that are mirrored or copied.

(d) Domain Name Resolution: A symbolic address must be mapped to an IP address, and this resolution takes a long time.

To tackle these problems, one can use a program called a web crawler.

A web crawler (or robot or spider) is a program which automatically traverses the web by downloading documents and following links from page to page. The page from which the crawler starts is called the seed URL. The links found on this page are stored in a queue for the search engine; the new pages they point to are searched in turn, and their links are added to the same queue.

As the crawler moves along and collects information about each page, it extracts keywords and stores them in indices for the associated search engine. Web crawlers are also known as spiders, robots, worms, etc.

One such crawler is implemented in Java and designed to scale to tens of millions of web pages. The main components of the system are a URL list, a downloader, a link extractor, and so on.

The crawler was designed so that at most one worker thread will download from a given server. This was done to avoid overloading any servers. The crawler uses user-specified URL filters (domain, prefix, and protocol) to decide whether or not to download documents. It is possible to use the conjunction, disjunction or negation of filters.
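The crawl loop just described (a seed URL, a queue of links, and user-specified filters) can be sketched in a few lines. The following Python sketch is illustrative only and is not the Java system mentioned above; the filter function, the seed URL and the helper names are assumptions, and politeness rules such as robots.txt handling are ignored.

from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import re

def crawl(seed_url, url_filter, max_pages=100):
    """Breadth-first crawl: start from the seed URL, follow links,
    and visit only URLs accepted by url_filter."""
    queue = deque([seed_url])        # frontier of URLs still to visit
    visited = set()                  # URLs already downloaded
    index = {}                       # url -> extracted keywords (toy index)
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited or not url_filter(url):
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                 # skip unreachable pages
        # Store a few keywords for the associated search engine (toy version).
        index[url] = re.findall(r"[a-z]{4,}", html.lower())[:20]
        # Extract links and append the new ones to the queue.
        for href in re.findall(r'href=["\'](.*?)["\']', html):
            link = urljoin(url, href)
            if link not in visited:
                queue.append(link)
    return index

# Example of a domain filter (conjunctions and negations can be built similarly):
in_domain = lambda u: urlparse(u).netloc.endswith("example.org")
# index = crawl("http://example.org/", in_domain)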

Crawlers are of different types and are discussed below:

a. Traditional Crawler: A traditional crawler visits the entire Web, gathers the information, and builds the indices, replacing the existing index.


b. Periodic Crawler: A periodic crawler visits a specific number of pages and stops. It builds the index and replaces the existing index. This crawler is activated periodically.

c. Incremental Crawler: An incremental crawler is one which updates an existing set of downloaded pages instead of restarting the crawl from scratch each time. This involves some way of determining whether a page has changed since the last time it was crawled.

d. Focused Crawler: A general-purpose web crawler normally tries to gather as many pages as it can from a particular set of sites. In contrast, a focused crawler is designed to gather only documents on a specific topic, thus reducing the amount of network traffic and downloads.

One approach to searching hypertext documents is based on a depth-first search algorithm. This algorithm uses the school-of-fish metaphor, with multiple processes or threads following links from pages. The "fish" follow more links from relevant pages, based on keyword and regular expression matching. This type of system can place heavy demands on the network, and various caching strategies are used to cope with this.

A focused crawler starts from a canonical topic taxonomy and user-specified starting points (e.g. bookmarks). A user marks interesting pages as they browse, and these are then placed in a category in the taxonomy.

The main components of the focused crawler are a classifier, a distiller and a crawler. The classifier makes relevance judgements on pages to decide on link expansion, and the distiller determines the centrality of pages to set visit priorities. This is based on connectivity analysis and uses the harvest ratio, which is the rate at which relevant pages are acquired and irrelevant pages are effectively filtered away.

To use a focused crawler, the user identifies the name of the topic that (s)he is looking for. While the user browses the Web, (s)he identifies the documents that are of interest. These are then classified based on a hierarchical classification tree, and nodes in the tree are marked as good, indicating that the node has associated with it document(s) that are of interest. These documents are then used as the seed documents to begin the focused crawling. During the crawling phase, relevant documents are found and the crawler determines whether it makes sense to follow the links out of these documents. Each document is classified into a leaf node of the taxonomy tree.

In recent years intelligent tools have been developed for information retrieval, such as intelligent Web agents, and database and data mining techniques have been extended to provide a higher level of organization for semi-structured data available on the Web. We summarize these efforts below.

[A] Agent Based Crawling

Many software agents have been developed for web crawling on the Internet, or created to act as browsing assistants.

The InfoSpiders system (formerly ARACHNID) is based on an ecology of agents which search through the network for information. As an example, a user's bookmarks could be used as a starting point, with the agents then analyzing the "local area" around these start points. Link relevance estimates are used to move to another page, with agents being rewarded with energy (credit) if documents appear to be relevant. Agents are charged energy costs for using network resources, and use user assessments if a document has been previously visited. If an agent moves off-topic it will eventually die off due to loss of energy.

[B] Database Approach

The database approaches to Web mining have generally focused on techniques for integrating and organizing the heterogeneous and semi-structured data on the Web into more structured and high-level collections of resources, such as relational databases, and using standard database querying mechanisms and data mining techniques to access and analyze this information.

(i) Multilevel Databases

The main idea behind it is that the lowest level of the database contains primitive semi-structured information stored in various Web repositories, such as hypertext documents. At the higher level(s), meta-data or generalizations are extracted from the lower levels and organized in structured collections such as relational or object-oriented databases.

(ii) Web Query Systems

There have been many Web-based query systems and languages developed recently that attempt to utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing for accommodating the types of queries that are used in World Wide Web searches. We mention a few examples of these Web-based query systems here.

W3QL: Combines structure queries, based on the organization of hypertext documents, and content queries, based on information retrieval techniques.

WebLog: A logic-based query language for restructuring extracted information from Web information sources.

Lorel and UnQL: Query heterogeneous and semi-structured information on the Web using a labeled graph data model.

TSIMMIS: Extracts data from heterogeneous and semi-structured information sources and correlates them to generate an integrated database representation of the extracted information.

WebML: A query language to access documents using data mining operations, based on the use of concept hierarchies and the following keywords.

1. COVERS: One concept covers another.
2. COVERED BY: This is the reverse of the above.
3. LIKE: One concept is similar to another.
4. CLOSE TO: One concept is close to another.

Following is an example of a WebML query:

SELECT *
FROM document in www.ed.smm.edu
WHERE ONE OF keywords COVERS “dog”

2.1.2 Harvest System

This system is based on the use of caching, indexing and crawling. It is a set of tools that are used to collect information from different sources, and it is designed on the basis of collectors and brokers.

A collector obtains the information for indexing from an Internet service provider, while a broker provides the index and query interface. The relationship between collectors and brokers can vary: a broker may interface directly with a collector or may go through other brokers to get to the collectors. Indices in Harvest are topic-specific, as are brokers.

2.1.3 Virtual Web View

This approach is based on the multilevel database approach discussed earlier.

2.1.4. Personalization

With Web personalization, users can get more information on the Internet faster because Web sites already know their interests and needs. But to gain this convenience, users must give up some information about themselves and their interests, and with it some of their privacy. Web personalization is made possible by tools that enable Web sites to collect information about users.

One of the ways this is accomplished is by having visitors to a site fill out forms with information fields that populate a database. The Web site then uses the database to match a user's needs to the products or information provided at the site, with middleware facilitating the process by passing data between the database and the Web site.

An example is Amazon.com Inc.'s ability to suggest books or CDs users may want to purchase based on interests they list when registering with the site.

Customers tend to buy more when they know exactly what is available at the site and do not have to hunt around for it. Cookies may be the most recognizable personalization tools. Cookies are bits of code that sit in a user's Internet browser memory and tell Web sites who the person is; that is how a Web site is able to greet users by name.


A less obvious means of Web personalization is collaborative-filtering software that resides on a Web site and tracks users' movements. Wherever users go on the Internet, they cannot help but leave footprints, and software is getting better at reading the paths users take across the Web to discern their interests and viewing habits, from the amount of time they spend on one page to the types of pages they choose.

Collaborative-filtering software compares the information it gains about one user's behavior against data about other customers with similar interests. In this way, users get recommendations like Amazon's "Customers who bought this book also bought ...".

These are "rules-based personalization systems". If you have historical information, you can buy data-mining tools from a third party to generate the rules. Rules-based personalization systems are usually deployed in situations where there are limited products or services offered, such as insurance and financial institutions, where human marketers can write a small number of rules and walk away.

Other personalization systems, such as Andromedia LikeMinds, emphasize automatic real-time selection of items to be offered or suggested. Systems that use the idea that "people like you make good predictors for what you will do" are called "collaborative filters". These systems are usually deployed in situations where there are many items offered, such as clothing, entertainment, office supplies, and consumer goods. Human marketers struggle to determine what to offer to whom when there are thousands of items to offer, so automatic systems are usually more effective in these environments. Personalizing from large inventories is complex, unintuitive, and requires processing huge amounts of data.

Let us consider an example. Ms. Heena Mehta shops online through the web site ABC.com. She first logs in using an ID. This ID is used to keep track of what she purchases as well as which pages she visits. The data mining tools of ABC.com use her purchases and web usage data to develop a detailed user profile for Heena. This profile is later used to display advertisements. For example, suppose Heena purchased a bulk of chocolates last week, and today she logs in and goes to the page that contains attractive Barbie dolls. While she is looking at the page, ABC shows a banner ad about a special sale on milk chocolates. Heena cannot resist: she immediately follows the link, adds the chocolates to her shopping list, and then returns to the page with the Barbie she wants.

3. Web Structure Mining

Web structure mining builds a model of the Web organization, or of a portion thereof. This can be used to classify web pages or to create similarity measures between documents. We have already seen some structure mining ideas in the previous article; those approaches used structure to improve the effectiveness of search engines and crawlers.

Following are two techniques used for structure mining.


3.1 Page Rank

PageRank is one of the methods Google uses to determine a page's relevance or importance. The PageRank value for a page is calculated based on the number of pages that point to it. This is actually a measure based on the number of backlinks to a page.

PageRank is displayed on the toolbar of your browser if you've installed the Google toolbar (http://toolbar.google.com/). But the Toolbar PageRank only goes from 0 to 10 and seems to be something like a logarithmic scale:

Toolbar PageRank (log base 10)    Real PageRank
0                                 0 - 100
1                                 100 - 1,000
2                                 1,000 - 10,000
3                                 10,000 - 100,000
4                                 and so on...

Following are some of the terms used:

(1) PR: Shorthand for PageRank: the actual, real page rank for each page as calculated by Google.

(2) Toolbar PR: The PageRank displayed in the Google toolbar in your browser. This ranges from 0 to 10.

(3) Backlink: If page A links out to page B, then page B is said to have a "backlink" from page A.

We can't know the exact details of the scale because the maximum PR of all pages on the web changes every month when Google does its re-indexing! If we presume the scale is logarithmic, then Google could simply give the highest actual PR page a toolbar PR of 10 and scale the rest appropriately.

Thus the question is: what is PageRank? Let us answer it.

PageRank is a "vote", by all the other pages on the Web, about how important a page is. A link to a page counts as a vote of support. If there is no link, there is no support (but it is an abstention from voting rather than a vote against the page).

Another definition, given by Google, is as follows:

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor, which can be set between 0 and 1. We usually set d to 0.85. Also, C(A) is defined as the number of links going out of page A.

The PageRank of a page A is given as follows:


PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

Let us break down PageRank or PR(A) into the following sections:

1. PR(Tn) - Each page has a notion of its own self-importance. That's "PR(T1)" for the first page in the web, all the way up to "PR(Tn)" for the last page.

2. C(Tn) - Each page spreads its vote out evenly amongst all of its outgoing links. The count, or number, of outgoing links for page 1 is "C(T1)", "C(Tn)" for page n, and so on for all pages.

3. PR(Tn)/C(Tn) - So if our page (page A) has a backlink from page "n", the share of the vote page A will get is "PR(Tn)/C(Tn)".

4. d(...) - All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is "damped down" by multiplying it by 0.85 (the factor "d").
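The formula can be evaluated directly once the PageRank and out-link count of every page that links to A are known. A minimal Python sketch; the function name and the example numbers are illustrative assumptions, not Google's implementation:

def pagerank_of(backlinks, d=0.85):
    """PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).
    backlinks is a list of (PR(Ti), C(Ti)) pairs for the pages Ti
    that link to A."""
    return (1 - d) + d * sum(pr / c for pr, c in backlinks)

# Example: A has two backlinks, from a page with PR 4 and 10 out-links
# and from a page with PR 2 and 4 out-links.
print(round(pagerank_of([(4, 10), (2, 4)]), 3))   # 0.15 + 0.85 * (0.4 + 0.5) = 0.915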

3.1.1 Important Pages

A page is important if important pages link to it. Following is the method of page rank.

Create a stochastic matrix of the Web; that is:

1. Each page i corresponds to row i and column i of the matrix.
2. If page j has n successors (links), then the (i, j)th entry is 1/n if page i is one of these n successors of page j, and 0 otherwise.

The goal behind this matrix is:

o Imagine that initially each page has one unit of importance. At each round, each page shares whatever importance it has among its successors, and receives new importance from its predecessors.

o Eventually, the importance of each page reaches a limit, which happens to be its component in the principal eigenvector of this matrix.

o That importance is also the probability that a Web surfer, starting at a random page and following random links from each page, will be at the page in question after a long series of links.

Let us consider a few examples.


Example 1: Assume that the Web consists of only three pages, say Netscape, Microsoft, and Amazon. The links among these pages are as shown in the following figure.

[Figure: three pages N (Netscape), M (Microsoft) and A (Amazon). Netscape links to itself and to Amazon; Microsoft links to Amazon; Amazon links to Netscape and to Microsoft.]

Let [n; m; a] be the vector of importances for the three pages Netscape, Microsoft, Amazon, in that order. Then the equation describing the asymptotic values of these three variables is:

[ n ]     [ 1/2   0   1/2 ] [ n ]
[ m ]  =  [  0    0   1/2 ] [ m ]
[ a ]     [ 1/2   1    0  ] [ a ]

For example, the first column of the matrix reflects the fact that Netscape divides its importance between itself and Amazon. The second column indicates that Microsoft gives all its importance to Amazon. The third column indicates that Amazon gives all its importance to Netscape and Microsoft.

We can solve equations like this one by starting with the assumption n = m = a = 1, and applying the matrix to the current estimate of these values repeatedly. The first four iterations give the following estimates:

n = 1   1     5/4   9/8    5/4
m = 1   1/2   3/4   1/2    11/16
a = 1   3/2   1     11/8   17/16

In the limit, the solution is n = a = 6/5, m = 3/5. That is, Netscape and Amazon each have the same importance, and twice the importance of Microsoft.

Note that we can never get absolute values of n, m, and a, just their ratios, since the initial assumption that they were each 1 was arbitrary.



Since the matrix is stochastic (the sum of each column is 1), the above relaxation process converges to the principal eigenvector.
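The relaxation process of Example 1 takes only a few lines of code. The sketch below hard-codes the 3x3 matrix derived above and reproduces the iterates given in the text, converging to n = a = 6/5 and m = 3/5 (illustrative only; a real implementation would use a sparse representation of the Web):

# Power iteration on the stochastic matrix of Example 1.
# Rows/columns are ordered Netscape, Microsoft, Amazon.
M = [[0.5, 0.0, 0.5],   # Netscape receives from itself and from Amazon
     [0.0, 0.0, 0.5],   # Microsoft receives from Amazon
     [0.5, 1.0, 0.0]]   # Amazon receives from Netscape and Microsoft

v = [1.0, 1.0, 1.0]      # one unit of importance per page
for _ in range(50):
    v = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

print(v)   # approximately [1.2, 0.6, 1.2], i.e. n = a = 6/5, m = 3/5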

Following are the problems that are faced on the Web:

a. Dead ends: a page that has no successors has nowhere to send its importance. Eventually, all importance will "leak out of" the Web.

b. Spider traps: a group of one or more pages that have no links out of the group will eventually accumulate all the importance of the Web.

Let us consider the following example.

Example 2: Suppose Microsoft tries to duck charges that it is a monopoly by removing all links from its site. The new Web is as shown in the following figure.

[Figure: Netscape links to itself and to Amazon; Amazon links to Netscape and to Microsoft; Microsoft now has no outgoing links.]

The matrix describing transitions is:

[ n ]     [ 1/2   0   1/2 ] [ n ]
[ m ]  =  [  0    0   1/2 ] [ m ]
[ a ]     [ 1/2   0    0  ] [ a ]

Microsoft has become a dead end: the entries of its column (the second column) are all zeros.

The first four steps of the iterative solution are:

n = 1   1     3/4   5/8   1/2
m = 1   1/2   1/4   1/4   3/16
a = 1   1/2   1/2   3/8   5/16

Eventually, each of n, m, and a becomes 0; i.e., all the importance has leaked out.



Example 3: Angered by the decision, Microsoft decides it will link only to itself from now on. Now Microsoft has become a spider trap. The new Web is shown in the following figure.

[Figure: Netscape links to itself and to Amazon; Amazon links to Netscape and to Microsoft; Microsoft links only to itself.]

The matrix describing transitions, and the equation to solve, is:

[ n ]     [ 1/2   0   1/2 ] [ n ]
[ m ]  =  [  0    1   1/2 ] [ m ]
[ a ]     [ 1/2   0    0  ] [ a ]

The first four steps of the iterative solution are:

n = 1 1 3/4 5/8 1/2

m = 1 3/2 7/4 2 35/16

a = 1 1/2 1/2 3/8 5/16

Now m converges to 3, and n = a = 0.

Following is the Google solution to dead ends and spider traps.

Instead of applying the matrix directly, one can "tax" each page some fraction of its current importance, and distribute the taxed importance equally among all pages. Consider the following example.

Example 4: If we use a 20% tax, the equation of Example 3 becomes:

[ n ]          [ 1/2   0   1/2 ] [ n ]     [ 0.2 ]
[ m ]  =  0.8  [  0    1   1/2 ] [ m ]  +  [ 0.2 ]
[ a ]          [ 1/2   0    0  ] [ a ]     [ 0.2 ]


The solution to this equation is n = 7/11, m = 21/11, a = 5/11.

Note that the sum of the three values is not 3, but the importance is distributed more reasonably than in Example 3.
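The taxed iteration of Example 4 can be checked numerically with the same kind of sketch as before (again illustrative, with the Example 3 matrix hard-coded); it converges to 7/11, 21/11 and 5/11:

# Iteration with a 20% tax: v = 0.8 * (M v) + 0.2, as in Example 4.
M = [[0.5, 0.0, 0.5],   # Netscape
     [0.0, 1.0, 0.5],   # Microsoft (the spider trap: it links only to itself)
     [0.5, 0.0, 0.0]]   # Amazon

v = [1.0, 1.0, 1.0]
for _ in range(100):
    v = [0.8 * sum(M[i][j] * v[j] for j in range(3)) + 0.2 for i in range(3)]

print([round(x, 4) for x in v])   # [0.6364, 1.9091, 0.4545] = [7/11, 21/11, 5/11]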

The use of PageRank to measure importance, rather than the more naive "number of links into the page", also protects against spammers. The naive measure can be fooled by a spammer who creates 1000 pages that mutually link to one another, while PageRank recognizes that none of the pages have any real importance.

4. Web Usage Mining

Web usage mining is the Web mining activity that involves the automatic discovery of user access patterns from one or more Web servers. It has become very important as business is increasingly conducted on the Internet via the World Wide Web, complementing traditional strategies and techniques for market analysis. Corporations generate and collect large volumes of data in their daily operations.

Web usage mining performs mining on Web usage data, or web logs. A Web log is a listing of page reference data; it is sometimes termed clickstream data because each entry corresponds to a mouse click. These logs can be examined from either a client perspective or a server perspective. When evaluated from a server perspective, mining uncovers information about the sites where the service resides, and it can be used to improve the design of the sites. By evaluating a client's sequence of clicks, information about the user (or group of users) is determined. This could be used to perform prefetching and caching of pages.

For example, the Webmaster of the XYZ company found that a high percentage of users have the following pattern of reference to pages: {A, B, A, D}. This means that a user accesses page A, then page B, then goes back to page A, and finally to page D. Based on this observation, he determines that a direct link is needed from page B to page D, and he then adds this link.

Web usage mining consists of three activity steps given below:

1. Preprocessing activities center around reformatting the Web log data before processing.

2. Pattern discovery activities form the major portion of the mining activities, because these activities look for hidden patterns within the log data.

3. Pattern analysis is the process of looking at and interpreting the results of the discovery activities.

We will learn about these activities in the following articles. It should be noted that the web application is quite different from other traditional data mining applications, such as the "goods basket" model. We can interpret this problem from two aspects:


1. Weak Relations between user and site:

Visitors can access the web site at any time from any place, even without any clear idea about what they want from the web. On the other hand, it is not easy for the site to discriminate between different users. The WWW brings great freedom and convenience for users and sites, and great variety among them as well. So the relation between supply and demand becomes weak and vague.

2. Complicated behaviours:

Hyperlinks and backtracking are two important characteristics of the web environment, and they make users' activities more complicated. Different users can access the same content with different patterns. Also, a user's behavior is recorded as a visiting sequence in web logs, which cannot exactly reflect the user's real behavior or the web site structure.

5. The WEBMINER system

This system divides the Web usage mining process into three main parts, as shown in the following figure.

Input data consists of the three server logs (access, referrer, and agent), the HTML files that make up the site, and any optional data such as registration data or remote agent logs. The first part of Web usage mining, called preprocessing, includes the domain-dependent tasks of data cleaning, user identification, session identification, and path completion.

Data cleaning is the task of removing log entries that are not needed for the mining process.

User identification is the process of associating page references, even those with the same IP address, with different users. The site topology is required in addition to the server logs in order to perform user identification.

Session identification takes all of the page references for a given user in a log and breaks them up into user sessions. As with user identification, the site topology is needed in addition to the server logs for this task.

Path completion fills in page references that are missing due to browser and proxy server caching. This step differs from the others in that information is being added to the log.

As shown in the figure, mining for association rules requires the added step of transaction identification, in addition to the other preprocessing tasks. Transaction identification is the task of identifying semantically meaningful groupings of page references. In a domain such as market basket analysis, a transaction has a natural definition: all of the items purchased by a customer at one time. However, the only "natural" transaction definition in the Web domain is a user session, which is often too coarse-grained for mining tasks such as the discovery of association rules. Therefore, specialized algorithms are needed to redefine single user sessions into smaller transactions.

The knowledge discovery phase uses existing data mining techniques to generate rules and patterns. Included in this phase is the generation of general usage statistics, such as the number of "hits" per page, the pages most frequently accessed, the most common starting page, and the average time spent on each page. Association rule and sequential pattern generation are the only data mining algorithms currently implemented in the WEBMINER system, but the open architecture can easily accommodate any data mining or path analysis algorithm. The discovered information is then fed into various pattern analysis tools. The site filter is used to identify interesting rules and patterns by comparing the discovered knowledge with the Web site designer's view of how the site should be used, as discussed in the next section. As shown in Fig. 2, the site filter can be applied either to the data mining algorithms, in order to reduce the computation time, or to the discovered rules and patterns.

5.1 Browsing Behavior Models

In some respects, Web usage mining is the process of reconciling the Web site developer's view of how the site should be used with the way users are actually browsing through the site.

5.2 Developer’s Model

The Web site developer's view of how the site should be used is inherent in the structure of the site. Each link between pages exists because the developer believes that the pages are related in some way. Also, the content of the pages themselves provides information about how the developer expects the site to be used. Hence, an integral step of the preprocessing phase is classifying the site pages and extracting the site topology from the HTML files that make up the web site. The topology of a Web site can easily be obtained by means of a site "crawler" that parses the HTML files to create a list of all of the hypertext links on a given page, and then follows each link until all of the site pages are mapped.

The WEBMINER system recognizes five main types of pages:

Head Page - a page whose purpose is to be the first page that users visit, i.e. "home" pages.

Content Page - a page that contains a portion of the information content that the Web site is providing.

Navigation Page - a page whose purpose is to provide links to guide users on to content pages.

Look-up Page - a page used to provide a definition or acronym expansion.

Personal Page - a page used to present information of a biographical or personal nature for individuals associated with the organization running the Web site.

Each of these types of pages is expected to exhibit certain physical characteristics.

6. Preprocessing

The tasks of data preparation before processing in web usage mining include the following:

A. Collection of usage data for web visitors:

Most usage data are recorded in the form of web server logs. Some services require user registration, or record usage data in other file formats.

Let us first define clicks and logs as follows:

P is a set of literals, called pages or clicks. U is a set of users. A log is defined as a set of triplets {(ui, pi, ti) : ui ∈ U, pi ∈ P}, where ti is a timestamp.

Standard log data consist of a source site, a destination site and a time stamp. The source and destination sites can be any URL or IP address. In this definition, the user ID locates the source site and a page ID identifies the destination. Information about the browser may also be included.

B. User identification:


It is easy to identify different users when registration is required, though it cannot be avoided that some private personal registration information may be misused by hackers. For common web sites, however, it is not easy to identify different users, since a user can freely visit the web site. In this situation the user's IP address, cookies and other limited client information, such as the agent and the version of the OS and browser, can be used for user identification. In this step, the usage data for different users are separately collected.

C. Session construction:

After user identification, the different sessions of the same user should be reconstructed from this user's usage data collected in the second step. A session is a visit performed by a user from the time (s)he enters the web site till (s)he leaves. Two time constraints are needed for this reconstruction: one is that the duration of any session cannot exceed a defined threshold; the other is that the time gap between any two consecutively accessed pages cannot exceed another defined threshold.

In web usage mining, the time set, the user set and the web page set are the three key entities, defined as T, U and P. A session is a visit performed by a user from the time when (s)he enters the web site to the time (s)he leaves.

A session is a page sequence ordered by timestamp in the usage data record, defined as S = <p1, p2, ..., pm> (pi ∈ P, 1 ≤ i ≤ m); these pages can also form a page set S' = {p'1, p'2, ..., p'k} (p'i ∈ S, p'i ≠ p'j, 1 ≤ i, j ≤ k).

A session is alternatively defined as follows:

Let L be a log. A session S is an ordered list of pages accessed by a user, i.e. S = {(pi, ti) : pi ∈ P}, where there is a user ui ∈ U such that {(ui, pi, ti) : ui ∈ U, pi ∈ P} ⊆ L.
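A minimal sessionization sketch based on the two time constraints described above; the thresholds (30 minutes between clicks, 2 hours per session) and the input format are illustrative assumptions:

def split_sessions(clicks, max_gap=1800, max_duration=7200):
    """clicks: list of (page, timestamp) pairs for ONE user, sorted by time.
    Returns a list of sessions, each itself a list of (page, timestamp) pairs.
    A new session starts when the gap between consecutive clicks exceeds
    max_gap seconds or the session would exceed max_duration seconds."""
    sessions, current = [], []
    for page, t in clicks:
        if current and (t - current[-1][1] > max_gap or
                        t - current[0][1] > max_duration):
            sessions.append(current)
            current = []
        current.append((page, t))
    if current:
        sessions.append(current)
    return sessions

# Example: a gap of more than 30 minutes starts a second session.
clicks = [("0", 0), ("292", 60), ("300", 400), ("292", 3000), ("286", 3100)]
print(len(split_sessions(clicks)))   # 2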

D. Behavior recovery:

Reconstructed sessions are not enough to depict the variety of user navigation behaviours in web usage mining. In most cases, usage behaviour is recorded only as a URL sequence in sessions. Revisiting and backtracking add to the complexity of user navigation, so the task of recovery aims to rebuild the real user behaviour from the linear URLs in sessions.

User behaviour is recovered from the session for this user and defined as b = (S', R), where R is the set of relations among the pages of S'; all user behaviours b form a behaviour set named B.

Consider the following session:

S = <0, 292, 300, 304, 350, 326, 512, 510, 513, 512, 515, 513, 292, 319, 350, 517, 286>

In this session, the pages are labeled with IDs. The user issued 17 page access requests. Pages 0 and 286 were accessed as the entrance and exit pages respectively. Besides the entrance and exit pages, there are two groups of pages: one group is (300, 304, 326, 510, 515, 319, 517), whose pages were accessed only once, and the other group is (292, 350, 512, 513), whose pages were accessed more than once. We now explain several strategies for recovering different user behaviours.

Following is the simple behaviour recovery strategy:

This strategy is the simplest one and overlooks all the repeated pages in a session. It includes two kinds of behaviours. In the first, user behaviour is represented by only the unique accessed pages, which is the simplest recovery strategy. The simple user behaviour recovered from this session is:

S' = {0, 292, 300, 304, 350, 326, 512, 510, 513, 515, 319, 517, 286}.

In the second method, user behaviour is represented by the unique accessed pages together with the access sequence among these pages. For pages accessed more than once, we consider only their first occurrence. Based on this idea, the user behaviour for this session can be recovered as:

<0 – 292 – 300 – 304 – 350 – 326 – 512 – 510 – 513 – 515 – 319 – 517 - 286>

From the user behaviours recovered by the first method, association rules and frequent item sets can be mined in a further step. Sequential patterns can be mined from the user behaviours recovered by the second method.
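Both simple recovery strategies can be expressed in a few lines of Python; applied to the session listed above, they reproduce the page set S' and the first-occurrence sequence shown in the text (the function names are my own):

def unique_pages(session):
    """First strategy: the set of distinct pages, order ignored."""
    return set(session)

def first_occurrence_sequence(session):
    """Second strategy: keep only the first occurrence of each page,
    preserving the access order."""
    seen, sequence = set(), []
    for page in session:
        if page not in seen:
            seen.add(page)
            sequence.append(page)
    return sequence

S = [0, 292, 300, 304, 350, 326, 512, 510, 513, 512, 515, 513,
     292, 319, 350, 517, 286]
print(first_occurrence_sequence(S))
# [0, 292, 300, 304, 350, 326, 512, 510, 513, 515, 319, 517, 286]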

7. Data Structures

The simple recovery strategies listed above are of great importance in data mining. In web usage mining, however, revisiting and backtracking are two important characteristics of user behaviour, and they show up as pages that are accessed more than once during a session. Pages accessed more than once lead to different access directions, which gives the behaviour a tree structure. Tree-structured behaviours not only depict the visiting patterns, but also reveal some conceptual hierarchy in the site semantics.

In the tree-structured behaviour, each distinct page appears only once. To recover the access tree t from a session s, we use a page set named P to store the unique pages that already exist in t, and a pointer pr pointing to the last recovered node in t. The recovery strategy is:

1. Set t = NULL.
2. Read the first (entrance) page in s as the tree root r, let pr point to r, and insert this page into P.
3. Read a new page from s and check whether the same page exists in P.
   i. If it exists in P:
      4. Find the already existing node n in t and set pr to point to this node.
      5. Go to step 3.
   ii. If it does not exist in P:
      4. Insert this new page into P.
      5. Create a new node and insert it as a new child of pr.
      6. Let pr point to this new node.
      7. Go to step 3.
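A sketch of this recovery strategy in Python, using a simple node class (the class and variable names are my own); applied to the sample session it rebuilds the access tree rooted at page 0:

class Node:
    def __init__(self, page):
        self.page = page
        self.children = []

def recover_tree(session):
    """Rebuild the access tree from a session: a revisited page moves the
    pointer pr back to its existing node; a new page becomes a new child
    of the node pr currently points to."""
    root = Node(session[0])
    nodes = {session[0]: root}   # the page set P, kept as a page -> node map
    pr = root                    # pointer to the last recovered node
    for page in session[1:]:
        if page in nodes:        # revisit / backtrack
            pr = nodes[page]
        else:                    # new page
            child = Node(page)
            pr.children.append(child)
            nodes[page] = child
            pr = child
    return root

S = [0, 292, 300, 304, 350, 326, 512, 510, 513, 512, 515, 513,
     292, 319, 350, 517, 286]
root = recover_tree(S)
print([c.page for c in root.children])               # [292]
print([c.page for c in root.children[0].children])   # [300, 319]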

The tree-structured behaviour for the above session can be recovered with this strategy, as shown in the following figure.

Tree-structured behaviours can help to mine access patterns with tree structure, and also help to mine most-forward sequential patterns or deepest access paths.

Trees are also used to store strings for pattern-matching applications. Each character of the string is stored on the edge to the next node, and common prefixes of strings are shared. A problem in using a tree for many long strings is the space required. This is shown in the following figure, which shows a standard tree for the three strings SAD, DIAL and DIALOG.

The degree of each node is one, so more space is required. The extra node "S" denotes the termination of the string DIAL. The above tree can be drawn more compactly as follows:

[Figure: a standard tree for SAD, DIAL and DIALOG, with one character per edge: the branch S-A-D and the branch D-I-A-L-O-G, with an extra node "S" under L marking the end of DIAL.]

[Figure: the compressed form of the same tree, in which chains are collapsed so that the edges are labeled SAD, DIAL and OG, with the termination node "S" under DIAL.]
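A character trie of this kind is easy to sketch with nested dictionaries. The "$" key below plays the role of the termination node mentioned above (this builds the uncompressed tree only; collapsing chains into the compact form is left out):

def build_trie(words):
    """Character trie: each character labels the edge to a child node;
    '$' marks the end of a stored string."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}            # termination marker
    return root

trie = build_trie(["SAD", "DIAL", "DIALOG"])
# DIAL and DIALOG share the path D-I-A-L; the '$' under L marks the end
# of DIAL while the branch continues with O-G for DIALOG.
print(sorted(trie.keys()))                        # ['D', 'S']
print(sorted(trie["D"]["I"]["A"]["L"].keys()))    # ['$', 'O']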


8. Finding Unusual Itemsets

Consider the problem of finding sets of words that appear together "unusually often" on the Web, e.g., {"New", "York"} or {"Dutchess", "of", "York"}.

"Unusually often" can be defined in various ways, in order to capture the idea that the number of Web documents containing the set of words is much greater than what one would expect if words were sprinkled at random, each word with its own probability of occurrence in a document.

One appropriate measure is the entropy per word in the set. Formally, the interest of a set of words S is

interest(S) = (1/|S|) log2 [ P(S) / Πω∈S P(ω) ]

Note that we divide by the size of S because there are so many sets of a given size that some, by chance alone, will appear to be correlated.

For example, if words a, b, and c each appear in 1% of all documents, and S = {a; b; c} appears in 0.1% of documents, then the interestingness of S is

interest(S) = (1/3) log2 [ 0.001 / (0.01 × 0.01 × 0.01) ] = (1/3) log2 1000 ≈ 3.3

A set S can thus have a high value of interest even though some, or even all, of its immediate proper subsets are not interesting. In contrast, if S has high support, then all of its subsets have support at least as high. Note also that, with more than 10^8 different words appearing on the Web, it is not possible even to consider all pairs of words.
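The interest measure can be computed directly from document frequencies. A small sketch reproducing the worked example (1% per word, 0.1% for the whole set):

from math import log2

def interest(p_set, word_probs):
    """Interest of a word set S: (1/|S|) * log2( P(S) / product of P(w) )."""
    product = 1.0
    for p in word_probs:
        product *= p
    return log2(p_set / product) / len(word_probs)

# Words a, b, c each appear in 1% of documents; {a, b, c} in 0.1%.
print(round(interest(0.001, [0.01, 0.01, 0.01]), 2))   # 3.32, i.e. about 3.3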

8.1 The DICE Engine

DICE (dynamic itemset counting engine) repeatedly visits the pages of the Web, in a round-robin fashion. At all times, it is counting occurrences of certain sets of words, and of the individual words in those sets. The number of sets being counted is small enough that the counts fit in main memory.

From time to time, say every 5000 pages, DICE reconsiders the sets that it is counting. It throws away those sets that have the lowest interest, and replaces them with other sets.

The choice of new sets is based on the heavy edge property, which is an experimentally justified observation that those words that appear in a high-interest set are more likely than others to appear in other high-interest sets. Thus, when selecting new sets to start counting, DICE is biased in favor of words that already appear in high-interest sets. However, it does not rely on those words exclusively, or else it could never find high-interest sets composed of the many words it has never looked at. Some (but not all) of the constructions that DICE uses to create new sets are:

1. Two random words. This is the only rule that is independent of the heavy edge assumption, and helps new words get into the pool.

2. A word in one of the interesting sets and one random word.

3. Two words from two different interesting pairs.

4. The union of two interesting sets whose intersection is of size 2 or more.

5. { a; b; c} if all of {a; b}, {a; c}, and {b; c} are found to be interesting.

Of course, there are generally too many options to do all of the above in all possible ways, so a random selection among options, giving some choices to each of the rules, is used.

8.2 Books and Authors

The general idea is to search the Web for facts of a given type, typically facts that might form the tuples of a relation such as Books(title; author). The computation is suggested by the following figure.

[Figure: a loop in which sample data is used to find data occurrences, patterns are built from those occurrences (Find Pattern → current patterns), and the current patterns are used to find more data (Find Data → current data), over and over.]

1. Start with a sample of the tuples one would like to find. In the example cited, five examples of book titles and their authors were used.

2. Given a set of known examples, find where that data appears on the Web. If a pattern is found that identifies several examples of known tuples, and is sufficiently specific that it is unlikely to identify too much, then accept this pattern.



3. Given a set of accepted patterns, find the data that appears in these patterns, and add it to the set of known data.

4. Repeat steps (2) and (3) several times. In the example cited, four rounds were used, leading to 15,000 tuples; about 95% were true title-author pairs.

But the question is what is a pattern? Let us answer it now.

8.3 What is a pattern?

The notion suggested consists of five elements:

1. The order: i.e., whether the title appears prior to the author in the text, or vice-versa. In the more general case, where tuples have more than 2 components, the order would be a permutation of the components.

2. The URL prefix.

3. The prefix of text, just prior to the first of the title or author.

4. The middle: text appearing between the two data elements.

5. The suffix of text following the second of the two data elements. Both the prefix and suffix were limited to 10 characters.

For example, a possible pattern might consist of the following:

1. Order: title then author.
2. URL prefix: www.University_Mumbai.edu/class/
3. Prefix, middle, and suffix of the following form:

<LI><I>title</I> by author<P>

Here the prefix is <LI><I>, the middle is </I> by (including the blank after "by"), and the suffix is <P>. The title is whatever appears between the prefix and middle; the author is whatever appears between the middle and suffix.

To focus on patterns that are likely to be accurate, one can use several constraints on patterns, as follows:

o Let the specificity of a pattern be the product of the lengths of the prefix, middle, suffix, and URL prefix. Roughly, the specificity measures how likely we are to find the pattern; the higher the specificity, the fewer occurrences we expect.

o A pattern must then meet two conditions to be accepted:

1. There must be at least 2 known data items that appear in this pattern.


2. The product of the specificity of the pattern and the number of occurrences of data items in the pattern must exceed a certain threshold T (not specified).
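The two acceptance conditions translate directly into code. In the sketch below the dictionary layout, the example threshold T and the function names are assumptions made for illustration (the text does not specify T):

def specificity(pattern):
    """Specificity = product of the lengths of prefix, middle, suffix
    and URL prefix."""
    return (len(pattern["prefix"]) * len(pattern["middle"]) *
            len(pattern["suffix"]) * len(pattern["urlprefix"]))

def accept(pattern, known_items_matched, occurrences, T=1000):
    """Accept a pattern if it matches at least 2 known data items and
    specificity * number of occurrences exceeds the threshold T."""
    return (known_items_matched >= 2 and
            specificity(pattern) * occurrences > T)

p = {"prefix": "<LI><I>", "middle": "</I> by ", "suffix": "<P>",
     "urlprefix": "www.University_Mumbai.edu/class/"}
print(specificity(p), accept(p, known_items_matched=3, occurrences=5))   # 5376 True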

8.4 Data Occurrences

An occurrence of a tuple is associated with a pattern in which it occurs; i.e., the same title and author might appear in several different patterns. Thus, a data occurrence consists of:

1. The particular title and author.
2. The complete URL, not just the prefix as for a pattern.
3. The order, prefix, middle, and suffix of the pattern in which the title and author occurred.

8.5 Finding Data Occurrences Given Data

If we have some known title-author pairs, our first step in finding new patterns is to search the Web to see where these titles and authors occur. We assume that there is an index of the Web, so that given a word, we can find (pointers to) all the pages containing that word. The method used is essentially a-priori:

1. Find (pointers to) all those pages containing any known author. Since author names generally consist of 2 words, use the index for each first name and last name, and check that the occurrences are consecutive in the document.

2. Find (pointers to) all those pages containing any known title. Start by finding pages with each word of a title, and then check that the words appear in order on the page.

3. Intersect the sets of pages that have an author and a title on them. Only these pages need to be searched to find the patterns in which a known title-author pair is found. For the prefix and suffix, take the 10 surrounding characters, or fewer if there are not as many as 10.
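With an inverted index mapping each word to the set of pages containing it, the intersection step might look like the following sketch. The toy index, the page contents and all names here are invented placeholders:

def pages_with_phrase(index, pages, phrase):
    """Pages containing every word of the phrase, then checked for the
    words appearing consecutively in the page text."""
    words = phrase.lower().split()
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    return {p for p in candidates if " ".join(words) in pages[p].lower()}

def pages_with_pair(index, pages, title, author):
    """Pages that contain both a known title and a known author."""
    return pages_with_phrase(index, pages, title) & \
           pages_with_phrase(index, pages, author)

pages = {"u1": "<LI><I>Example Book</I> by Jane Doe<P>",
         "u2": "nothing relevant here"}
index = {w: {u for u, text in pages.items() if w in text.lower()}
         for w in "example book jane doe".split()}
print(pages_with_pair(index, pages, "Example Book", "Jane Doe"))   # {'u1'}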

8.6 Building Patterns from Data Occurrences

1. Group the data occurrences according to their order and middle. For example, one group in the "group-by" might correspond to the order "title-then-author" and the middle "</I> by ".

2. For each group, find the longest common prefix, suffix, and URL prefix.

3. If the specificity test for this pattern is met, then accept the pattern.

4. If the specificity test is not met, then try to split the group into two by extending the length of the URL prefix by one character, and repeat from step 2. If it is impossible to split the group (because there is only one URL) then we fail to produce a pattern from the group.


Consider the example where our group contains the three URLs:

www.University_Mumbai.edu/class/cs345/index.html
www.University_Mumbai.edu/class/cs145/intro.html
www.University_Mumbai.edu/class/cs140/readings.html

where cs345, cs145 and cs140 are the codes for three subjects, say advanced databases, Java and UML.

The common prefix is www.University_Mumbai.edu/class/cs. If we have to split the group, then the next character, 3 versus 1, breaks the group into two, with the data occurrences on the first page (there could be many such occurrences) going into one group, and the occurrences on the other two pages going into another.
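The longest-common-prefix computation of step 2 and the URL-prefix split used in this example can be sketched as follows (the helper names are my own):

import os

def common_prefix(strings):
    """Longest common prefix of a list of strings."""
    return os.path.commonprefix(list(strings))

def common_suffix(strings):
    """Longest common suffix, computed on the reversed strings."""
    return common_prefix([s[::-1] for s in strings])[::-1]

urls = ["www.University_Mumbai.edu/class/cs345/index.html",
        "www.University_Mumbai.edu/class/cs145/intro.html",
        "www.University_Mumbai.edu/class/cs140/readings.html"]
print(common_prefix(urls))   # www.University_Mumbai.edu/class/cs

# Splitting on the next character (3 versus 1) yields two sub-groups:
k = len(common_prefix(urls))
groups = {}
for u in urls:
    groups.setdefault(u[k], []).append(u)
print(sorted(groups))        # ['1', '3']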

8.7 Finding Occurrences Given Patterns

1. Find all URLs that match the URL prefix in at least one pattern.

2. For each of those pages, scan the text using a regular expression built from the pattern's prefix, middle, and suffix.

3. Extract from each match the title and author, according to the order specified in the pattern.
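Step 2 can be done with a regular expression assembled from the pattern's prefix, middle, and suffix. A sketch using the example pattern of Section 8.3 (the sample page text is an invented placeholder):

import re

def pattern_regex(prefix, middle, suffix):
    """Regex that captures the two data elements: one between prefix and
    middle, the other between middle and suffix (order: title then author)."""
    return re.compile(re.escape(prefix) + r"(.+?)" +
                      re.escape(middle) + r"(.+?)" + re.escape(suffix))

rx = pattern_regex("<LI><I>", "</I> by ", "<P>")
page = "<LI><I>Example Book</I> by Jane Doe<P> more text"
for title, author in rx.findall(page):
    print(title, "|", author)   # Example Book | Jane Doe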
