Web Mining Václav Snášel, Miloš Kudělka VSB-Technical University of Ostrava Czech Republic


Web Mining

Václav Snášel, Miloš Kudělka
VSB-Technical University of Ostrava

Czech Republic


Outline...

· Introduction
· Classification of recent approaches
· Web page segmentation
· Genre detection, Table extraction
· Opinion, News, and Discussion extraction
· Product details and Technical features extraction

Knowledge Engineering Group Praha 2009 2


Web content mining describes the discovery of useful information from Web contents. The goal of Web content mining is to improve finding information or filtering information for the users.

Web structure mining tries to discover the model underlying the link structures of the Web. This model can be used to categorize Web pages and can be useful to generate the relationship between Web sites.

Web usage mining tries to make sense of the data generated by the Web surfer's sessions or behaviors. Web usage mining mines the data derived from the interactions of the users.

Web mining


Web mining

· Web Content Mining: Text, Image, Audio, Video, Structured Records
· Web Structure Mining: Hyperlinks, Document Structure, Deep Structure, Category Structure, Community Structure
· Web Usage Mining: Web Server Logs, Application Level Logs, Application Server Logs, Web 2.0

Web mining - the application of data mining techniques to extract knowledge from Web content, structure, and usage.


Web Content Mining

Web Content Mining is the process of extracting useful information from the contents of Web documents. The content may consist of text, images, audio, video, or structured records such as lists and tables.

Web Content mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.


Web Structure Mining

Web structure mining tries to discover the model underlying the link structures of the Web. The model is based on the topology of the hyperlinks with or without the description of the links.

This model can be used to categorize Web pages and is useful to generate information such as the similarity and relationship between different Web sites. Web structure mining could be used to discover authority sites for the subjects (authorities) and overview sites for the subjects that point to many authorities (hubs).
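The slide names hubs and authorities but does not say how they are computed. As a minimal sketch, the code below assumes the well-known HITS iteration (a swapped-in technique, not necessarily the authors' choice) on a toy link graph; the page names are hypothetical.

```python
def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        # Normalise both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical toy graph: two overview pages pointing at content sites.
toy_links = {
    "portal": ["siteA", "siteB", "siteC"],
    "directory": ["siteA", "siteB"],
    "siteA": [],
    "siteB": ["siteA"],
    "siteC": [],
}
hub, auth = hits(toy_links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph the page with the most incoming links from good hubs ends up as the top authority, and the page pointing to the most authorities ends up as the top hub.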


Web Usage Mining

Web 2.0 enables individuals to create and share content on the Web. One of the important distinguishing features of Web 2.0 is the creation of communities of users, i.e., social networks with new demands on data management. In social content sites, both content and user interest are dynamic: people review and tag new content every day.


Web Usage Mining

Web usage mining, which aims to discover interesting and frequent user access patterns from web usage data, can be used to model past web access behavior of users.

The acquired model can then be used for analyzing and predicting the future user access behavior.
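One simple way to picture "model past behavior, then predict future behavior" is a first-order transition count over page sequences. This is only an illustrative sketch under that assumption; the slides do not name a specific model, and the session data and function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_transition_counts(sessions):
    """Count how often each page is followed by each other page."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, page):
    """Return the most frequent successor of `page`, or None if unseen."""
    followers = counts.get(page)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical sessions, e.g. reconstructed from web server logs.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/home", "/products", "/cart", "/checkout"],
]
counts = build_transition_counts(sessions)
next_after_products = predict_next(counts, "/products")
```

Richer models (longer histories, frequent sequence mining) follow the same two phases: learn from logged sessions, then query the model for a likely next access.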


Opportunities and Challenges

The Web offers an unprecedented opportunity and challenge to data mining:

· The amount of information on the Web is huge and easily accessible.
· The coverage of Web information is very wide and diverse. One can find information about almost anything.
· Information/data of almost all types exist on the Web, e.g., structured tables, texts, multimedia data, etc.
· Much of the Web information is semi-structured due to the nested structure of HTML code.
· Much of the Web information is linked. There are hyperlinks among pages within a site, and across different sites.


Opportunities and Challenges

The Web is noisy. A Web page typically contains a mixture of many kinds of information, e.g., main contents, advertisements, navigation panels, copyright notices, etc.

Above all, the Web is a virtual society. It is not only about data, information and services, but also about interactions among people, organizations and automatic systems, i.e., communities.

The Web is also about services. Many Web sites and pages enable people to perform operations with input parameters, i.e., they provide services.

The Web is dynamic. Information on the Web changes constantly. Keeping up with the changes and monitoring them are important issues.

Web content mining


A Web content mining algorithm is like a blindfolded person...

Algorithms for the detection of page type (Genre detection).

Algorithms for the detection of page parts (on a domain dependent or domain independent level).

Algorithms for the extraction of information content (Web information extraction).


Visual layout based Web page analysis...

The trend is evolving towards visual layout based Web page analysis...

A Web page is represented in various individual formats (VIPS, MDR, m-tree, zone-tree, ...).

The purpose is to find data records (or sub Web pages with a useful content).

The aim can be a comparison of two Web pages or sub Web pages.


Genre detection methods

The goal of Genre detection methods is to assign the Web page to a known type...

Methods are based on existing (manually identified) classifications.

In traditional genre classification, one page belongs to a single genre.

There is a need for multi-genre classification schemes. Known approaches are focused on home pages, e-shopping, academic Web pages, news, and blogs.


Tables

Tables are an important element for structuring related data...

Domain independent Named Web object

Tables are analyzed along four aspects:

· Physical - a description in terms of inter-cell relative location
· Structural - the topology of cells as an indicator of their navigational relationship
· Functional - the purpose of areas of the tables in terms of data access
· Semantic - the meaning of text in the table and the relationship between the interpretation of cell content


Opinion extraction

Opinion extraction is about how to summarize customer opinions on product features...

Domain dependent Named Web objects

The main sources for analysis:

· Opinions of customers on product Web pages
· Discussions on thematic forums
· Individual reviews in the form of articles


Product details

Product details and features usually contain a picture, product name, price information...

Domain dependent Named Web object

· The main source for analysis is a Product page
· How to extract information, save it into a database, and then use it
· How to extract product technical features (the aim is to be able to compare similar products)


DynamicMining

Current methods of Web content mining focus on analyzing static web sites and cannot deal with constantly changing web sites, such as news sites. Dynamic Mining proposes a method for mining online news sites. This method applies dynamic schemes for exploring these web sites and extracting news reports, and uses domain independent statistical analysis for trend analysis. The overall method is an application of web mining that goes beyond straightforward news analysis: it tries to understand current interests of society and to measure the social importance of ongoing events.


Web page is like a family house


Each of its sections has its significance, determined by the function which it serves.

Every section can be named so that everybody imagines the same thing under that name.

Three tasks for a blindfolded person:

· what sections the building contains
· the purpose of the building
· furnishings of individual sections


Usability principles are a good foundation

Web usability received renewed attention as many early e-commerce Web sites started failing in 2000 (Wikipedia).

User Centered Design - corresponds to what users are used to and does not make the user change their way of working.

How does the visual organization of Web pages help to guide visual exploration during information retrieval?


Usability principles are a good foundation

Eye-tracking conclusion:

It must be compatible with the set of the designer's intentions.

It must be compatible with the set of the user's potentials


Task for web mining

[Figure: diagram relating User, Technology, and Intent. Labels: intent, concept, user, interaction, technology, implementation; application domain, user interaction, technology solution, user interface, software design.]


Patterns, pattern languages


Patterns in Architecture

Does this room make you feel happy? Why?

· Light (direction)
· Proportions
· Symmetry
· Furniture
· And more…


Patterns - LIGHT ON TWO SIDES OF EVERY ROOM

Architecture, Design Patterns, …

“When they have a choice, people will always gravitate to those rooms which have light on two sides, and leave the rooms which are lit only from one side unused and empty.”

(Alexander et al., 1977 pattern 159)


Patterns - LIGHT ON TWO SIDES OF EVERY ROOM

The solution is then included:

“Locate each room so that it has outdoor space outside it on at least two sides, and then place windows in these outdoor walls so that natural light falls into every room from more than one direction.”

(Alexander et al., 1977 pattern 159)

Patterns

Architecture, Design Patterns, …

In essence, patterns are structural and behavioral features that improve the applicability of a software architecture, a user interface, a Web site, or something else in a given domain.

J. Tidwell, Designing Interfaces: Patterns for Effective Interaction Design, O'Reilly Media, Inc., 2006.

What is a Design Pattern?

In short, a solution for a typical problem.

A description of a recurrent problem and of the core of possible solutions.


Why do we need Patterns?

· Reusing design knowledge
  Problems are not always unique. Reusing existing experience might be useful.
  Patterns give us hints about “where to look for problems”.
· Establish common terminology
  Easier to say, “We need a Facade here”.
· Provide a higher-level perspective
  Frees us from dealing with the details too early.
· In short, it’s a “reference”

History of Design Patterns

· 1970s, Architecture: Christopher Alexander, The Timeless Way of Building; A Pattern Language: Towns, Buildings, Construction
· 1995, Object-Oriented Software Design: Gang of Four (GoF), Design Patterns: Elements of Reusable Object-Oriented Software
· 2007, Other Areas (HCI, Organizational Behavior, Education, Concurrent Programming…): many authors

Structure of a design pattern*

· Pattern Name and Classification
· Intent: a short statement about what the pattern does
· Motivation: a scenario that illustrates where the pattern would be useful
· Applicability: situations where the pattern can be used

*According to GoF

Structure of a design pattern

· Structure: a graphical representation of the pattern
· Participants: the classes and objects participating in the pattern
· Collaborations: how do the participants interact to carry out their responsibilities?
· Consequences: what are the pros and cons of using the pattern?
· Implementation: hints and techniques for implementing the pattern


Patterns – Catalogue 1/1

There are catalogs of patterns. For example:

Tidwell, Designing Interfaces: Patterns for Effective Interaction Design. O'Reilly Media, Inc., 2006.

For pattern description we use the structure originated by Kent Beck: http://c2.com/cgi/wiki?BeckForm


Patterns catalogue 1/2

Context of design (http://www.welie.com/patterns/index.php)

Site types: Web-based Application, Artist Site, Automotive Site, Branded Promotion Site, Campaign Site, E-commerce Site, Community Site, Corporate Site, Multinational Site, Museum Site, Personalized 'My' Site, News Site, Portal Site, Travel Site

Experiences: Community Building, Information Management, Fun, Information Seeking, Learning, Assistance, Shopping, Story Telling

Page types: Article Page, Blog Page, Case Study, Contact Page, Event Calendar, Forum, Guest Book, Help Page, Homepage, Newsletter, Printer-friendly Page, Product Page, Tutorial


Pattern

· Title: An appropriate pattern name.
· Problem: A single brief sentence describing the problem which the pattern solves.
· Context: A list of situations where the pattern occurs.
· Forces: A list of details which influence the pattern identification. We focus especially on features useful for automatic detection.
· Solution: Description of the solution with examples.


Patterns - Example

http://www.welie.com/patterns/index.php

Pattern – (Toy) Example

    <?xml version="1.0" encoding="utf-8" ?>
    <PATTERN>
      <ID>0</ID>
      <NAME>Information about price</NAME>
      <PROXIMITY>8</PROXIMITY>
      <BASE_WEIGHT>1</BASE_WEIGHT>
      <PROMINENCE_WEIGHT>1</PROMINENCE_WEIGHT>
      <COMPOSITE_WEIGHT>2</COMPOSITE_WEIGHT>
      <RECURRENT_WEIGHT>0,25</RECURRENT_WEIGHT>
      <TEXTUAL_WEIGHT>0</TEXTUAL_WEIGHT>
      <SYNERGY_WEIGHT>2</SYNERGY_WEIGHT>
      <PRIMARY_KEYWORDS>
        <WORD>EU</WORD>
        <WORD>Dollar</WORD>
        <WORD>Price</WORD>
      </PRIMARY_KEYWORDS>
      <SECONDARY_KEYWORDS>
        <WORD>Price</WORD>
        <WORD>Prices</WORD>
        <WORD>monetary value</WORD>
        <WORD>guarantee</WORD>
        <WORD>warranty</WORD>
        <WORD>guaranty</WORD>
        <WORD>goods</WORD>
        <WORD>commodity</WORD>
      </SECONDARY_KEYWORDS>
      <PRIMARY_ONTOLOGIES>
        <WORD>&lt;price_token&gt;</WORD>
      </PRIMARY_ONTOLOGIES>
      <SECONDARY_ONTOLOGIES>
        <WORD>&lt;percentage_token&gt;</WORD>
      </SECONDARY_ONTOLOGIES>
    </PATTERN>

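A pattern definition like the toy example can be loaded with a few lines of standard-library code. The sketch below is an assumption about how such a file might be read, not the authors' loader: it uses an abbreviated copy of the pattern, assumes the ontology tokens are stored escaped (`&lt;price_token&gt;`), and accepts the decimal comma used in `0,25`. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the toy pattern; ontology tokens are assumed escaped.
PATTERN_XML = """<PATTERN>
  <ID>0</ID>
  <NAME>Information about price</NAME>
  <PROXIMITY>8</PROXIMITY>
  <RECURRENT_WEIGHT>0,25</RECURRENT_WEIGHT>
  <PRIMARY_KEYWORDS>
    <WORD>EU</WORD>
    <WORD>Dollar</WORD>
    <WORD>Price</WORD>
  </PRIMARY_KEYWORDS>
  <PRIMARY_ONTOLOGIES>
    <WORD>&lt;price_token&gt;</WORD>
  </PRIMARY_ONTOLOGIES>
</PATTERN>"""


def parse_number(text):
    # The slide writes decimals with a comma ("0,25"); accept both forms.
    return float(text.strip().replace(",", "."))


def words(root, tag):
    # Collect the trimmed text of every WORD element under the given group.
    return [w.text.strip() for w in root.findall(f"{tag}/WORD")]


def load_pattern(xml_text):
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("NAME"),
        "proximity": int(root.findtext("PROXIMITY")),
        "recurrent_weight": parse_number(root.findtext("RECURRENT_WEIGHT")),
        "primary_keywords": words(root, "PRIMARY_KEYWORDS"),
        "primary_ontologies": words(root, "PRIMARY_ONTOLOGIES"),
    }


pattern = load_pattern(PATTERN_XML)
```

The parser unescapes the entities, so the ontology entry comes back as the literal token `<price_token>`.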

Gestalt principles


Gestalt principles

Gestalt is also known as the "Law of Simplicity" or the "Law of Prägnanz" (the entire figure or configuration), which states that every stimulus is perceived in its simplest form.

Gestalt theorists followed the basic principle that the whole is greater than the sum of its parts.

Gestalt principles

In other words, the whole (a picture, a car) carries a different and altogether greater meaning than its individual components (paint, canvas, brush; or tire, paint, metal, respectively). In viewing the "whole," a cognitive process takes place – the mind makes a leap from comprehending the parts to realizing the whole.

Gestalt principles can provide a theoretical base...

Proximity: If things are close together viewers will associate them with one another.

Similarity: Similar elements tend to be perceived as a group.

Continuity: Our eyes want to see continuous lines and curves formed by the alignment of smaller elements.

Closure: Even when elements are not completely enclosed in a space, they tend to be perceived as a group if enough information is provided.


Gestalt principles

Visual systems usually implement the four basic principles:

· Proximity - Similar information is placed close together.
· Similarity - Similar things have similar meaning.
· Continuity - Pieces of information follow one another.
· Closure - Related information is grouped.

(proximity, similarity, continuity, closure)

Gestalt principles – Web page

We want to buy a mobile phone


Patterns - Gestalt principles

Following the Gestalt principles, we can regard a page pattern as a group of characteristic technical elements (which are based on GUI patterns such as lists, tables, and continuous texts) and a group of domain-specific elements for the domain we are involved in (typical keywords related to the given pattern and other entities such as price, date, percentage, etc.).

The key aspect of the pattern manifestation is that the introduced elements are close to each other.
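To make "keywords and other entities such as price or percentage" concrete, the following sketch tags tokens of plain page text with either a matched keyword or an entity label. It is an illustration only: the regular expressions, the label names (borrowed from the toy pattern's `price_token` and `percentage_token`), and the function name are assumptions, not the recognisers used by the authors.

```python
import re

# Hypothetical recognisers; real entity detection would be more thorough.
ENTITY_PATTERNS = [
    ("percentage_token", re.compile(r"^\d+(?:[.,]\d+)?%$")),
    ("price_token", re.compile(
        r"^(?:[$€£]\d+(?:[.,]\d+)?|\d+(?:[.,]\d+)?(?:[$€£]|EUR|USD))$")),
]


def tag_entities(text, keywords):
    """Return (token position, label) pairs for keywords and entities."""
    lowered = {k.lower() for k in keywords}
    tagged = []
    for position, token in enumerate(text.split()):
        word = token.strip(".,;:!?()")
        if word.lower() in lowered:
            tagged.append((position, word.lower()))
            continue
        for label, regex in ENTITY_PATTERNS:
            if regex.match(word):
                tagged.append((position, label))
                break
    return tagged


tagged = tag_entities("Special price: $199.00 today, save 15% now.", ["price"])
```

The token positions are what a closeness measure can then work with: here the keyword, the price, and the percentage all fall within a few tokens of each other.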


Object retrieval

Object-level Information Extraction – A Web object is constructed by collecting related data records extracted from multiple Web sources. The sources holding object information can be HTML pages, documents put on the Web (e.g. PDF, PS, Word, and other formats), and deep content hidden in Web databases. (See the previous figure.)

Object retrieval

There is already extensive research exploring algorithms for the extraction of objects from Web sources.

Object Identification and Integration – Each extracted instance of a Web object needs to be mapped to a real-world object and stored in the Web data warehouse. To do so, we need techniques to integrate information about the same object and disambiguate different objects.

Motivation

Web object retrieval – After information extraction and integration, we should provide a retrieval mechanism to satisfy users’ information needs. Basically, the retrieval should be conducted at the object level, which means that the extracted objects should be indexed, ranked and clustered against user queries.


Algorithm

1. For proximity, we defined a method for measuring the closeness (distance) between entities in the searched text segments.

2. For similarity, we defined a method for measuring the similarity of two searched text segments (for the Discussion pattern we are able to identify the repetition of replies). We compare the trees that represent the text segments.

3. For continuity, we defined a method for finding out whether two or more found text segments together make up an instance of a pattern. We assume that two or more text segments with little similarity (trees of entities from one pattern) match together.

4. For closure, we defined a method for computing the weight of a single searched text segment. In essence we used two criteria: we rated the shape of the segment tree (particularly the ratio of height to entity count) and the quantity of all words and paragraphs in the text segment. The proximity rate also contributes to the overall weight.


Algorithm – membership computation

    FOR each page entity in all page entities
        IF page entity is pattern entity THEN
            IF does not exist snippet to add page entity to THEN
                create new snippet in list of snippets
            END IF
            add page entity to snippet
        END IF
    END FOR
    FOR each snippet in list of snippets
        compute proximity of snippet
        compute closure of snippet
        compute value(proximity, closure) of snippet
        IF value is not good enough THEN
            remove snippet from list of snippets
        END IF
    END FOR
    compute similarity of list of snippets
    compute continuity of list of snippets
    compute value(similarity, continuity) of pattern
    RETURN value
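The pseudocode can be turned into a small runnable sketch. Everything numeric here is an assumption made for illustration: the slides do not give the formulas, so `proximity` is taken as entity density over the snippet span, `closure` as the share of pattern labels present, the snippet value as their product, and the final similarity/continuity step is reduced to taking the best surviving snippet. Entities are `(token position, label)` pairs.

```python
def group_into_snippets(entities, max_gap):
    """Start a new snippet when the gap to the previous entity exceeds max_gap."""
    snippets = []
    for position, label in sorted(entities):
        if not snippets or position - snippets[-1][-1][0] > max_gap:
            snippets.append([])
        snippets[-1].append((position, label))
    return snippets


def proximity(snippet):
    """Assumed measure: entities per token of span, in (0, 1]."""
    span = snippet[-1][0] - snippet[0][0] + 1
    return len(snippet) / span


def closure(snippet, pattern_labels):
    """Assumed measure: share of the pattern's labels found in the snippet."""
    found = {label for _, label in snippet} & pattern_labels
    return len(found) / len(pattern_labels)


def membership(page_entities, pattern_labels, max_gap=8, threshold=0.25):
    # First loop of the pseudocode: keep pattern entities, build snippets.
    relevant = [e for e in page_entities if e[1] in pattern_labels]
    snippets = group_into_snippets(relevant, max_gap)
    # Second loop: score each snippet and drop the weak ones.
    scored = []
    for snippet in snippets:
        value = proximity(snippet) * closure(snippet, pattern_labels)
        if value >= threshold:
            scored.append(value)
    # Placeholder for the similarity/continuity step: best snippet wins.
    return max(scored) if scored else 0.0


labels = {"price", "price_token", "percentage_token"}
entities = [(1, "price"), (2, "price_token"), (5, "percentage_token"),
            (40, "price"), (12, "unrelated")]
value = membership(entities, labels)
```

The default `max_gap=8` mirrors the `PROXIMITY` value in the toy pattern; whether the original system uses that field this way is not stated in the slides.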


Experiments

We collected 31,738 web pages from the Google search engine using queries on products. The analysis showed that no patterns were extracted from 11,038 of these pages.

More than 200 product searches were tested (cellular phones, computers, components and peripherals, electronics, sports equipment, cosmetics, books, CDs, DVDs, etc.).


Experiments - Re-ranking

[Figure: re-ranking chart comparing the Standard and Patterns series; horizontal axis from 5 to 30, vertical axis from 0 to 5.]

Experiments - Retrieval Accuracy

$$RA = \frac{\text{relevant pages retrieved in top } T \text{ returns}}{T}$$
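Read as the number of relevant pages among the top T returns divided by T, the measure is a one-liner. The ranking and relevance judgments below are made-up illustration data, and the function name is hypothetical.

```python
def retrieval_accuracy(ranked_pages, relevant_pages, t):
    """RA at cut-off T: relevant pages among the top T returns, divided by T."""
    if t <= 0:
        raise ValueError("t must be positive")
    top = ranked_pages[:t]
    return sum(1 for page in top if page in relevant_pages) / t


# Hypothetical ranking and relevance judgments.
ranking = ["p1", "p2", "p3", "p4", "p5"]
relevant = {"p1", "p3", "p9"}
ra_at_5 = retrieval_accuracy(ranking, relevant, 5)
ra_at_2 = retrieval_accuracy(ranking, relevant, 2)
```

Comparing this value for the standard ranking and the pattern-based re-ranking at several cut-offs T gives curves of the kind plotted in the re-ranking experiment.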


Implementation


Pattrio: Inspired by Patterns and Objects...

· Web design patterns and pattern languages
· Named Web object as a Web design pattern projection
· Catalog of Named Web objects
· Detection of Named Web objects
· Use of Named Web objects


Named Web objects can provide a simple description for SERP...


Patterns are intended for developers and they do not contain technical details...

A design pattern is a textual description of how to solve an existing problem.

Technical details are important for the recognition by the user (and by the algorithm).

A different description has to be used (Pattrio catalog).

The Named object is a projection of a Web design pattern (or Genre) to a concrete part of Web page.


Pattern Extraction

In our experiment we searched for web pages in sets of thirty using a very precise query. The query contained a product identification (e.g. Nokia 9300) and a group of six words from the pattern dictionary, connected by OR, to make the query more accurate. From the retrieved pages our algorithm extracted nine patterns (Price Information, Purchasing Possibility, Special Offer, Annuity Selling, Product Information, Discussion, Review, Sign-on Possibility, Advertising). To evaluate each pattern we used seven criteria. Each criterion was rated on a three-degree scale, so in total the evaluation is expressed using 21 Boolean values.
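One plausible reading of "seven criteria on a three-degree scale expressed as 21 Boolean values" is a one-hot encoding: each criterion contributes three flags, exactly one of which is set. The sketch below assumes that reading; the slides do not spell out the encoding, and the ratings are invented.

```python
def encode_ratings(ratings, levels=3):
    """One-hot encode ratings (each in 0..levels-1) into a flat Boolean list."""
    vector = []
    for rating in ratings:
        if not 0 <= rating < levels:
            raise ValueError(f"rating {rating} outside 0..{levels - 1}")
        vector.extend(level == rating for level in range(levels))
    return vector


# Seven hypothetical criterion ratings for one pattern on one page.
ratings = [0, 2, 1, 1, 0, 2, 2]
vector = encode_ratings(ratings)
```

A fixed-length Boolean vector of this kind is a convenient input for clustering or map-based visualisation of pages.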


SOM — web pages from selling product domain


Web Communities Defined by Web Page Content


In this part we are looking for a relationship between the intent of Web pages, their architecture and the communities who take part in their usage and creation.

For us, a Web page is an entity carrying information about these communities.

We present an experiment which demonstrates the feasibility of our approach.


Typical Web site aimed at information sharing


Typical university Web site


Vision

The crucial aspect of our approach is that we do not need to analyze the page’s HTML code. Our algorithm is based on an analysis of the plain text of the page. For page evaluation we do not use any meta-information about the page (such as title, hyperlinks, meta-tags and so on). We also confirmed that the key characteristics of web patterns are independent of the language environment: we tested our method in English, Arabic, and Czech. The only thing we had to change was the pattern dictionaries.


Thank you