Detecting and Representing Relevant Page-Level Web Deltas

Sanjay Kumar Madria
Department of Computer Science
Purdue University
West Lafayette, IN 47907
[email protected]


Current Situation of W3

The Web allows information to change at any time and in any way

Two forms of changes
  • Existence
  • Structure and content modification

A change leaves no trace of the previous document

The new version replaces its antecedents, leaving no trace

Problems of Change Management

Problem: detecting, representing and querying these changes

The problem is challenging
  • Typical database approaches to detecting changes, based on triggering mechanisms, are not usable
  • Information sources typically do not keep track of historical information in a format that is accessible to the outside user

Motivating Example

Assume that there is a web site at www.panacea.gov

It provides information related to drugs used for various diseases

Suppose, on 15th January, a user wishes to find out periodically (every 30 days)
  • information related to side effects and uses of drugs used for various diseases, and
  • changes to this information at the page level compared to its previous version

Structure of www.panacea.gov

The web page at www.panacea.gov contains a list of diseases

Each link for a particular disease points to a web page containing a list of drugs used for prevention and cure of the disease

Hyperlinks associated with each drug point to documents containing a list of various issues related to that drug (description, manufacturers, clinical pharmacology, uses, side effects, etc.)

From the hyperlinks associated with each issue, one can retrieve details of these issues for a particular drug

A Snapshot as on 15th Jan

[Figure: link structure of www.panacea.gov on 15th January. Disease nodes: AIDS, Cancer, Heart disease, Diabetes, Impotence, Alzheimer’s Disease. Drug nodes: Indavir, Ritonavir, Niacin, Hirudin, Vasomax, Caverject, Ibuprofen. Leaf pages: Side effects, Uses.]

Some Changes: 25th January

Links related to Diabetes are removed

A new link containing information related to Parkinson’s Disease is added

Information related to issues, side effects and uses of various drugs for Cancer is also modified

A Partial Snapshot as on 25th Jan

[Figure: partial link structure of www.panacea.gov on 25th January. Nodes shown: Parkinson’s Disease, Cancer, Diabetes, Tolcapone, Side effects, Uses.]

Some Changes: 30th January

Links related to Impotence are modified
  • Previously provided by www.pfizer.com
  • Now provided by www.panacea.gov

The inter-linked structure of the Web pages related to Caverject is also modified

Information about Viagra, a new drug for Impotence, is added

A Partial Snapshot as on 30th Jan

[Figure: partial link structure of www.panacea.gov on 30th January. Nodes shown: Impotence, Vasomax, Caverject, Viagra, Side effects, Uses.]

Some Changes: 8th February

The link structure of Heart Disease is modified
  • The label Heart Disease is changed to Heart Disorder
  • The content of the pages dealing with side effects and uses of Hirudin is updated
  • The inter-linked document structure of Niacin is modified

Web pages related to the side effects and uses of Ibuprofen (Alzheimer’s Disease) are removed

On 8th February

[Figure: partial link structure of www.panacea.gov on 8th February. Nodes shown: Heart disorder, Alzheimer’s Disease, Niacin, Hirudin, Side effects, Uses.]

A Snapshot as on 15th Feb

[Figure: link structure of www.panacea.gov on 15th February. Disease nodes: AIDS, Cancer, Heart disease, Impotence, Alzheimer’s Disease, Parkinson’s Disease. Drug nodes: Indavir, Ritonavir, Niacin, Hirudin, Vasomax, Caverject, Viagra. Leaf pages: Side effects, Uses.]

Objectives

Web deltas: changes to web information

Detecting and representing relevant page-level web deltas
  • Changes that are relevant to the user’s query, not arbitrary changes or web deltas
  • Restricted to the page level

Detect those documents which are
  • added to the site
  • deleted from the site
  • modified in content or structure

Determine how these delta documents are related to one another and to other documents relevant to the user’s query

The WHOWEDA Project

WHOWEDA: A WareHouse of WEb DAta

Goal: to design and implement a web warehousing system capable of effective extraction, management, and processing of information on the World Wide Web

Data model: WHOM (WareHouse Object Model)

Overview of WHOM

Our web warehouse can be conceived of as a collection of web tables

A set of web tuples and a set of web schemas represents a web table

A web tuple is a directed graph containing nodes and links and satisfies a web schema

Nodes and links contain content, metadata and structural information associated with Web documents and hyperlinks
  • Tree representation

Web algebra containing web operators to manipulate web tables
  • Global Coupling, Web Select, Web Join etc.
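The WHOM notions above can be pictured with a few small Python types. This is only a sketch under stated assumptions: the class and field names (`Node`, `Link`, `WebTuple`, `WebTable`, `last_modified`) and the sample URLs and dates are hypothetical, web schemas are omitted, and nothing here is taken from the WHOWEDA implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A Web document with a little metadata (hypothetical fields)."""
    url: str
    last_modified: str
    title: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink between two documents, identified by their URLs."""
    source: str
    target: str
    label: str = ""


@dataclass
class WebTuple:
    """A directed graph of nodes and links."""
    nodes: dict = field(default_factory=dict)   # URL -> Node
    links: list = field(default_factory=list)   # list of Link

    def add_link(self, src: Node, dst: Node, label: str = "") -> None:
        # Register both endpoints, then record the directed edge.
        self.nodes[src.url] = src
        self.nodes[dst.url] = dst
        self.links.append(Link(src.url, dst.url, label))


@dataclass
class WebTable:
    """A named collection of web tuples (web schemas left out of this sketch)."""
    name: str
    tuples: list = field(default_factory=list)


# Illustrative data only: the sub-page URL and the dates are made up.
a0 = Node("www.panacea.gov", "15-Jan", "Panacea")
b0 = Node("www.panacea.gov/aids", "15-Jan", "AIDS drug list")
t = WebTuple()
t.add_link(a0, b0, "AIDS")
drugs = WebTable("Drugs", [t])
```

Keying nodes by URL keeps a document that is reached over several links as a single node of the graph.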

Overview of our approach

Step 1: Two snapshots of old and new relevant data are coupled from the Web using the global web coupling operation and materialized in two web tables

Step 2: Web join, left outer web join and right outer web join operations are performed on these two web tables
  • The result is the joined, left outer joined and right outer joined web tables

Step 3: Delta web tables containing different types of web deltas are generated from these resultant web tables

The following slides elaborate on these steps


Step 1: Retrieving snapshots of Web data using Global Web Coupling

Web Query Specification

Features:
  • Draw a web query as a directed connected acyclic graph (also called a coupling query)
  • The query can also be specified in text form
  • Specify search conditions on the nodes and edges of the graph
  • Performed by the global web coupling operator

Coupling Query

Set of node variables Xn
  • Each variable represents a set of Web documents

Set of link variables Xl
  • Each variable represents a set of hyperlinks

Set of connectivities C in DNF defined over node and link variables
  • Specify the hyperlink structure of the documents

Set of predicates P defined over some of the node and link variables
  • Specify metadata, content or structural conditions

Set of coupling query predicates Q
  • Conditions on execution of the query

Example

Suppose, on 15th January, a user wishes to find out periodically (every 30 days) from the web site at www.panacea.gov information related to side effects and uses of drugs used for various diseases

The result of the query is stored in the form of a web table

Coupling Query

Xn = {a, b, d, k}

Xl = { - }

P = {p1, p2, p3, p4}

p1(a) = METADATA:: a[url] EQUALS “www.panacea.gov”

p2(b) = CONTENT:: b[html.body.title] NON-ATTR-CONT “drug list”

p3(k) = CONTENT:: k[html.body.title] NON-ATTR-CONT “uses”

p4(d) = CONTENT:: d[html.body.title] NON-ATTR-CONT “side effects”

Coupling Query

C = k1 AND k2 AND k3

k1 = a < - > b

k2 = b < -{1, 6} > d

k3 = b < -{1, 3} > k

Q = {q1}

q1(b) = COUPLING_QUERY:: polling_frequency EQUALS “30 days”
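The example query can be written down as plain data, which may help when reading the components. The following is a rough sketch, not the WHOWEDA query syntax: the dictionary layout, the `Connectivity` class and the helper names are hypothetical. It also rests on two assumptions: that NON-ATTR-CONT means "the element's text contains the phrase" (matched case-insensitively here), and that `{1, 6}` means a path of between one and six links.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Connectivity:
    """source reaches target over min_links..max_links hyperlinks (assumed reading)."""
    source: str
    target: str
    min_links: int = 1
    max_links: int = 1


def url_equals(url: str) -> Callable[[dict], bool]:
    # Stand-in for METADATA:: x[url] EQUALS "..."
    return lambda doc: doc.get("url") == url


def title_contains(phrase: str) -> Callable[[dict], bool]:
    # Stand-in for CONTENT:: x[html.body.title] NON-ATTR-CONT "..."
    return lambda doc: phrase.lower() in doc.get("title", "").lower()


coupling_query = {
    "node_variables": ["a", "b", "d", "k"],
    "link_variables": [],                      # Xl = { - }
    "predicates": {
        "a": url_equals("www.panacea.gov"),    # p1
        "b": title_contains("drug list"),      # p2
        "k": title_contains("uses"),           # p3
        "d": title_contains("side effects"),   # p4
    },
    "connectivities": [                        # k1 AND k2 AND k3
        Connectivity("a", "b"),
        Connectivity("b", "d", 1, 6),
        Connectivity("b", "k", 1, 3),
    ],
    "polling_frequency_days": 30,              # q1
}


def satisfies(variable: str, doc: dict) -> bool:
    """True if doc meets the predicate bound to the node variable, if any."""
    predicate = coupling_query["predicates"].get(variable)
    return predicate is None or predicate(doc)
```

For instance, `satisfies("k", {"title": "Uses of Niacin"})` holds under these assumptions, while the same document does not satisfy the predicate on `d`.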

Pictorial Representation

[Figure: coupling query graph. Node a (www.panacea.gov) links to node b (“drug list”); b reaches d (“side effects”) with link range {1, 6} and k (“uses”) with link range {1, 3}.]

Web Table Drugs (15th Jan)

[Figure: web tuples of the Drugs table, listed by disease, drug and node identifiers]
  • AIDS, Indavir: a0, b0, u0, k0, d0
  • AIDS, Ritonavir: a0, b0, u1, k1, d1
  • Cancer, Beta Carotene: a0, b1, k2, d2
  • Alzheimer’s Disease, Ibuprofen: a0, b5, k12, d12
  • Diabetes, Albuterol: a0, b3, d4, k5
  • Impotence, Vasomax: a0, b4, u4, u5, u6, k6, d5
  • Impotence, Caverject: a0, b4, u7, u8, k7, d6
  • Heart Disease, Hirudin: a0, b2, u2, k3, d3

Web Table New Drugs (15th Feb)

[Figure: web tuples of the New Drugs table, listed by disease, drug and node identifiers]
  • AIDS, Indavir: a0, b0, u0, k0, d0
  • AIDS, Ritonavir: a0, b0, u1, k1, d1
  • Cancer, Beta Carotene: a0, b1, k2, d2
  • Heart Disorder, Hirudin: a0, b2, u2, k3, d3
  • Heart Disorder, Niacin: a0, b2, u3, k7, d7
  • Impotence, Caverject: a0, b4, u7, k7, d6
  • Impotence, Vasomax: a0, b4, u9, k8, d8
  • Parkinson’s Disease, Tolcapone: a0, b6, u10, k10, d10
  • Impotence, Viagra: a0, b4, u12, k9, d9


Step 2: Performing Web Join, Left and Right Outer Web Join

Web Join

Information composition operator

Combines two web tables into a single web table under certain conditions

Concatenates a web tuple of one web table with a web tuple of the other web table whenever there exist joinable nodes
  • Two nodes are joinable if they are identical
  • Two nodes are identical if the URL and last modification date of the nodes are the same

The joined web tuple is stored in a different web table
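The joinability rule lends itself to a short sketch. The version below is an illustration under simplifying assumptions, not the WHOWEDA operator: a web tuple is reduced to a list of node dictionaries (links are ignored), the function names are hypothetical, and the sub-page URLs and dates in the sample data are made up.

```python
def node_key(node):
    """Two nodes are identical when URL and last-modification date agree."""
    return (node["url"], node["last_modified"])


def joinable_nodes(tuple1, tuple2):
    """Keys of the nodes that occur, unchanged, in both web tuples."""
    return {node_key(n) for n in tuple1} & {node_key(n) for n in tuple2}


def web_join(table1, table2):
    """Pair every tuple of table1 with every tuple of table2 that shares a joinable node."""
    joined = []
    for t1 in table1:
        for t2 in table2:
            shared = joinable_nodes(t1, t2)
            if shared:
                joined.append({"left": t1, "right": t2, "joined_on": shared})
    return joined


# Illustrative data: the root and the Cancer page changed, the Indavir page did not.
old_root = {"url": "www.panacea.gov", "last_modified": "15-Jan"}
new_root = {"url": "www.panacea.gov", "last_modified": "08-Feb"}
indavir = {"url": "www.panacea.gov/indavir", "last_modified": "10-Jan"}
old_cancer = {"url": "www.panacea.gov/cancer", "last_modified": "10-Jan"}
new_cancer = {"url": "www.panacea.gov/cancer", "last_modified": "25-Jan"}

drugs = [[old_root, indavir], [old_root, old_cancer]]
new_drugs = [[new_root, indavir], [new_root, new_cancer]]
joined = web_join(drugs, new_drugs)
```

With this data only the two Indavir tuples are concatenated, because the Indavir page is the sole node whose URL and date match across the two tables.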

Web Join

Join web tables Drugs and New Drugs

Nodes which have not undergone any changes are the joinable nodes in these two web tables

Content-modified nodes, new nodes and deleted nodes cannot be joinable nodes

Joined Web Table

[Figure: joined web tuples (1) to (6), each concatenating a tuple of Drugs with a tuple of New Drugs. Tuples (1) to (3) involve the AIDS tuples for Indavir and Ritonavir, (4) combines Heart Disorder/Niacin with Heart Disease/Hirudin, (5) combines the two versions of Impotence/Caverject, and (6) combines Heart Disease/Hirudin with Heart Disorder/Hirudin.]

Types of web tuples

Web tuples in which all the nodes are joinable
  • Result of joining two versions of web tuples that have remained unchanged during the transition

Web tuples in which some of the nodes are joinable
  • The remaining nodes are the result of insertion, deletion or modification operations

[Figure: joined web tuple (5), the two versions of Impotence/Caverject.]

Types of web tuples

Tuples in which
  • some of the nodes are joinable nodes,
  • some of the remaining nodes are the result of insertion, deletion or modification, and
  • the rest remained unchanged during the transition

[Figure: joined web tuple (3), combining AIDS/Indavir with AIDS/Ritonavir.]

Outer Web Join

Web tuples that do not participate in the web join process (dangling web tuples) are absent from the joined web table

Outer web join enables us to identify them
  • Left outer web join
  • Right outer web join
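Dangling tuples can be found with a few set operations. The sketch below assumes, as a simplification, that a web tuple is a list of node dictionaries and that the left table is the old snapshot and the right table the new one; the function names, sub-page URLs and dates are hypothetical.

```python
def node_keys(web_tuple):
    """(URL, last-modification date) pairs of the nodes in one web tuple."""
    return {(n["url"], n["last_modified"]) for n in web_tuple}


def dangling(table, other):
    """Tuples of `table` that share no joinable node with any tuple of `other`."""
    other_keys = set().union(*(node_keys(t) for t in other))
    return [t for t in table if not (node_keys(t) & other_keys)]


def left_outer_web_join(old_table, new_table):
    """Dangling tuples of the old (left) table."""
    return dangling(old_table, new_table)


def right_outer_web_join(old_table, new_table):
    """Dangling tuples of the new (right) table."""
    return dangling(new_table, old_table)


# Illustrative data: the root and the Cancer page changed, the Indavir page did not.
old_root = {"url": "www.panacea.gov", "last_modified": "15-Jan"}
new_root = {"url": "www.panacea.gov", "last_modified": "08-Feb"}
indavir = {"url": "www.panacea.gov/indavir", "last_modified": "10-Jan"}
old_cancer = {"url": "www.panacea.gov/cancer", "last_modified": "10-Jan"}
new_cancer = {"url": "www.panacea.gov/cancer", "last_modified": "25-Jan"}

drugs = [[old_root, indavir], [old_root, old_cancer]]
new_drugs = [[new_root, indavir], [new_root, new_cancer]]

left_outer = left_outer_web_join(drugs, new_drugs)
right_outer = right_outer_web_join(drugs, new_drugs)
```

Here the old Cancer tuple is the only dangling tuple on the left and the new Cancer tuple the only one on the right, since every node in them changed between the snapshots.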

Web Table New Drugs (15th Feb)

[Figure: the New Drugs web table of 15th February, repeated for reference. Tuples: AIDS/Indavir, AIDS/Ritonavir, Cancer/Beta Carotene, Heart Disorder/Hirudin, Heart Disorder/Niacin, Impotence/Caverject, Impotence/Vasomax, Parkinson’s Disease/Tolcapone, Impotence/Viagra.]

Right Outer Web Join

[Figure: web tuples of the right outer joined web table]
  • Cancer, Beta Carotene: a0, b1, k2, d2
  • Impotence, Vasomax: a0, b4, u9, k8, d8
  • Impotence, Viagra: a0, b4, u12, k9, d9
  • Parkinson’s Disease, Tolcapone: a0, b6, u10, k10, d10

Types of web tuples

New web tuples which are added during the transition
  • These tuples contain some new nodes, and the content of the remaining nodes has changed

Tuples in which all the nodes have undergone content modification

Tuples which existed before and in which some of the nodes are new and the content of the remaining nodes has changed

Web Table Drugs (15th Jan)

[Figure: the Drugs web table of 15th January, repeated for reference. Tuples: AIDS/Indavir, AIDS/Ritonavir, Cancer/Beta Carotene, Alzheimer’s Disease/Ibuprofen, Diabetes/Albuterol, Impotence/Vasomax, Impotence/Caverject, Heart Disease/Hirudin.]

Left Outer Web Join

[Figure: web tuples of the left outer joined web table]
  • Cancer, Beta Carotene: a0, b1, k2, d2
  • Alzheimer’s Disease, Ibuprofen: a0, b5, k12, d12
  • Diabetes, Albuterol: a0, b3, d4, k5
  • Impotence, Vasomax: a0, b4, u4, u5, u6, k6, d5

Types of web tuples

Web tuples which are deleted during the transition
  • These tuples do not occur in the new web table

Tuples in which all the nodes have undergone content modification

Tuples in which some of the nodes are deleted and the content of the remaining nodes has changed


Step 3: Generating Delta Web Tables

Overview

Input
  • Joined, left outer joined and right outer joined web tables

Output
  • Set of delta web tables

Delta Web Tables

Delta web tables are used to represent web deltas

They encapsulate the relevant changes that have occurred in the Web with respect to a user’s query

Three types
  • Delta+ web table: set of web tuples containing new nodes inserted during the transition
  • Delta- web table: set of web tuples containing nodes removed during the transition
  • Delta-M web table: set of web tuples representing the previous and current sets of modified nodes

Steps for Generation

Phase 1: Delta Nodes Identification Phase

Nodes which are added, deleted or modified during the transition are identified

Input: old and new versions of the web tables and a set of joinable nodes from the joined web table

Output: sets of nodes which are added, deleted or modified during the transition
  • Nodes which exist in the new web table but not in the old web table are the new nodes
  • Nodes which exist in the old web table but not in the new one are the deleted nodes
  • Nodes which exist in both web tables but are not joinable are the nodes which have undergone content modification
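The three rules of Phase 1 map directly onto set operations. The function below is a minimal sketch, assuming that each node is identified by its URL and that the URL sets of both tables and of the joinable nodes have already been collected; the function name is hypothetical, and the node identifiers in the sample call are stand-ins for URLs, not results from the slides.

```python
def identify_delta_nodes(old_urls, new_urls, joinable_urls):
    """Split nodes into added, deleted and content-modified sets."""
    added = new_urls - old_urls                         # only in the new table
    deleted = old_urls - new_urls                       # only in the old table
    modified = (old_urls & new_urls) - joinable_urls    # in both, but not joinable
    return added, deleted, modified


# Illustrative identifiers only.
old_urls = {"a0", "b1", "b3", "u0"}
new_urls = {"a0", "b1", "b6", "u0"}
joinable_urls = {"u0"}

added, deleted, modified = identify_delta_nodes(old_urls, new_urls, joinable_urls)
```

In this sample `b6` is reported as new, `b3` as deleted, and `a0` and `b1` as modified, because they occur in both tables without being joinable.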

Steps for Generation

Phase 2: Delta Tuples Identification Phase

Determines how the delta nodes are related to one another and how they are associated with those nodes which have remained unchanged

We identify those tuples which contain nodes that are added, deleted or modified during the transition

Input: joined, left outer joined and right outer joined web tables, and the sets of delta nodes

Output: sets of web tuples represented by the Delta+, Delta- and Delta-M web tables

Phase 2 (Delta+ Web Table)

Scan the joined and right outer joined web tables to identify web tuples containing nodes which are inserted during the transition

New nodes can occur only in these tables
  • In the right outer joined table, if the remaining nodes in the tuple containing the new nodes are modified (hence not joinable)
  • In the joined web table, if some of the nodes in the tuple containing new nodes have remained unchanged and hence are joinable

These web tuples are stored in the Delta+ web table
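The scan described above might look like the following sketch. It assumes that a web tuple is a tuple of node identifiers, that each joined web tuple is kept as an (old version, new version) pair, and that only the new version of a joined pair goes into the result; the last point is a reading of the example slides rather than a stated rule. The function name and sample identifiers are hypothetical.

```python
def delta_plus(joined_pairs, right_outer, new_nodes):
    """Collect new-version tuples that contain at least one inserted node."""
    # Candidates are the new halves of joined pairs plus the dangling new tuples.
    candidates = [new_t for _old_t, new_t in joined_pairs] + list(right_outer)
    result = []
    for web_tuple in candidates:
        if new_nodes & set(web_tuple) and web_tuple not in result:
            result.append(web_tuple)
    return result


# Illustrative identifiers only.
hirudin_old = ("a0", "b2", "u2", "k3", "d3")
niacin_new = ("a0", "b2", "u3", "k4", "d7")
viagra_new = ("a0", "b4", "u12", "k9", "d9")
cancer_new = ("a0", "b1", "k2", "d2")

joined_pairs = [(hirudin_old, niacin_new)]
right_outer = [viagra_new, cancer_new]
new_nodes = {"u3", "k4", "d7", "u12", "k9", "d9"}

delta_plus_table = delta_plus(joined_pairs, right_outer, new_nodes)
```

The same scan, run over the old halves of the joined pairs and the left outer joined table with the set of deleted nodes, would give a comparable sketch of the Delta- web table.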

Example (Right Outer Web Join)

[Figure: the right outer joined web table, repeated for reference. Tuples: Cancer/Beta Carotene, Impotence/Vasomax, Impotence/Viagra, Parkinson’s Disease/Tolcapone.]

Example (Joined Web Table)

[Figure: joined web tuple (4), combining Heart Disorder/Niacin with Heart Disease/Hirudin.]

Delta+ Web Table

[Figure: web tuples of the Delta+ web table]
  • Impotence, Vasomax: a0, b4, u9, k8, d8
  • Impotence, Viagra: a0, b4, u12, k9, d9
  • Parkinson’s Disease, Tolcapone: a0, b6, u10, k10, d10
  • Heart Disorder, Niacin: a0, b2, u3, k7, d7

Phase 2 (Delta- Web Table)

Scan the joined and left outer joined web tables to identify web tuples containing nodes which are deleted during the transition

Deleted nodes can occur only in these tables
  • In the left outer joined table, if the remaining nodes in the tuple containing the deleted nodes are modified (hence not joinable)
  • In the joined web table, if some of the nodes in the tuple containing deleted nodes have remained unchanged and hence are joinable

These web tuples are stored in the Delta- web table

Example (Left Outer Web Join)

[Figure: the left outer joined web table, repeated for reference. Tuples: Cancer/Beta Carotene, Alzheimer’s Disease/Ibuprofen, Diabetes/Albuterol, Impotence/Vasomax.]

Example (Joined Web Table)

[Figure: joined web tuple (5), the two versions of Impotence/Caverject.]

Delta- Web Table

[Figure: web tuples of the Delta- web table]
  • Alzheimer’s Disease, Ibuprofen: a0, b5, k12, d12
  • Diabetes, Albuterol: a0, b3, d4, k5
  • Impotence, Vasomax: a0, b4, u4, u5, u6, k6, d5
  • Impotence, Caverject: a0, b4, u7, u8, k7, d6

Phase 2 (Delta-M Web Table)

Finally, nodes which are modified during the transition can be identified by inspecting all three web tables

Tuples in the left and right outer joined tables which do not contain any new or deleted nodes represent the old and new versions of these nodes respectively
  • These tuples do not occur in the joined web table as all the nodes are modified

Tuples in the left and right outer joined tables that contain modified nodes as well as inserted or deleted nodes
  • These modified nodes may not appear in the joined web table if no other joinable web tuples contain these modified nodes

Example (Right Outer Web Join)

[Figure: the right outer joined web table, repeated for reference. Tuples: Cancer/Beta Carotene, Impotence/Vasomax, Impotence/Viagra, Parkinson’s Disease/Tolcapone.]

Example (Left Outer Web Join)

[Figure: the left outer joined web table, repeated for reference. Tuples: Cancer/Beta Carotene, Alzheimer’s Disease/Ibuprofen, Diabetes/Albuterol, Impotence/Vasomax.]

Phase 2 (Delta-M Web Table)

Tuples in the joined web table where some of the nodes represent the old and new versions of modified nodes

These web tuples are stored in the Delta-M web table
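One way to picture the Delta-M scan over all three tables is sketched below. It is a loose illustration, not the slides' algorithm: a web tuple is taken to be a tuple of node identifiers, a joined web tuple an (old version, new version) pair, and dangling old and new tuples are paired whenever they share a modified node. That pairing rule, the function name and the sample identifiers are all assumptions.

```python
def delta_m(joined_pairs, left_outer, right_outer, modified):
    """Pair old and new tuple versions that involve modified nodes."""
    result = []
    # Joined tuples in which some node is a modified node.
    for old_t, new_t in joined_pairs:
        if (set(old_t) | set(new_t)) & modified:
            result.append((old_t, new_t))
    # Dangling old and new tuples that share a modified node.
    for old_t in left_outer:
        for new_t in right_outer:
            if set(old_t) & set(new_t) & modified:
                result.append((old_t, new_t))
    return result


# Illustrative identifiers only.
caverject_old = ("a0", "b4", "u7", "u8", "k7", "d6")
caverject_new = ("a0", "b4", "u7")
cancer_old = ("a0", "b1", "k2", "d2")
cancer_new = ("a0", "b1", "k2", "d2")
tolcapone_new = ("a0", "b6", "u10")

joined_pairs = [(caverject_old, caverject_new)]
left_outer = [cancer_old]
right_outer = [cancer_new, tolcapone_new]
modified = {"b1", "k2", "d2", "b4"}

delta_m_table = delta_m(joined_pairs, left_outer, right_outer, modified)
```

Keeping each entry as an (old, new) pair mirrors the idea that a Delta-M web table represents both the previous and the current sets of modified nodes.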

Example (Joined Web Table)

[Figure: joined web tuples (1) and (2), involving the AIDS tuples for Indavir and Ritonavir.]

Delta-M Web Table

[Figure: web tuples (1) to (5) of the Delta-M web table. (1) AIDS/Indavir, (2) AIDS/Ritonavir, (3) the two versions of Impotence/Caverject, (4) Heart Disease/Hirudin with Heart Disorder/Hirudin, (5) the old and new versions of Cancer/Beta Carotene.]

Applications

Provides the framework for
  • Trend analysis
  • E-commerce
    • Consumer behaviour
    • Product comparisons
    • Competitive intelligence
    • Notification services
    • A useful database for buyers’ and sellers’ agents

Future Work

Analytical and empirical studies of the algorithms for generating delta web tables

Mechanism to distinguish between the modified, new or deleted nodes
  • Annotation on delta nodes

Extend to the sub-page level

Query languages for querying the changes

Change notification service