WEDAGEN: A Synthetic Web Database Generator



Presentation Outline
  • Existing WWW search mechanisms
  • WHOWEDA: A Warehouse of Web Data
  • Modular structure of WEDAGEN
  • Configuration parameters
  • Performance evaluation
  • Summary and future work


Existing W3 Search Mechanisms
  • Time delay in manual navigation of the web
  • Overwhelming results and unwanted information
  • No tool for organizing and storing harnessed information for further manipulation

Search engines and browsers are not always the best ways to systematically harness information from the web.

The WHOWEDA approach @ NTU


Overview of WHOWEDA
  • A web warehousing system to store and manipulate web information
  • Stores extracted information as ‘web tables’ and provides ‘web operators’ to manipulate web tables
  • To extract information from W3, the user defines a ‘query graph’
  • The result of extraction is a set of web tuples; each tuple instantiates the query graph
  • More information: http://www.cais.ntu.edu.sg:8000/~whoweda


Example: Query graph (web schema)

Predicates:
  N1.URL EQUALS “http://sunsite.doc.ic.ac.uk/bySubject/Computing/UniSciDepts.html”
  L2.LABEL EQUALS “faculty”
  L3.LABEL EQUALS “research projects”
  L4.LABEL CONTAINS “publications”
  L5.LABEL CONTAINS “publications”
  N5.TEXT CONTAINS “Internet computing”

Nodes in the graph: N1, N2, N3, N4, N5
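The predicates of this query graph could be represented and checked roughly as follows. This is a minimal Python sketch, not WHOWEDA's actual implementation: the `Predicate` class, the `matches` function and the dictionary layout used for node and link instances are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    var: str    # node or link variable, e.g. "N1" or "L2"
    attr: str   # attribute name, e.g. "URL", "LABEL", "TEXT"
    op: str     # "EQUALS" or "CONTAINS"
    value: str

    def holds(self, instance: dict) -> bool:
        """Check this predicate against one node or link instance."""
        actual = instance.get(self.attr, "")
        if self.op == "EQUALS":
            return actual == self.value
        if self.op == "CONTAINS":
            return self.value in actual
        raise ValueError(f"unknown operator: {self.op}")


# The six predicates from the slide.
QUERY = [
    Predicate("N1", "URL", "EQUALS",
              "http://sunsite.doc.ic.ac.uk/bySubject/Computing/UniSciDepts.html"),
    Predicate("L2", "LABEL", "EQUALS", "faculty"),
    Predicate("L3", "LABEL", "EQUALS", "research projects"),
    Predicate("L4", "LABEL", "CONTAINS", "publications"),
    Predicate("L5", "LABEL", "CONTAINS", "publications"),
    Predicate("N5", "TEXT", "CONTAINS", "Internet computing"),
]


def matches(tuple_instances: dict, predicates=QUERY) -> bool:
    """True if every predicate holds on the instance bound to its variable."""
    return all(
        p.var in tuple_instances and p.holds(tuple_instances[p.var])
        for p in predicates
    )
```

A candidate web tuple is passed as a mapping from variable name to instance attributes, for example `{"L2": {"LABEL": "faculty"}, ...}`; a tuple that leaves any constrained variable unbound is rejected.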


Example: Query results

Id   Name    Age
A1   John    23
C2   Wendy   35
B4   Jane    25
A2   Wendy   35
C9   Pete    42
B3   Kim     38
F8   Tom     22
G7   Cindy   47


Objectives
  • Need to perform systematic evaluation of web operators during WHOWEDA development
  • Limitations of testing using real web data
  • To design a testbed that is controllable, comprehensive and systematic for evaluating web database systems
  • To control the quantity and quality of synthetic web tuples by allowing users to specify configuration parameters and web schemas

WEDAGEN: A Web Database Generator


System Architecture of WEDAGEN


Configuration Input Parameters

WEDAGEN parameters (tree diagram; labels: Web Schema, Instance Related, Control, Default, Specific, Selectivity, Fan-In)
  • Default: NumTuples, NumSourceNodeInstances, FanOut, NumKeyWordsPerNodeInstance, NumWordsPerNodeInstance, NumWordsPerLinkLabel, NumWordsPerHostName, NumWordsPerTitle, LocalGlobalLink
  • Specific: NumSourceNodeInstances, FanOut, NumKeyWordsPerNodeInstance, NumWordsPerNodeInstance, NumWordsPerLinkLabel, NumWordsPerTitle, NumWordsPerHostName, LocalGlobalLink
  • Selectivity: NodeSelectivity, TableSelectivity
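One way to hold these parameters in code is sketched below in Python. The class names, the `params_for` helper and all numeric values are illustrative assumptions. The sketch also assumes that "specific" values override the "default" values for an individual node of the web schema, which the slide does not state.

```python
from dataclasses import dataclass, asdict, field


@dataclass
class InstanceParams:
    # Instance-related parameters named on the slide.
    num_source_node_instances: int
    fan_out: int
    num_keywords_per_node_instance: int
    num_words_per_node_instance: int
    num_words_per_link_label: int
    num_words_per_host_name: int
    num_words_per_title: int
    local_global_link: int


@dataclass
class WedagenConfig:
    num_tuples: int
    default: InstanceParams
    node_selectivity: int
    table_selectivity: int
    # Assumed semantics: per-node overrides of the defaults, keyed by node variable.
    specific: dict = field(default_factory=dict)

    def params_for(self, node: str) -> InstanceParams:
        """Default parameters with any node-specific overrides applied."""
        merged = {**asdict(self.default), **self.specific.get(node, {})}
        return InstanceParams(**merged)


# Illustrative values only.
config = WedagenConfig(
    num_tuples=600,
    default=InstanceParams(24, 5, 3, 200, 6, 5, 6, 0),
    node_selectivity=60,
    table_selectivity=100,
    specific={"N2": {"fan_out": 3}},
)
```

With this layout, `config.params_for("N2")` would return the defaults with `fan_out` replaced by 3, while any node without an entry in `specific` gets the defaults unchanged.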


Parameter Values Suggestion

Flowchart boxes:
  • Start
  • Generate specific parameter values
  • User changes specific parameters
  • Calculate max. no. of tuples to be generated
  • Decision: is calculated value > NumTuples?
  • Calculate NumSourceNodeInstances to generate specified number of tuples
  • Store suggested values in file
  • User changes specific parameters
  • Invoke instance generation module
  • End
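The calculation steps in this flowchart might look like the Python sketch below. The slides do not give the formula for the maximum number of tuples, so the sketch assumes a linear web schema in which every node instance has exactly `FanOut` outgoing links, giving `NumSourceNodeInstances * FanOut ** num_links` tuples. It also assumes that the "yes" branch of the decision leads to recomputing NumSourceNodeInstances. The function names and the `num_links` argument are hypothetical.

```python
def max_tuples(num_source: int, fan_out: int, num_links: int) -> int:
    """Assumed upper bound on tuples for a linear schema with num_links links."""
    return num_source * fan_out ** num_links


def suggest_num_source(num_tuples: int, fan_out: int, num_links: int) -> int:
    """Smallest number of source instances that can yield num_tuples tuples."""
    per_source = fan_out ** num_links
    # Integer ceiling division; at least one source instance is always needed.
    return max(1, -(-num_tuples // per_source))


def suggest(params: dict, num_links: int) -> dict:
    """Return a copy of params with NumSourceNodeInstances adjusted if needed."""
    suggested = dict(params)
    cap = max_tuples(params["NumSourceNodeInstances"], params["FanOut"], num_links)
    if cap > params["NumTuples"]:  # the flowchart's decision box
        suggested["NumSourceNodeInstances"] = suggest_num_source(
            params["NumTuples"], params["FanOut"], num_links
        )
    return suggested
```

Under these assumptions, asking for 600 tuples with a fan-out of 5 over two links would suggest 24 source node instances, since each source instance contributes 25 tuples.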


Instance Generation Module (IGM)

Components:
  1. No. of node instance generator
  2. URL generator
  3. Node instance attribute generator
  4. Linkset generator
  5. Web page generator

Inputs and intermediate data shown in the diagram:
  • NumSourceNodeInstances
  • Fanout
  • No. of node instances per node
  • Num words per URL
  • URLs of all node instances
  • Linkset of each instance
  • Node attributes, e.g. title, text, date
  • Num words per node instance
  • Num words per title
  • Images
  • Web page

Outputs and connected components:
  • Node pool
  • Web pages
  • Web tables
  • Tuple Extraction Module
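The pipeline above can be illustrated with a small Python sketch. It is not the WEDAGEN implementation: it assumes a linear schema in which every node instance receives exactly `fan_out` children, and the function names, the `WORDS` vocabulary and the URL pattern are made up for the example.

```python
import random

WORDS = ["alpha", "beta", "gamma", "delta", "omega", "sigma"]  # stand-in vocabulary


def make_url(rng, num_words_per_host_name, node_id):
    """Build a synthetic URL; the node id keeps every URL unique."""
    host = "".join(rng.choice(WORDS) for _ in range(num_words_per_host_name))
    return f"http://www.{host}.com/{node_id}.html"


def generate_instances(num_source, fan_out, depth,
                       num_words_per_title=2, num_words_per_host_name=1, seed=0):
    """Return (nodes, links): a node pool and the directed links between instances."""
    rng = random.Random(seed)
    nodes, links = {}, []

    def new_node(level):
        node_id = f"n{len(nodes)}"
        nodes[node_id] = {
            "level": level,
            "url": make_url(rng, num_words_per_host_name, node_id),
            "title": " ".join(rng.choice(WORDS) for _ in range(num_words_per_title)),
        }
        return node_id

    # Level 0 holds the source node instances.
    frontier = [new_node(0) for _ in range(num_source)]
    for level in range(1, depth + 1):
        next_frontier = []
        for parent in frontier:
            for _ in range(fan_out):
                child = new_node(level)
                links.append((parent, child))  # one link instance per child
                next_frontier.append(child)
        frontier = next_frontier
    return nodes, links
```

The returned `links` list forms the directed graph that a later extraction step can traverse; page text, images and selectivity handling are left out to keep the sketch short.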


Directed Graph Output from IGM


Tuple Extraction Module (TEM)
  • IGM generates all node and link instances interconnected as directed graph(s)
  • TEM extracts and constructs individual web tuples from the directed graph(s)
  • Node and link instances have IDs assigned
  • Web tuples stored in a web table file
  • A web table has been constructed that is complete with node, link and tuple information
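Tuple extraction from a directed graph can be sketched in Python as below. This is only an approximation of what TEM might do: it assumes an acyclic graph and treats each path from a source instance to a leaf as one web tuple, which fits linear schemas but not necessarily tree-shaped ones. The function name and the ID formats (`l0`, `t0`) are hypothetical.

```python
from collections import defaultdict


def extract_tuples(sources, links):
    """Enumerate source-to-leaf paths; each path becomes one web tuple."""
    children = defaultdict(list)
    for link_index, (src, dst) in enumerate(links):
        # Assign an ID to every link instance.
        children[src].append((f"l{link_index}", dst))

    paths = []

    def walk(node, path):
        outgoing = children.get(node, [])
        if not outgoing:  # leaf reached: the path is a complete tuple
            paths.append(path)
            return
        for link_id, child in outgoing:
            walk(child, path + [link_id, child])

    for source in sources:
        walk(source, [source])

    # Assign tuple IDs; a web table file would store these records.
    return [{"tuple_id": f"t{i}", "path": p} for i, p in enumerate(paths)]
```

Each record alternates node IDs and link IDs along the path, so the resulting table carries the node, link and tuple information together.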


Extracted Web Tuples


Preliminary Evaluation
  • Elapsed time used to measure overhead of web table generation
  • A set of sample test configurations identified, consisting of typical combinations of 4 web schemas and input parameters
  • Performance measured with respect to:
      - Complexity of schema
      - Total number of node instances and total number of tuples
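An elapsed-time measurement of this kind could be set up as in the following Python sketch. The harness, the `dummy_generate` stand-in and the choice to keep the best of several runs are assumptions made for illustration; the slides do not describe how the timings were actually collected.

```python
import time


def time_generation(generate, configs, repeats=3):
    """Return the best elapsed wall-clock time (seconds) per named configuration."""
    results = {}
    for name, kwargs in configs.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            generate(**kwargs)
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


def dummy_generate(num_tuples):
    """Stand-in for web table generation: builds num_tuples placeholder rows."""
    return [("tuple", i) for i in range(num_tuples)]


timings = time_generation(
    dummy_generate,
    {"small": {"num_tuples": 600},
     "medium": {"num_tuples": 2000},
     "large": {"num_tuples": 5000}},
)
```

Replacing `dummy_generate` with a real generator and adding one configuration per schema would give the elapsed-time-versus-size comparison described above.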


Four Test Schemas


Three Table Sizes

Parameter                   Small   Medium   Large
NumTuples                     600    2,000   5,000
NumSourceNodeInstances
  - Schema 1                   24       20       8
  - Schema 2                    1        1       4
  - Schema 3                    1        1       1
  - Schema 4                   24       20       8
FanOut                          5       10      25
NumWordsPerNodeInstance       200      100      00
NumWordsPerLinkLabel            6       10      10
NumWordsPerHostname             5       10      10
NumWordsPerTitle                6       10      10
LocalGlobalLink                 0        0       0
NodeSelectivity                60       90      90


Elapsed Time Vs No. of Tuples


Experimental Findings
  • Time elapsed in generating a web table increases with the size of the table
  • Rate of growth is different for different schemas; i.e., schema complexity affects elapsed time
  • Generating a table of the tree schema (schema 2) takes longer than that of the linear schema (schema 1)
  • Generating a table of schema 2 takes longer than that of schema 4


Summary
  • Parameters for creating web data of different sizes and complexities were successfully identified
  • WEDAGEN has been designed, implemented and successfully integrated into the WHOWEDA system
  • WEDAGEN scales well with increasing web schema complexity and web table size
  • Time and effort required to evaluate web database system performance can be reduced with WEDAGEN


Future Work
  • Inclusion of more parameters:
      - Minimum and maximum depth of a tuple
      - Average ratio of bound and unbound nodes in a tuple
  • Apply WEDAGEN to other database systems similar to WHOWEDA
  • Develop WHOWEDA into a full-fledged benchmark toolkit