WEDAGEN: A Synthetic Web Database Generator
Presentation Outline
Existing WWW search mechanisms
WHOWEDA: A Warehouse of Web Data
Modular structure of WEDAGEN
Configuration parameters
Performance evaluation
Summary and future work
Existing W3 Search Mechanisms
Time delay in manual navigation of the web
Overwhelming results and unwanted information
No tool for organizing and storing harnessed information for further manipulation
Search engines and browsers are not always the best ways to systematically harness information from the web
The WHOWEDA approach @ NTU
Overview of WHOWEDA
A web warehousing system to store and manipulate web information
Stores extracted information as ‘web tables’ and provides ‘web operators’ to manipulate web tables
To extract information from W3, the user defines a ‘query graph’
The result of an extraction is a set of web tuples; each tuple instantiates the query graph
More information: http://www.cais.ntu.edu.sg:8000/~whoweda
Example: Query graph (web schema)
N1.URL EQUALS “http://sunsite.doc.ic.ac.uk/bySubject/Computing/UniSciDepts.html”
L2.LABEL EQUALS “faculty”
L3.LABEL EQUALS “research projects”
L4.LABEL CONTAINS “publications”
L5.LABEL CONTAINS “publications”
N5.TEXT CONTAINS “Internet computing”
[Figure: query graph with nodes N1, N2, N3, N4, N5]
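The predicates of the example query graph can be written down as data and checked against candidate instances. The following is a minimal sketch only: the `Predicate` class, the `satisfies` helper and the shape of the `bindings` dictionary are hypothetical names introduced for illustration, the URL is assumed to contain no spaces, and the graph topology (which link connects which nodes) is not modeled because the slide text does not give it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    target: str     # node or link variable from the slide, e.g. "N1" or "L2"
    attribute: str  # e.g. "URL", "LABEL", "TEXT"
    operator: str   # "EQUALS" or "CONTAINS", the two operators used on the slide
    value: str

    def holds(self, instance: dict) -> bool:
        """Check this predicate against one node or link instance."""
        actual = instance.get(self.attribute, "")
        if self.operator == "EQUALS":
            return actual == self.value
        if self.operator == "CONTAINS":
            return self.value in actual
        raise ValueError(f"unsupported operator: {self.operator}")


# Predicates copied from the example slide (URL joined without spaces: an assumption).
PREDICATES = [
    Predicate("N1", "URL", "EQUALS",
              "http://sunsite.doc.ic.ac.uk/bySubject/Computing/UniSciDepts.html"),
    Predicate("L2", "LABEL", "EQUALS", "faculty"),
    Predicate("L3", "LABEL", "EQUALS", "research projects"),
    Predicate("L4", "LABEL", "CONTAINS", "publications"),
    Predicate("L5", "LABEL", "CONTAINS", "publications"),
    Predicate("N5", "TEXT", "CONTAINS", "Internet computing"),
]


def satisfies(bindings: dict, predicates=PREDICATES) -> bool:
    """bindings maps a variable name to the attribute dict of its instance."""
    return all(p.holds(bindings.get(p.target, {})) for p in predicates)


example_bindings = {
    "N1": {"URL": "http://sunsite.doc.ic.ac.uk/bySubject/Computing/UniSciDepts.html"},
    "L2": {"LABEL": "faculty"},
    "L3": {"LABEL": "research projects"},
    "L4": {"LABEL": "selected publications"},
    "L5": {"LABEL": "publications 1998"},
    "N5": {"TEXT": "A survey of Internet computing"},
}
```

A candidate tuple whose instances satisfy every predicate would be one instantiation of the query graph in this simplified reading.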
Example: Query results
Id Name Age
A1 John 23
C2 Wendy 35
B4 Jane 25
A2 Wendy 35
C9 Pete 42
B3 Kim 38
F8 Tom 22
G7 Cindy 47
Objectives
Need to perform systematic evaluation of web operators during WHOWEDA development
Limitations of testing using real web data
To design a testbed that is controllable, comprehensive and systematic for evaluating web database systems
To control the quantity and quality of synthetic web tuples by allowing users to specify configuration parameters and web schemas
WEDAGEN: A Web Database Generator
System Architecture of WEDAGEN
Configuration Input Parameters
WEDAGEN parameters
[Figure: parameter classification. Category labels shown: Default, Specific, Selectivity, Instance Related, Control]
Default: NumTuples, NumSourceNodeInstances, FanOut, NumKeyWordsPerNodeInstance, NumWordsPerNodeInstance, NumWordsPerLinkLabel, NumWordsPerHostName, NumWordsPerTitle, LocalGlobalLink
Specific: NumSourceNodeInstances, FanOut, NumKeyWordsPerNodeInstance, NumWordsPerNodeInstance, NumWordsPerLinkLabel, NumWordsPerTitle, NumWordsPerHostName, LocalGlobalLink
Selectivity: NodeSelectivity, TableSelectivity
Other labels in the figure: Web Schema, Fan-In
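One way to picture the split between default and specific parameters is a configuration object in which per-node values override global defaults. This is a sketch under assumptions, not the WEDAGEN file format: the class names, the override semantics and the `params_for` helper are hypothetical, only a subset of the parameters is modeled, and the default of 1 for `num_source_node_instances` is an arbitrary placeholder. The other default values are taken from the "Small" test configuration later in the deck.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class InstanceParams:
    # Snake-case versions of the slide's parameter names (subset only).
    num_source_node_instances: int = 1   # placeholder value, not from the slides
    fan_out: int = 5
    num_words_per_node_instance: int = 200
    num_words_per_link_label: int = 6
    num_words_per_host_name: int = 5
    num_words_per_title: int = 6
    local_global_link: int = 0


@dataclass
class Config:
    num_tuples: int = 600
    node_selectivity: int = 60
    default: InstanceParams = field(default_factory=InstanceParams)
    # Hypothetical: schema node name -> dict of overriding parameter values.
    specific: dict = field(default_factory=dict)

    def params_for(self, node: str) -> InstanceParams:
        """Default parameters with any node-specific overrides applied."""
        return replace(self.default, **self.specific.get(node, {}))


cfg = Config(specific={"N2": {"fan_out": 10}})
```

Here `cfg.params_for("N2")` carries the overridden fan-out, while every other node falls back to the defaults.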
Parameter Values Suggestion
[Flowchart: parameter value suggestion. Boxes shown, in order of appearance:]
Start
Generate specific parameter values
User changes specific parameters
Calculate max. no. of tuples to be generated
Decision: is calculated value > NumTuples?
Calculate NumSourceNodeInstances to generate specified number of tuples
Store suggested values in file
User changes specific parameters
End
Invoke instance generation module
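The two calculation steps in the flowchart can be sketched for the simplest case. The formula below is an assumption, not taken from the slides: it supposes a linear schema in which every node instance has exactly `fan_out` outgoing links at each of `num_links` hops, so that the number of tuples is `NumSourceNodeInstances * FanOut ** num_links`. All function names are hypothetical.

```python
def max_tuples(num_source: int, fan_out: int, num_links: int) -> int:
    """Tuples producible by a linear schema under the assumed formula."""
    return num_source * fan_out ** num_links


def suggest_num_source(num_tuples: int, fan_out: int, num_links: int) -> int:
    """Smallest number of source node instances that yields >= num_tuples."""
    per_source = fan_out ** num_links
    return max(1, -(-num_tuples // per_source))  # integer ceiling division


def suggest(num_tuples: int, num_source: int, fan_out: int, num_links: int) -> dict:
    """Mirror the flowchart's calculation steps and report the outcome."""
    calculated = max_tuples(num_source, fan_out, num_links)
    return {
        "MaxTuples": calculated,
        "ExceedsNumTuples": calculated > num_tuples,  # the decision box
        "NumSourceNodeInstances": suggest_num_source(num_tuples, fan_out, num_links),
    }


result = suggest(num_tuples=600, num_source=30, fan_out=5, num_links=2)
```

With 30 source instances the assumed formula gives 750 tuples, which exceeds the requested 600, so the sketch suggests a smaller source count that produces the requested number.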
Instance Generation Module (IGM)
[Figure: IGM pipeline]
Components:
1. No. of node instance generator
2. URL generator
3. Node instance attribute generator
4. Linkset generator
5. Webpage generator
Data labels in the figure: NumSourceNodeInstances, Fanout, no. of node instances per node, num words per URL, URLs of all node instances, linkset of each instance, node attributes (e.g. title, text, date), num words per node instance, num words per title, images, webpage
Stores and downstream modules: Node Pool, Web pages, Web tables, Tuple Extraction Module
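A rough idea of what the first four IGM components produce can be given with a small generator for a linear schema. This is an illustrative sketch, not the WEDAGEN implementation: the function name, the vocabulary, the URL pattern, and the rule that every instance gets exactly `fan_out` children are all assumptions.

```python
import random

VOCAB = ["alpha", "beta", "gamma", "delta", "omega"]  # stand-in word list


def generate_graph(num_source: int, fan_out: int, depth: int,
                   words_per_host: int = 2, seed: int = 0):
    """Return (nodes, links) for a synthetic directed graph.

    nodes: dict of node id -> attribute dict (level, url, title)
    links: list of (source id, target id, label)
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    nodes, links = {}, []

    def new_node(level: int) -> str:
        node_id = f"n{len(nodes)}"
        host = ".".join(rng.choice(VOCAB) for _ in range(words_per_host))
        nodes[node_id] = {
            "level": level,
            "url": f"http://{host}.example/{node_id}.html",
            "title": rng.choice(VOCAB),
        }
        return node_id

    # Level 0 holds the source node instances.
    frontier = [new_node(0) for _ in range(num_source)]
    for level in range(1, depth + 1):
        next_frontier = []
        for src in frontier:
            for _ in range(fan_out):
                dst = new_node(level)
                links.append((src, dst, rng.choice(VOCAB)))
                next_frontier.append(dst)
        frontier = next_frontier
    return nodes, links


nodes, links = generate_graph(num_source=2, fan_out=3, depth=2)
```

The node dictionary plays the role of the node pool and the link list stands in for the linksets; web page generation is left out.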
Directed Graph Output from IGM
Tuple Extraction Module (TEM)
IGM generates all node and link instances interconnected as directed graph(s)
TEM extracts and constructs individual web tuples from the directed graph(s)
Node and link instances have IDs assigned
Web tuples stored in a web table file
A web table has been constructed that is complete with node, link and tuple information
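The extraction step can be sketched as path enumeration over the directed graph. The sketch below treats each path from a node without incoming links to a node without outgoing links as one web tuple, which is an assumption that fits only a linear schema; for a tree schema a tuple would be a subgraph rather than a path. The function name and the tuple representation are hypothetical, and the graph is assumed to be acyclic.

```python
from collections import defaultdict


def extract_tuples(links):
    """links: iterable of (link id, source node id, target node id)."""
    outgoing = defaultdict(list)
    has_incoming = set()
    node_order = {}  # insertion-ordered set of node ids
    for link_id, src, dst in links:
        outgoing[src].append((link_id, dst))
        has_incoming.add(dst)
        node_order.setdefault(src, None)
        node_order.setdefault(dst, None)

    roots = [n for n in node_order if n not in has_incoming]
    tuples = []

    def walk(node, path_nodes, path_links):
        children = outgoing.get(node)
        if not children:  # leaf reached: one complete tuple
            tuples.append({
                "id": f"t{len(tuples) + 1}",
                "nodes": path_nodes,
                "links": path_links,
            })
            return
        for link_id, dst in children:
            walk(dst, path_nodes + [dst], path_links + [link_id])

    for root in roots:
        walk(root, [root], [])
    return tuples


web_tuples = extract_tuples([("l1", "a", "b"), ("l2", "a", "c"), ("l3", "c", "d")])
```

Each resulting record keeps the node IDs, link IDs and a tuple ID together, mirroring the node, link and tuple information that the slide says a web table holds.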
Extracted Web Tuples
Preliminary Evaluation
Elapsed time used to measure overhead of web table generation
A set of sample test configurations identified, consisting of typical combinations of 4 web schemas and input parameters
Performance measured with respect to:
  Complexity of schema
  Total number of node instances and total number of tuples
Four Test Schemas
Three Table Sizes
Parameter                            Small   Medium  Large
NumTuples                            600     2,000   5,000
NumSourceNodeInstances (Schema 1)    24      20      8
NumSourceNodeInstances (Schema 2)    1       1       4
NumSourceNodeInstances (Schema 3)    1       1       1
NumSourceNodeInstances (Schema 4)    24      20      8
FanOut                               5       10      25
NumWordsPerNodeInstance              200     100     00
NumWordsPerLinkLabel                 6       10      10
NumWordsPerHostname                  5       10      10
NumWordsPerTitle                     6       10      10
LocalGlobalLink                      0       0       0
NodeSelectivity                      60      90      90
Elapsed Time Vs No. of Tuples
Experimental Findings
Time elapsed in generating a web table increases with the size of the table
Rate of growth differs between schemas; i.e., schema complexity affects elapsed time
Generating a table for the tree schema (schema 2) takes longer than for the linear schema (schema 1)
Generating a table for schema 2 takes longer than for schema 4
Summary
Identified the parameters needed to create web data of different sizes and complexities
Designed and implemented WEDAGEN, which has been successfully integrated into the WHOWEDA system
WEDAGEN scales well with increasing web schema complexity and web table size
WEDAGEN can reduce the time and effort required to evaluate web database system performance
Future Work
Inclusion of more parameters:
Minimum and maximum depth of a tuple.
Average ratio of bound and unbound nodes in a tuple.
Apply WEDAGEN to other database systems similar to WHOWEDA
Develop WHOWEDA into a full-fledged benchmark toolkit