Validating an Access Cost Model for Wide Area Applications

Louiqa Raschid

University of Maryland

CoopIS 2001

Co-authors: V. Zadorozhny, T. Zhan and L. Bright

Scalable Wide-Area Applications

Problems:
• Wide area environment is dynamic (noisy)
• Wide variability in latency (end-to-end delay)
• Network and server workloads are unknown
• Time and Day dependencies impact latency
• Dynamic environment - constantly monitored

Research Objective: Use query feedback to monitor and learn behavior, and to predict access cost distributions that may be Time and Day dependent


Talk Outline

Architecture for Wide Area Applications

WebPT: Tool to predict access costs

WebPT based Access Cost Catalog

Grouping of WebSources based on observable WebSource characteristics

Hypothesis to test WebPT based Catalog -- High Prediction Accuracy versus Low Prediction Accuracy

Validation based on experimental case study


Architecture for WebPT based Catalog


Predicting Response Times for Accessing WebSources

Problem: Difficulty in determining evaluation costs
• Physical implementation details unknown
• Load on network and WebSource unknown

Objective:
• Use query feedback to learn access costs
• Exploit Time of day, Day of week, etc. to predict costs
• Identify easily observable WebSource characteristics
• Determine prediction accuracy for WebSources based on WebSource characteristics


Metrics in WebPT Access Cost Model

WebSource and Network Costs
• Query processing at WebSource
• Downloading data from WebSource (extraction cost)

Wrapper Statistics
• Number of pages accessed
• Cardinality of result
• Statistics may be dependent on value of query binding

WebPT: a tool for learning using query feedback and predicting access cost based on parameters such as Day, Time, Quantity of data, Cardinality, etc.
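As a hedged illustration of the metrics listed above, one query-feedback observation could be recorded roughly as follows. The class name `FeedbackRecord`, every field name, and the sample values are assumptions made for this sketch; the slides do not describe how WebPT stores its feedback.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FeedbackRecord:
    """One observation from query feedback (hypothetical layout)."""
    source: str            # WebSource name
    submitted: datetime    # when the query was submitted
    pages_accessed: int    # wrapper statistic
    cardinality: int       # wrapper statistic: size of the result
    quantity_bytes: int    # quantity of data downloaded
    response_secs: float   # observed access cost

    @property
    def day(self) -> int:
        # Day of week, with Monday = 0
        return self.submitted.weekday()

    @property
    def hour(self) -> int:
        # Hour of day, 0..23
        return self.submitted.hour


# Made-up values, for illustration only.
rec = FeedbackRecord("ACM", datetime(2000, 3, 6, 14, 30), 3, 42, 51200, 2.8)
```

Deriving Day and Time from the submission timestamp keeps each record small while still exposing the dimensions the cost model is said to use.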


WebPT Learning


WebPT based Prediction

WebPT is configured for some hierarchy of dimensions
• Quantity, Day, Time, Cardinality

WebPT Learning algorithm
• Cell splitting
• Smoothing
• Estimate response time and confidence
• Similar to CART (regression versus heuristics)
• Cell merging

Heuristics used in calibration of each cell
• Dimension: min / max / scale
• Allowed deviation
• Confidence window
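To make the idea of cell-based prediction concrete, here is a minimal sketch that buckets observations by Day and Time and returns an estimate with a spread. It is not the WebPT algorithm: there is no cell splitting, smoothing, or calibration heuristic, and the fallback to all data is only a crude stand-in for cell merging. The class `CellPredictor` and the parameters `bucket_hours` and `min_samples` are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, stdev


class CellPredictor:
    """Toy grid of (day, time-bucket) cells; NOT the WebPT algorithm."""

    def __init__(self, bucket_hours=6, min_samples=3):
        self.bucket_hours = bucket_hours
        self.min_samples = min_samples
        self.cells = defaultdict(list)  # (day, bucket) -> response times

    def _key(self, day, hour):
        return (day, hour // self.bucket_hours)

    def learn(self, day, hour, response_secs):
        """Add one query-feedback observation to its cell."""
        self.cells[self._key(day, hour)].append(response_secs)

    def predict(self, day, hour):
        """Return (estimate, spread); use all data if the cell is sparse."""
        samples = self.cells.get(self._key(day, hour), [])
        if len(samples) < self.min_samples:
            # Crude stand-in for merging: pool every cell's observations.
            samples = [t for ts in self.cells.values() for t in ts]
        if not samples:
            return None, None
        spread = stdev(samples) if len(samples) > 1 else 0.0
        return mean(samples), spread


predictor = CellPredictor()
for hour, secs in [(14, 2.0), (15, 4.0), (16, 3.0)]:
    predictor.learn(0, hour, secs)  # day 0, afternoon bucket
estimate, spread = predictor.predict(0, 13)
```

The spread here is a plain sample standard deviation, used only as a rough proxy for the confidence that the slide mentions.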


Prediction Accuracy of WebPT based Cost Model is strongly correlated with the following:

Observable WebSource Characteristics
• Significance of Time and Day in predicting workload at the server and on the network
• Variance (noise) in accessing server

Quality of available statistics (cardinality)
• Random bindings: large variance in cardinality
• Fixed bindings: better estimation of cardinality


Case Study: Data gathering and Experiment

• 6 data sources in the public domain
• Data gathered for several weeks in 1999, 2000
• Queries submitted to WebSources periodically
• Recorded TTF and TTL
• Query bindings affected result cardinality
  • Random bindings: >50 bindings
  • Fixed bindings: 2 bindings each for [S, M, L]
• Mediator queries: simple scan to complex 5-way join over data in 5 WebSources (not reported)


Observable WebSource Characteristics


Grouping of WebSources based on Characteristics

• G1: T and D significant; Noise can vary
• G2: Noise High
• G3: T, D not significant; Noise Low - EMPTY
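The grouping rules above can be sketched as a small function. The groups overlap, so it returns a set: a source with significant T and D and high noise falls in both G1 and G2. The function name `groups` and the two boolean inputs are assumptions for this example; how significance and noise were actually measured is not stated on the slide.

```python
def groups(td_significant: bool, noise_high: bool) -> set:
    """Return the set of groups a WebSource belongs to."""
    result = set()
    if td_significant:
        result.add("G1")  # T and D significant; noise can vary
    if noise_high:
        result.add("G2")  # noise high
    if not td_significant and not noise_high:
        result.add("G3")  # T, D not significant; noise low
    return result


both = groups(td_significant=True, noise_high=True)
```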


Hypothesis to test WebPT based Access Cost Catalog

H1: High prediction accuracy for the following
• T, D are significant and Low Noise
• Sources are in G1 but not in G2

H2: Catalog will improve prediction accuracy for the following WebSources
• T, D are significant, independent of noise
• Group G1

H3: Statistics may be dependent on value of query binding
• Prediction accuracy improves with learning on fixed bindings
• Sources in both groups


Prediction Accuracy for WebSources

WebPT(Lo) - Random bindings


WebSource Characteristics and Correlation with Prediction Accuracy


Groupings of WebSources and Correlation with Prediction Accuracy

• G1: T and D significant
• G2: Noise High
• GNIS: High Pred Accuracy; in G1 AND G2
• FAA; FishBase: Low Pred Accuracy while in G1; Noisy


Quantile Plots of Relative Error of Prediction for ACM, Aircraft


Quantile Plot of Relative Error of Prediction for FAA, GNIS
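For readers who want to reproduce this kind of plot on their own measurements, the sketch below computes relative errors and their quantiles. The slides do not define relative error; this example assumes |predicted - observed| / observed. The function names and the sample numbers are made up for illustration.

```python
from statistics import quantiles


def relative_errors(predicted, observed):
    """Assumed definition: |predicted - observed| / observed."""
    return [abs(p - o) / o for p, o in zip(predicted, observed)]


def error_quantiles(errors, n=4):
    """Cut points that split the errors into n equal-probability groups."""
    return quantiles(errors, n=n, method="inclusive")


# Made-up response times in seconds, for illustration only.
errors = relative_errors([1.0, 2.0, 3.0, 5.0], [2.0, 2.0, 2.0, 4.0])
cuts = error_quantiles(errors)
```

Plotting the sorted errors against their rank fraction gives a quantile plot, and a curve that stays low over most of the range indicates high prediction accuracy.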


Summary + Impact

• Unique Case Study: WebPT based Access Cost Catalog and Cost distributions
• Grouping of WebSources based on observable WebSource characteristics
• High Prediction Accuracy for some sources in G1 (T, D significant) with low noise
• High Prediction Accuracy for some sources in G1 and in G2 (High Noise)
• Similar results for Mediator cost model and complex N-way joins over multiple WebSources
