Hopkins Storage Systems Lab, Department of Computer Science
A Workload-Driven Unit of Cache Replacement for Mid-Tier Database Caching
Xiaodan Wang, Tanu Malik, Randal Burns
Johns Hopkins University
Stratos Papadomanolakis, Anastassia Ailamaki
Carnegie Mellon University
Overview
Motivation
– Data intensive scientific database federations
– Mid-tier caching improves scalability
Choosing the unit of cache replacement
– Minimize aggregate network traffic
– Improve query execution performance
Query prototypes
– Cache groups of columns
– Adapts to changes in the workload
OpenSkyQuery
Federation of sky surveys (a virtual telescope)
– Expected to grow from 30 sites to over 100
Available over the Internet (community of astronomers, educational users)
Sites are autonomous, heterogeneous, and geographically distributed
Data intensive workload (large data sets, network-bound)
Scaling through mid-tier caching
– Minimize network traffic
– Offload query processing
Caching Schema
Difficult to achieve good query performance
– Caches employ commodity hardware
– An index-free environment
Both network and query performance are sensitive to granularity of cache replacement
Fine granularity (column)
– Poor network performance at small cache sizes
– High I/O overhead
Coarse granularity (table)
– Groups unrelated columns
– Inefficient query and network performance
Contributions
Cache workload-defined groups of columns (query prototypes)
Adaptive – candidate query prototypes are discovered incrementally from the request stream
Self-organizing – each prototype describes a physical schema optimized for a specific class of queries
Improve in-cache query execution performance without sacrificing network savings
Caching for Network Savings
Identify and cache database objects that provide network savings
– Requests that access these objects are serviced from the cache
– Reduces contention for network bandwidth
Bypass Yield Caching (Malik et al., ICDE’05)
– Caching framework that uses economic principles to maximize network savings
– Database objects are ranked by yield (expected network savings per unit of cache space utilized)
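As a rough illustration of ranking by yield, the sketch below divides each object's expected network savings by the cache space it needs and then fills the cache in yield order. The object names, sizes, and savings estimates are hypothetical, and the greedy fill is a simplification for illustration, not the Bypass Yield algorithm itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheObject:
    name: str
    size_mb: float              # cache space the object would occupy
    expected_savings_mb: float  # network traffic avoided if cached (assumed estimate)

    @property
    def yield_(self) -> float:
        # Yield: expected network savings per unit of cache space.
        return self.expected_savings_mb / self.size_mb


def rank_by_yield(objects):
    """Return objects ordered from highest to lowest yield."""
    return sorted(objects, key=lambda o: o.yield_, reverse=True)


def fill_cache(objects, capacity_mb):
    """Greedily admit objects in yield order while they still fit."""
    chosen, used = [], 0.0
    for obj in rank_by_yield(objects):
        if used + obj.size_mb <= capacity_mb:
            chosen.append(obj.name)
            used += obj.size_mb
    return chosen


# Hypothetical objects; the figures are made up for the example.
objects = [
    CacheObject("spectrum", 400.0, 800.0),  # yield 2.0
    CacheObject("objID", 100.0, 900.0),     # yield 9.0
    CacheObject("z", 50.0, 300.0),          # yield 6.0
]
cached = fill_cache(objects, capacity_mb=200.0)
```

With a 200 MB cache, the two high-yield objects are admitted and the large, low-yield object is bypassed.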
Choosing the Unit of Cache Replacement
Semantic caching is unsuitable for Astronomy
– Lacks locality (objects are rarely reused)
– Evaluating query containment is difficult (nested queries, complex joins, and user-defined functions are common)
Employ schema-based caching
– Queries reuse the same set of columns
– Derive popular columns from the workload
– Analogous to materialized views
File-Bundling (Otoo et al., SC’04)
Loading only columns with high yield at small cache sizes
[Figure: columns A through J accessed by queries Q1 through Q4; the cache holds columns B, C, H, and I]
Caching columns B, C, H, and I results in no cache hits
Solution: cache groups of columns
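The file-bundling effect can be sketched as a subset test: a query is served from the cache only if every column it needs is resident. The mapping from Q1 through Q4 to columns below is an assumption made for illustration, since the slide's diagram does not survive in text form.

```python
def is_hit(query_columns, cached_columns):
    """A query hits only when all of its columns are in the cache."""
    return set(query_columns) <= set(cached_columns)


# Hypothetical query-to-column mapping (not taken from the slide).
queries = {
    "Q1": {"A", "B", "C"},
    "Q2": {"C", "D", "E"},
    "Q3": {"F", "G", "H"},
    "Q4": {"H", "I", "J"},
}


def hits(cached_columns):
    """List the queries that can be answered entirely from the cache."""
    return [q for q, cols in queries.items() if is_hit(cols, cached_columns)]


# Four individually popular columns: every query still misses one column.
by_column = hits({"B", "C", "H", "I"})

# Three columns loaded as one group: Q1 is fully covered.
by_group = hits({"A", "B", "C"})
```

Under this assumed mapping, caching four popular columns serves no query, while caching one complete group of three serves Q1.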
Caching Groups of Columns
Existing schema-based caching models are static (e.g. CacheTables, MTCache, TimesTen)
– Do not account for dynamic workload access patterns
– Use the physical schema of the backend database or one defined a priori
– May group columns that are rarely used together
Query prototypes caching
– Identifies the best groupings from the workload
– Minimizes query execution cost against prototypes without sacrificing network savings
Query Prototype
Given a query qi, define the Query Access Set, QAS(qi), as the set of attributes accessed by qi
qi and qj share the same query prototype if they access the same attributes (QAS(qi) = QAS(qj))
Example:
SELECT objID
FROM Galaxy, SpecObj
WHERE objID = bestobjID and specclass = 2 and z between 0.121 and 0.127
QAS = {Galaxy:objID, SpecObj:bestobjID, SpecObj:specclass, SpecObj:z}
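A minimal sketch of the definition, assuming each query's access set has already been extracted (no SQL parsing is attempted here): two queries fall under the same prototype exactly when their attribute sets are equal, so a frozenset of attributes can serve as the prototype key. The query ids and the attributes of q3 are hypothetical.

```python
from collections import defaultdict


def prototype_of(qas):
    """Use the unordered set of accessed attributes as the prototype key."""
    return frozenset(qas)


def group_by_prototype(workload):
    """Map each prototype to the ids of the queries that share it."""
    groups = defaultdict(list)
    for query_id, qas in workload:
        groups[prototype_of(qas)].append(query_id)
    return dict(groups)


# q1 is the example above; q2 lists the same attributes in another order;
# q3 is a made-up query with a different access set.
workload = [
    ("q1", ["Galaxy:objID", "SpecObj:bestobjID", "SpecObj:specclass", "SpecObj:z"]),
    ("q2", ["SpecObj:z", "SpecObj:specclass", "SpecObj:bestobjID", "Galaxy:objID"]),
    ("q3", ["Galaxy:objID", "Galaxy:ra", "Galaxy:dec"]),
]
groups = group_by_prototype(workload)
```

Here q1 and q2 share one prototype regardless of attribute order, and q3 forms a second one.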
Query Prototype
[Figure: base tables R1 (A1, A2, A3) and R2 (B1, B2, B3); the cache holds one prototype per query: (A2, A3, B1) for Q1 and (B1, B2, B3) for Q2]
QAS(Q1) = {R1:A2, R1:A3, R2:B1}
QAS(Q2) = {R2:B1, R2:B2, R2:B3}
B1 is replicated in the cache
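The replication in this example can be checked with a short sketch that counts how many cached prototypes contain each column. The dictionary layout and function name are illustrative assumptions, not the authors' data structures.

```python
from collections import Counter

# Cached prototypes from the example, keyed by the query that defines them.
prototypes = {
    "Q1": ["R1:A2", "R1:A3", "R2:B1"],
    "Q2": ["R2:B1", "R2:B2", "R2:B3"],
}


def replicated_columns(prototypes):
    """Return columns stored in more than one cached prototype."""
    counts = Counter(col for cols in prototypes.values() for col in set(cols))
    return sorted(col for col, n in counts.items() if n > 1)


replicated = replicated_columns(prototypes)

# Space cost of replication: column copies stored versus distinct columns.
copies_stored = sum(len(cols) for cols in prototypes.values())
distinct_columns = len({col for cols in prototypes.values() for col in cols})
```

The cache stores six column copies for five distinct columns; the extra copy is the price of letting each query read a single self-contained prototype.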
Workload Properties
Read-only queries
One-month trace against the Sloan Digital Sky Survey (SDSS) Data Release 4 (2 TB)
1.4 million queries generating 360 GB of network traffic
1176 query prototypes describe the entire workload
11 prototypes capture 91% of the queries
6 prototypes generate 89% of the network traffic
Experiments
Evaluate caching of tables, columns, vertical partitions, and query prototypes
AutoPart (Papadomanolakis et al., SSDBM’04)
– An automated partitioning algorithm for large scientific databases
– Groups columns in order to improve query execution performance
– Produces the best workload-driven, static grouping
Network Savings
Cache Pollution
Query Performance
Discussion
Improving network and query execution performance are complementary goals
Columns should be grouped together at small cache sizes (cache hits suffer due to file-bundling)
Column groupings should be adaptive because
– Workload access pattern is dynamic
– Indexes are not available
Questions
???
Schema Reuse
• Localized to a small subset of tables
Schema Reuse
• Similar reuse among columns
Object Reuse
• Few objects are reused
SkyQuery
Federation middleware built at Hopkins
Wrapper/Mediator architecture using web services
Load Cost
[Figure: Object Load Cost by Unit of Cache Replacement. x-axis: Unit of Cache Replacement (Qry Prototype, Column); y-axis: # Writes/MB, 0 to 200]
Scan Cost
When scanning large tables, the useful region is a small fraction of the data read
Scans incur I/O overhead for accessing data from extraneous columns
Spatial locality among related columns
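To make the scan overhead concrete, the sketch below computes what fraction of the bytes read by a full scan belongs to columns the query actually uses. It assumes a row-oriented layout in which a scan reads every column of every row; the column names and byte widths are hypothetical.

```python
def useful_fraction(column_widths, needed):
    """Fraction of scanned bytes that belong to columns the query needs."""
    total = sum(column_widths.values())
    useful = sum(w for col, w in column_widths.items() if col in needed)
    return useful / total


# Hypothetical bytes per row for each column of one wide table.
wide_table = {"objID": 8, "ra": 8, "dec": 8, "spectrum": 4072}

# The same two needed columns stored on their own, as a column group would be.
column_group = {"objID": 8, "ra": 8}

needed = {"objID", "ra"}
fraction_wide = useful_fraction(wide_table, needed)     # 16 of 4096 bytes per row
fraction_group = useful_fraction(column_group, needed)  # every byte read is useful
```

Grouping only the related columns removes the extraneous bytes from the scan, which is the spatial locality argument on this slide.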
Join Cost
Joining results for queries that access multiple fragments
Access should be localized to few fragments to minimize join overhead
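One way to reason about join overhead is to count the fragments a query touches: under the assumption that rebuilding a result from k fragments takes k - 1 joins on a shared row key, fewer fragments means fewer joins. The fragment names and layouts below are hypothetical; the columns reuse the Q1 example from the Query Prototype slide.

```python
def fragments_touched(query_columns, fragments):
    """Names of the fragments that hold at least one column the query needs."""
    needed = set(query_columns)
    return [name for name, cols in fragments.items() if needed & set(cols)]


def joins_needed(query_columns, fragments):
    """Assumed cost model: k touched fragments require k - 1 joins."""
    return max(len(fragments_touched(query_columns, fragments)) - 1, 0)


q1 = {"A2", "A3", "B1"}

# Layout 1: every column cached as its own fragment.
per_column = {"f_A2": {"A2"}, "f_A3": {"A3"}, "f_B1": {"B1"}}

# Layout 2: one fragment shaped like Q1's query prototype.
per_prototype = {"p_Q1": {"A2", "A3", "B1"}}

joins_per_column = joins_needed(q1, per_column)
joins_per_prototype = joins_needed(q1, per_prototype)
```

With per-column fragments Q1 needs two joins; with a single prototype fragment it needs none.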