Simon Zeltser
Towards Declarative Queries on Adaptive Data Structures
Based on the article by Nicolas Bruno and Pablo Castro
Seminar in Database Systems Technion
Contents
1. Introduction
2. LINQ on Rich Data Structures
3. LINQ Query Optimization
4. Conclusions and Discussion
Introduction
THE PROBLEM
  There is an increasing number of applications that need to manage data outside the DBMS
  A solution is needed to simplify the interaction between objects and data sources
  Current solutions lack a rich declarative query mechanism
THE NEED
  A unified way to query various data sources
THE SOLUTION
  LINQ (Language Integrated Query)
Introduction
LINQ: Microsoft .NET 3.5 solution
  Access multiple data sources via the same API
  Technology integrated into the programming language
  Supported operations:
    Traversal – grouping, joins
    Filter – which rows
    Projection – which columns

var graduates = from student in students
                where student.Degree == "Graduate"
                orderby student.Name, student.Gender, student.Age
                select student;

BUT…
  The default implementation is simplistic
  Appropriate for small ad-hoc structures in memory
Introduction
THE GOAL OF THIS SESSION
  Introduce LINQ's key principles
  Show a model for customizing LINQ's execution model on rich data structures
  Evaluate the results
LINQ – High Level Architecture
[Figure: languages (C# 3.0, Visual Basic, other languages) sit on top of .NET Language Integrated Query (LINQ), which reaches the LINQ-enabled data sources through LINQ to Objects (objects), LINQ to DataSets, LINQ to SQL and LINQ to Entities (databases), and LINQ to XML (XML)]
Compare two approaches

Iteration
List<string> matches = new List<string>();
// Find the matches
foreach (string item in data) {
    if (item.StartsWith("Eric")) {
        matches.Add(item);
    }
}
// Sort the matches
matches.Sort();
// Print out the matches
foreach (string item in matches) {
    Console.WriteLine(item);
}

LINQ
// Find and sort matches
var matches = from n in data
              where n.StartsWith("Eric")
              orderby n
              select n;
// Print out the matches
foreach (var match in matches) {
    Console.WriteLine(match);
}
Language Integration

Query Syntax
var matches = from n in data
              where n.StartsWith("Eric")
              orderby n
              select n;

Extension Methods
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

Lambda Expressions
Function:
int StringLength(string s) { return s.Length; }
Lambda expression:
s => s.Length
var matches = data
    .Where(n => n.StartsWith("Eric"))
    .OrderBy(n => n)
    .Select(n => n);

Anonymous Types
var name = "Eric";
var age = 43;
var person = new { Name = "Eric", Age = 43 };
var names = new[] { "Eric", "Ryan", "Paul" };
foreach (var item in names)
LINQ - Example

// Retrieve all CS students with more than 105 points
var query = from stud in students
            where (stud.Faculty == "CS" && stud.Points > 105)
            orderby stud.Points descending
            select new { Details = stud.Name + ":" + stud.Phone };

// Iterate over results
foreach (var student in query) {
    Console.WriteLine(student.Details);
}
Features used in the example: lambda expressions, query syntax, extension methods, anonymous types
Customizing LINQ Execution Model
EXPRESSION TREES
  LINQ represents queries as in-memory abstract syntax trees
  Query description and implementation are not tied together
THE PROBLEM
  The default implementation of the operations uses fixed, general-purpose algorithms
SUGGESTED SOLUTION
  Change how the query is executed without changing how it is expressed
  Analyze alternative implementations of a given query and dynamically choose the most appropriate version depending on the context
[Figure: example expression tree with the operators + and * over the leaves 1, 5 and 7]
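To make the separation between query description and implementation concrete, here is a minimal C# sketch that captures a predicate as an expression tree, inspects it, and only then compiles it. The class name ExpressionTreeDemo and the predicate x < 5 are chosen for illustration; they are not taken from the article's prototype.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionTreeDemo
{
    // The lambda is stored as data (an expression tree), not as compiled code.
    public static readonly Expression<Func<int, bool>> Predicate = x => x < 5;

    public static void Main()
    {
        // Inspect the description: the body is a binary "LessThan" node.
        var body = (BinaryExpression)Predicate.Body;
        Console.WriteLine(body.NodeType);   // LessThan
        Console.WriteLine(body.Left);       // x
        Console.WriteLine(body.Right);      // 5

        // Choose an implementation later: here, compile to a delegate and run it.
        Func<int, bool> compiled = Predicate.Compile();
        Console.WriteLine(compiled(3));     // True
        Console.WriteLine(compiled(10));    // False
    }
}
```

An optimizer can walk the same tree and rewrite it before compilation, which is the hook a customized execution model relies on.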
Customizing LINQ Execution Model (2)
PROBLEM EXAMPLE
  The WHERE operator is implemented by performing a sequential scan over the input and evaluating the selection predicate on each tuple!

1. The query
int[] A = {1, 2, 3, 10, 20, 30};
var q = from x in A
        where x < 5
        select 2*x;
foreach (int i in q)
    Console.WriteLine(i);

2. The query as extension-method calls
var q = A.Where(x => x < 5).Select(x => 2*x);

3. Query Implementation:
bool AF1(int x) { return x < 5; }
int AF2(int x) { return 2*x; }
IEnumerable<int> q = Enumerable.Select(Enumerable.Where(A, AF1), AF2);

The default operators behave like this sequential scan:
List<int> res = new List<int>();
foreach (int a in A)
    if (AF1(a)) res.Add(AF2(a));
return res;
Rich Data Structures - DataSet
[Figure: a DataSet object contains DataTable objects; a DataTable is made of DataRows and DataColumns and carries UniqueConstraints and ForeignKeyConstraints]
  In-memory cache of data
  Typically populated from a database
  Supports indexing of DataColumns via DataViews
We will use LINQ on DataSet to demonstrate query optimization techniques
LINQ on Rich Data Structures
Enable LINQ to work over DataSets.
EXAMPLE
  Given R and S, two DataTables:

from r in R.AsEnumerable()
join s in S.AsEnumerable()
  on r.Field<int>("x") equals s.Field<int>("y")
select new { a = r.Field<int>("a"), b = s.Field<int>("b") };
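The join over R and S can be tried end to end with the small C# sketch below. The table contents and the MakeTable helper are made up for illustration; AsEnumerable and Field<T> are the standard LINQ to DataSet extension methods (on .NET Framework they need a reference to System.Data.DataSetExtensions).

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class DataSetJoinDemo
{
    // Builds a two-column integer DataTable; a helper for this example only.
    static DataTable MakeTable(string name, string col1, string col2, int[,] rows)
    {
        var t = new DataTable(name);
        t.Columns.Add(col1, typeof(int));
        t.Columns.Add(col2, typeof(int));
        for (int i = 0; i < rows.GetLength(0); i++)
            t.Rows.Add(rows[i, 0], rows[i, 1]);
        return t;
    }

    public static List<string> Run()
    {
        // R(x, a) and S(y, b) with illustrative rows.
        DataTable R = MakeTable("R", "x", "a", new[,] { { 1, 10 }, { 2, 20 } });
        DataTable S = MakeTable("S", "y", "b", new[,] { { 2, 200 }, { 3, 300 } });

        var q = from r in R.AsEnumerable()
                join s in S.AsEnumerable()
                  on r.Field<int>("x") equals s.Field<int>("y")
                select new { a = r.Field<int>("a"), b = s.Field<int>("b") };

        return q.Select(row => row.a + " " + row.b).ToList();
    }

    public static void Main()
    {
        foreach (string line in Run())
            Console.WriteLine(line);   // 20 200
    }
}
```

Only the rows with x = y = 2 match, so the query yields a single result.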
LINQ on DataSet
[Figure: compile-time and run-time phases in an implementation of our prototype. Labels: Standard C# Code, Intermediate Language, Expression Tree (compile time); Expression Tree Optimizer, Optimized Expression Tree, Intermediate Language, DataSet Self-tuning State (run time). Optimizer components: Cost Model, Query Cost Estimator, Statistics Manager, Self Tuning Organizer, Query Analyzer, Index Reorganizer, Oscillation Manager]
Our solution is built according to this architecture.
Query Cost Estimator
Query Estimation - Cost Model
  Follows the traditional database approach:
    COST: {execution plans} -> [expected execution time]
  Relies on:
    a set of statistics maintained in DataTables for some of their columns
    formulas to estimate the selectivity of predicates and the cardinality of sub-plans
    formulas to estimate the expected cost of query execution for every operator
Cardinality Estimation
  Returns an approximate number of rows that each operator in a query plan would output
  To reduce the overhead, we use only these statistical estimators:
    maxVal – maximum value in a column
    minVal – minimum value in a column
    dVal – number of distinct values in a column
  If statistics are unavailable, rely on "magic numbers" until statistics are created automatically
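Predicate Selectivity Estimation
Let $\sigma_p(T)$ be a selection with predicate $p$ over an arbitrary expression $T$.
The cardinality of $\sigma_p(T)$ is defined as:
  $\mathrm{Card}(\sigma_p(T)) = \mathrm{sel}(p) \cdot \mathrm{Card}(T)$
Under this definition we define:
  $\mathrm{COST}(\text{Execution Plan}) = \sum_{p} \mathrm{COST}(p)$, for each $p$ in {operators of $T$}
EXAMPLE: Consider a full table scan of table $T$:
  $\mathrm{COST}(T) = \mathrm{Card}(T) \cdot \mathrm{MEM\_ACCESS\_COST}$, where MEM_ACCESS_COST is the average cost of a memory access

Selectivity estimation per predicate:
  $\mathrm{sel}(p_1 \wedge p_2) = \mathrm{sel}(p_1) \cdot \mathrm{sel}(p_2)$
  $\mathrm{sel}(p_1 \vee p_2) = \mathrm{sel}(p_1) + \mathrm{sel}(p_2) - \mathrm{sel}(p_1 \wedge p_2)$
  $\mathrm{sel}(c = c_0) = (dVal(c))^{-1}$
  $\mathrm{sel}(c_0 \le c \le c_1) = \dfrac{c_1 - c_0}{maxVal(c) - minVal(c)}$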
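Intuition:
  We model $\mathrm{sel}(c_0 \le c \le c_1)$ as the probability of getting a value of c in the interval $[c_0, c_1]$ among all possible values of c.
[Figure: axis for c running from minVal(c) to maxVal(c), with the interval from c0 to c1 marked]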
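The selectivity rules can be sketched in a few lines of C#. This is an illustration under stated assumptions, not the article's implementation: the ColumnStats and Selectivity types are hypothetical, the range rule assumes values spread uniformly between minVal and maxVal (following the intuition above), and the clamping to [0, 1] is an addition of this sketch.

```csharp
using System;

// Hypothetical per-column statistics; the field names mirror the estimators.
public class ColumnStats
{
    public double MinVal, MaxVal, DVal;
}

public static class Selectivity
{
    // sel(c = c0) = 1 / dVal(c)
    public static double Equal(ColumnStats c) => 1.0 / c.DVal;

    // sel(c0 <= c <= c1) = (c1 - c0) / (maxVal(c) - minVal(c)), clamped to [0, 1].
    // Assumes maxVal > minVal.
    public static double Range(ColumnStats c, double c0, double c1) =>
        Math.Max(0.0, Math.Min(1.0, (c1 - c0) / (c.MaxVal - c.MinVal)));

    // sel(p1 AND p2) = sel(p1) * sel(p2)
    public static double And(double s1, double s2) => s1 * s2;

    // sel(p1 OR p2) = sel(p1) + sel(p2) - sel(p1 AND p2)
    public static double Or(double s1, double s2) => s1 + s2 - And(s1, s2);

    // Card(sigma_p(T)) = sel(p) * Card(T)
    public static double Card(double sel, double cardT) => sel * cardT;
}

public static class SelectivityDemo
{
    public static void Main()
    {
        // Illustrative statistics for one column of a 10,000-row table.
        var a = new ColumnStats { MinVal = 0, MaxVal = 100, DVal = 50 };
        double selEq = Selectivity.Equal(a);             // 1/50
        double selRange = Selectivity.Range(a, 10, 30);  // 20/100

        // Estimated output size of a conjunction of both predicates: about 40 rows.
        Console.WriteLine(Selectivity.Card(Selectivity.And(selEq, selRange), 10000));
    }
}
```

Keeping only three numbers per column is what makes the estimator cheap enough to run inside an in-memory library.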
Consider now a join predicate: $T_1 \bowtie_{c_1 = c_2} T_2$
$$\mathrm{Card}(T_1 \bowtie_{c_1 = c_2} T_2) = \min(dVal(c_1), dVal(c_2)) \cdot \frac{\mathrm{Card}(T_1)}{dVal(c_1)} \cdot \frac{\mathrm{Card}(T_2)}{dVal(c_2)}$$
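A worked instance may help read the join estimate, which multiplies the smaller distinct-value count by the average number of rows per distinct value on each side. The table sizes below are the ones generated for the experimental evaluation; the dVal figures are assumptions made up for this example.

```latex
% Join-cardinality estimate:
%   Card(T1 join T2) = min(dVal(c1), dVal(c2)) * Card(T1)/dVal(c1) * Card(T2)/dVal(c2)
% Assumed statistics (illustrative only):
%   Card(Carts) = 5000,      dVal(Carts.cu_id)  = 2500
%   Card(Customers) = 50000, dVal(Customers.id) = 50000
\begin{align*}
\mathrm{Card}(\mathit{Carts} \bowtie_{cu\_id = id} \mathit{Customers})
  &= \min(2500,\ 50000) \cdot \frac{5000}{2500} \cdot \frac{50000}{50000} \\
  &= 2500 \cdot 2 \cdot 1 \\
  &= 5000
\end{align*}
```

Under these assumptions every cart item matches exactly one customer, so the estimate equals the number of cart items.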
Query Analyzer
Execution Alternatives
  Rely on indexes on DataColumns when possible
  Example: $\sigma_{a=7 \wedge (b+c)<20}$
  Alternative 1: full table scan, evaluating a=7 and b+c < 20 on every row
  Alternative 2: use the index on the "a" column to find the rows with a=7, then evaluate b+c < 20 on those rows
[Figure: a table with columns c, b, a shown under both alternatives, with a tree-shaped index on the "a" column for Alternative 2]
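The two alternatives can be written out against a DataTable as follows. This is a sketch under stated assumptions: the rows are made up, and a DataView sorted on "a" stands in for the index; it is not necessarily how the prototype implements index access.

```csharp
using System;
using System.Data;
using System.Linq;

public static class ExecutionAlternativesDemo
{
    // Small illustrative table T(a, b, c); the values are made up.
    public static DataTable BuildTable()
    {
        var t = new DataTable("T");
        t.Columns.Add("a", typeof(int));
        t.Columns.Add("b", typeof(int));
        t.Columns.Add("c", typeof(int));
        t.Rows.Add(7, 1, 3);
        t.Rows.Add(2, 3, 76);
        t.Rows.Add(7, 14, 5);
        t.Rows.Add(7, 4, 23);
        return t;
    }

    // Alternative 1: full table scan, both predicates evaluated on every row.
    public static int CountByScan(DataTable t)
    {
        return t.AsEnumerable().Count(r =>
            r.Field<int>("a") == 7 &&
            r.Field<int>("b") + r.Field<int>("c") < 20);
    }

    // Alternative 2: look up a = 7 through a sorted DataView (an index on "a"),
    // then evaluate the residual predicate only on the matching rows.
    public static int CountByIndex(DataTable t)
    {
        var indexOnA = new DataView(t) { Sort = "a" };
        return indexOnA.FindRows(7).Count(v => (int)v["b"] + (int)v["c"] < 20);
    }

    public static void Main()
    {
        DataTable t = BuildTable();
        Console.WriteLine(CountByScan(t));    // 2
        Console.WriteLine(CountByIndex(t));   // 2
    }
}
```

Both alternatives return the same rows; they differ only in how many rows have to be touched, which is what the cost model compares.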
Analyzing Execution Plans
  Global vs. Local Execution Plan
  EXAMPLE:
[Figure: plan tree Join(Products, Join(Carts, Filter(Customers))), labeled "Global Execution Plan"; for a single join, the "Local Execution Plan" choices are HashJoin? IndexJoin? MergeJoin?]
Enumeration Architecture
  Two phases:
    First phase: join reordering based on estimated cardinalities
    Second phase: choose the best physical implementation for each operator
  EXAMPLE: Suppose we analyze the JOIN operator. We evaluate the following JOIN implementations:
    Hash Join
    Merge Join (inputs must be sorted on the join columns)
    Index Join (an index on the inner join column must be available)
    Other possible calculation options
  Choose the alternative with the smallest cost
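The second phase can be pictured as "cost every applicable implementation, keep the cheapest". The C# sketch below does this for a single join. The cost formulas are placeholders invented for illustration (the slides do not give per-operator formulas), and the names JoinChoice, CandidateCosts and Best are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class JoinChoice
{
    // Illustrative cost formulas only; they are assumptions of this sketch.
    public static Dictionary<string, double> CandidateCosts(
        double outerRows, double innerRows, bool inputsSorted, bool innerIndexed)
    {
        var costs = new Dictionary<string, double>
        {
            // Build a hash table on the inner input, then probe once per outer row.
            ["HashJoin"] = 2.0 * innerRows + outerRows
        };
        if (inputsSorted)    // merge join is only applicable to sorted inputs
            costs["MergeJoin"] = outerRows + innerRows;
        if (innerIndexed)    // index join needs an index on the inner join column
            costs["IndexJoin"] = outerRows * Math.Log(innerRows + 1, 2);
        return costs;
    }

    // Choose the applicable alternative with the smallest estimated cost.
    public static string Best(Dictionary<string, double> costs) =>
        costs.OrderBy(kv => kv.Value).First().Key;

    public static void Main()
    {
        var costs = CandidateCosts(outerRows: 50, innerRows: 200000,
                                   inputsSorted: false, innerIndexed: true);
        Console.WriteLine(Best(costs));   // IndexJoin
    }
}
```

With a small outer input and an indexed inner table, probing the index is far cheaper than hashing the whole inner table, so the index join wins in this example.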
Query Analysis
Self Tuning Organization
  We want to reach the smallest query execution time.
  Indexes can be used to speed up query execution.
PROBLEM:
  It is hard to forecast in advance which indexes to build for optimum performance.
SOLUTION:
  A continuous monitoring/tuning component that addresses the challenge of choosing and building adequate indexes and statistics automatically.
Self Tuning Organization - Example
Consider the following execution plan:
  The selection predicate Name="Pam" over the Customers DataTable can be improved if an index on Customers(Name) is built.
  Both hash joins can be improved if indexes I2 and I3 are available, since we can transform a hash join into an index join.
  * The three sub-plans enclosed in dotted lines might be improved if suitable indexes were present.
Index Tuning
High-Level Description:
  Identify a good set of candidate indexes that would improve performance if they were available.
  Later, when the optimized queries are evaluated, aggregate the relative benefits of both candidate and existing indexes.
  Based on this information, periodically trigger index creations or deletions, taking into account storage constraints, the overall utility of the resulting indexes, and the cost of creating and maintaining them.
Index tuning algorithm
Notation:
  H – a set of candidate indexes to materialize
  T – task set for query qi
  Ii – either a candidate or an existing index
  δIi – amount by which Ii would speed up query qi
[Figure: the task set holds the pairs (I1, δI1), (I2, δI2), …, (In, δIn); H is initially empty]
Index tuning algorithm
Notation:
  ΔI – value maintained for each index I
  Materialized index – an index that has already been created
  SELECT query: ΔI = ΔI + δI
  UPDATE query: ΔI = ΔI – δI
[Figure: the task set holds the pairs (I1, δI1), (I2, δI2), …, (In, δIn); the entry (I1, δI1) is shown together with H and I1]
Index Tuning Algorithm
The purpose of ΔI:
  We maintain ΔI on every query evaluation.
  If the potential aggregated benefit of materializing a candidate index exceeds its creation cost, we should create it, since we have gathered enough evidence that the index is useful.
Index tuning algorithm
Remove "bad" indexes phase
Notation:
  Δmin – minimum Δ value for index I
  Δmax – maximum Δ value for index I
  BI – the cost of creating index I
  Residual(I) = BI – (Δmax – Δ)
    (the "slack" an index has before being deemed "droppable")
    IF (Residual(I) <= 0) THEN Drop(I)
  Net-Benefit(I) = (Δ – Δmin) – BI
    (the benefit from creating the index)
    IF (Net-Benefit(I) >= 0) THEN Add(I)
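The bookkeeping behind Δ, Residual and Net-Benefit fits in a small class. The C# sketch below is illustrative: the IndexStats type and its members are hypothetical, and it assumes Δmin and Δmax are the running minimum and maximum of Δ since tracking started, which the slides do not spell out.

```csharp
using System;

// Hypothetical bookkeeping for one index I; member names follow the notation above.
public class IndexStats
{
    public double Delta, DeltaMin, DeltaMax;   // Δ, Δmin, Δmax (all start at 0)
    public double BuildCost;                   // B_I, the cost of creating the index
    public bool Materialized;                  // true once the index exists

    // SELECT queries add their estimated speed-up δ; UPDATE queries subtract it.
    public void Record(double delta, bool isUpdate)
    {
        Delta += isUpdate ? -delta : delta;
        DeltaMin = Math.Min(DeltaMin, Delta);
        DeltaMax = Math.Max(DeltaMax, Delta);
    }

    // Residual(I) = B_I - (Δmax - Δ): slack left before the index is droppable.
    public double Residual => BuildCost - (DeltaMax - Delta);

    // Net-Benefit(I) = (Δ - Δmin) - B_I: accumulated benefit beyond the build cost.
    public double NetBenefit => (Delta - DeltaMin) - BuildCost;

    public bool ShouldDrop => Materialized && Residual <= 0;
    public bool ShouldCreate => !Materialized && NetBenefit >= 0;
}

public static class IndexTuningDemo
{
    public static void Main()
    {
        var candidate = new IndexStats { BuildCost = 100 };
        for (int q = 0; q < 4; q++)
            candidate.Record(30, isUpdate: false);   // four SELECTs, δ = 30 each

        Console.WriteLine(candidate.NetBenefit);     // 20
        Console.WriteLine(candidate.ShouldCreate);   // True
    }
}
```

After four queries the accumulated benefit (120) exceeds the build cost (100), so the candidate becomes worth materializing; a run of updates on a materialized index would push Residual toward zero in the same way.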
Index tuning algorithm
Notation:
  ITM – all the indexes from H whose creation is cost effective
  ITD – a subset of the existing indexes such that:
    ITD fits in existing memory
    it is still cost effective to create the new index I after possibly dropping members of ITD
  If creating index I is more effective than maintaining the existing indexes in ITD, DROP(ITD) && CREATE(I)
  Remove I from H (the set of candidate indexes to materialize)
Experimental Evaluation
Consider the following schema:

Generated:
  200,000 products
  50,000 customers
  1,000 categories
  5,000 items in the shopping carts

Possible Indexes
  I1 Categories(par_id)
  I2 Products(c_id)
  I3 Carts(cu_id)
  I4 Products(ca_id)
  I5 Customers(name)

checkCarts($1) =
  from p in Products.AsEnumerable()
  join cart in Carts.AsEnumerable()
    on p.Field<int>("id") equals cart.Field<int>("p_id")
  join c in Customers.AsEnumerable()
    on cart.Field<int>("cu_id") equals c.Field<int>("id")
  where c.name == $1
  select new { cart, p }

browseProducts($1) =
  from p in Products.AsEnumerable()
  join c in Categories.AsEnumerable()
    on p.Field<int>("ca_id") equals c.Field<int>("id")
  where c.par_id == $1
  select p
Experimental Evaluation – Cont.
Generated schedule when tuning was disabled
Experimental Evaluation – Cont.
Generated schedule when tuning was enabled
Summary
We've discussed:
  LINQ – for declarative query formulation
  DataSet – a uniform way of representing in-memory data
  A lightweight optimizer for automatically adjusting query execution strategies
Article's main contribution:
  NOT a new query processing technique
  BUT: careful engineering of traditional database concepts in a new context
LINQ Execution Model
Steps:
  Compiler merges LINQ extension methods
  Query syntax is converted to function calls and lambda expressions
  Lambda expressions are converted to expression trees
  Compiler finds a query pattern
  Query is executed lazily
  Compiler infers types produced by queries
Notes:
  Adds query operations to IEnumerable<T>
  At compile time
  Expressions are evaluated at run-time
  Parsed and type checked at compile-time
  Datasets are strongly typed
  Operations on data sets are strongly typed
  Specialized or base
  Can optimize and re-write query
  Expressions and operations can execute remotely
  At run-time, when results are used
  We can force evaluations (ToArray())
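Lazy execution and forced evaluation can be observed directly with LINQ to Objects. In this small C# sketch (the data and the class name are illustrative), the query is defined once, ToArray() takes a snapshot, and a later change to the source is visible only through the lazy query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    // Returns { snapshot length, lazy count } after the source has changed.
    public static int[] Run()
    {
        var data = new List<int> { 1, 2, 3 };

        // Defining the query does not run it.
        IEnumerable<int> lazy = data.Where(x => x < 5).Select(x => 2 * x);

        // ToArray() forces evaluation now and keeps the results.
        int[] snapshot = lazy.ToArray();

        data.Add(4);   // the source changes after the query was defined

        // The lazy query is re-evaluated when its results are used.
        return new[] { snapshot.Length, lazy.Count() };
    }

    public static void Main()
    {
        int[] counts = Run();
        Console.WriteLine(counts[0]);   // 3
        Console.WriteLine(counts[1]);   // 4
    }
}
```

Because nothing runs until results are consumed, an optimizer still has the chance to rewrite the query at run time, using whatever indexes and statistics exist at that moment.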