
Simon Zeltser

Towards Declarative Queries on Adaptive Data Structures

Based on the article by Nicolas Bruno and Pablo Castro

Seminar in Database Systems Technion

Contents

1. Introduction
2. LINQ on Rich Data Structures
3. LINQ Query Optimization
4. Conclusions and Discussion

Introduction

THE PROBLEM
• There is an increasing number of applications that need to manage data outside the DBMS
• Need for a solution to simplify the interaction between objects and data sources
• Current solutions lack a rich declarative query mechanism

THE NEED
• Unified way to query various data sources

THE SOLUTION
• LINQ (Language Integrated Query)


Introduction

LINQ: the Microsoft .NET 3.5 solution
• Accessing multiple data sources via the same API
• Technology integrated into the programming language
• Supports operations:
  • Traversal: grouping, joins
  • Filter: which rows
  • Projection: which columns

    var graduates = from student in students
                    where student.Degree == "Graduate"
                    orderby student.Name, student.Gender, student.Age
                    select student;

BUT…
• The default implementation is simplistic
• Appropriate for small ad-hoc structures in memory


Introduction

THE GOAL OF THIS SESSION
• Introduce LINQ key principles
• Show a model for customizing LINQ's execution model on rich data structures
• Evaluate the results


LINQ - High Level Architecture

[Figure: LINQ high-level architecture. Top layer: C# 3.0, Visual Basic, other languages. Middle layer: .NET Language Integrated Query (LINQ). LINQ-enabled data sources: LINQ to Objects, LINQ to DataSets, LINQ to SQL, LINQ to Entities, LINQ to XML. Bottom layer: objects, databases, XML.]

Compare two approaches

Iteration

    List<string> matches = new List<string>();

    // Find the matches
    foreach (string item in data) {
        if (item.StartsWith("Eric")) {
            matches.Add(item);
        }
    }

    // Sort the matches
    matches.Sort();

    // Print out the matches
    foreach (string item in matches) {
        Console.WriteLine(item);
    }

LINQ

    // Find and sort matches
    var matches = from n in data
                  where n.StartsWith("Eric")
                  orderby n
                  select n;

    // Print out the matches
    foreach (var match in matches) {
        Console.WriteLine(match);
    }


Language Integration

Function

    int StringLength(string s) { return s.Length; }

Lambda Expression

    s => s.Length

Query Syntax

    var matches = from n in data
                  where n.StartsWith("Eric")
                  orderby n
                  select n;

Extension Methods

    public static IEnumerable<TSource> Where<TSource>(
        this IEnumerable<TSource> source,
        Func<TSource, bool> predicate)

    var matches = data
        .Where(n => n.StartsWith("Eric"))
        .OrderBy(n => n)
        .Select(n => n);

Anonymous Types

    var name = "Eric";
    var age = 43;
    var person = new { Name = "Eric", Age = 43 };
    var names = new [] { "Eric", "Ryan", "Paul" };
    foreach (var item in names)


LINQ - Example

    // Retrieve all CS students with more
    // than 105 points
    var query =
        from stud in students
        where (stud.Faculty == "CS" && stud.Points > 105)
        orderby stud.Points descending
        select new { Details = stud.Name + ":" + stud.Phone };

    // Iterate over results
    foreach (var student in query) {
        Console.WriteLine(student.Details);
    }

Language features used in this example: Lambda Expressions, Query Syntax, Extension Methods, Anonymous Types

Customizing LINQ Execution Model

EXPRESSION TREES
• LINQ represents queries as an in-memory abstract syntax tree
• Query description and implementation are not tied together

THE PROBLEM
• The default implementation of the operations uses fixed, general-purpose algorithms

SUGGESTED SOLUTION
• Change how the query is executed without changing how it's expressed
• Analyze alternative implementations of a given query and dynamically choose the most appropriate version depending on the context

[Figure: expression tree with operator nodes + and * over the constants 1, 5 and 7]
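To make the expression-tree idea concrete, here is a minimal sketch that builds a small tree by hand, inspects its nodes and only then compiles and runs it. The exact shape of the tree on the slide is not recoverable, so the shape `1 + 5 * 7` is an assumption, and the class and method names are hypothetical.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionTreeDemo
{
    // Builds the tree for 1 + (5 * 7) node by node.
    // ASSUMPTION: the tree shape is chosen for illustration only.
    public static BinaryExpression BuildTree()
    {
        return Expression.Add(
            Expression.Constant(1),
            Expression.Multiply(Expression.Constant(5), Expression.Constant(7)));
    }

    // The tree is only a description; it runs after being compiled to a delegate.
    public static int Evaluate(Expression body)
    {
        return Expression.Lambda<Func<int>>(body).Compile()();
    }

    public static void Main()
    {
        BinaryExpression tree = BuildTree();
        Console.WriteLine(tree.NodeType);        // root operator
        Console.WriteLine(tree.Right.NodeType);  // operator of the right sub-tree
        Console.WriteLine(Evaluate(tree));       // value of the whole expression
    }
}
```

Because the description (the tree) is separate from the execution (the compiled delegate), an optimizer can inspect or rewrite the tree in between.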

Customizing LINQ Execution Model (2)

PROBLEM EXAMPLE
• The WHERE operator is implemented by performing a sequential scan over the input and evaluating the selection predicate on each tuple!

    int[] A = {1, 2, 3, 10, 20, 30};
    var q = from x in A
            where x < 5
            select 2*x;

    foreach (int i in q)
        Console.WriteLine(i);

1. The query expression becomes method calls with lambda expressions:

    var q = A.Where(x => x < 5).Select(x => 2*x);

2. The lambdas become functions passed to the standard operators:

    bool AF1(int x) { return x < 5; }
    int AF2(int x) { return 2*x; }

    IEnumerable<int> q = Enumerable.Select(Enumerable.Where(A, AF1), AF2);

3. Query Implementation:

    List<int> res = new List<int>();
    foreach (int a in A)
        if (AF1(a)) res.Add(AF2(a));
    return res;
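The three forms above can be checked against each other with a small, self-contained sketch. The class and method names are hypothetical, and the manual loop is only an eager approximation of what the default operators do (the real operators are lazy).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WhereSelectDemo
{
    static bool AF1(int x) { return x < 5; }
    static int AF2(int x) { return 2 * x; }

    // Form 1: query syntax.
    public static List<int> QuerySyntax(int[] a)
    {
        return (from x in a where x < 5 select 2 * x).ToList();
    }

    // Form 2: explicit calls to the standard operators with named functions.
    public static List<int> MethodCalls(int[] a)
    {
        Func<int, bool> predicate = AF1;
        Func<int, int> projection = AF2;
        return Enumerable.Select(Enumerable.Where(a, predicate), projection).ToList();
    }

    // Form 3: the sequential scan that the default implementation amounts to.
    public static List<int> ManualScan(int[] a)
    {
        var res = new List<int>();
        foreach (int v in a)
            if (AF1(v)) res.Add(AF2(v));
        return res;
    }

    public static void Main()
    {
        int[] A = { 1, 2, 3, 10, 20, 30 };
        Console.WriteLine(string.Join(",", QuerySyntax(A)));
        Console.WriteLine(string.Join(",", MethodCalls(A)));
        Console.WriteLine(string.Join(",", ManualScan(A)));
    }
}
```

All three calls visit every element of A once, which is exactly the fixed strategy the following slides try to improve on.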

Rich Data Structures - DataSet

[Figure: DataSet object model. A DataSet object contains DataTable objects; a DataTable consists of DataRows and DataColumns and can carry UniqueConstraints; a ForeignKeyConstraint relates DataTables.]

• In-memory cache of data
• Typically populated from a database
• Supports indexing of DataColumns via DataViews

We will use LINQ on DataSet to demonstrate query optimization techniques

LINQ on Rich Data Structures

Enable LINQ to work over DataSets.

EXAMPLE
Given R and S, two DataTables:

    from r in R.AsEnumerable()
    join s in S.AsEnumerable()
      on r.Field<int>("x") equals s.Field<int>("y")
    select new { a = r.Field<int>("a"), b = s.Field<int>("b") };
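A runnable version of this join might look as follows. The table contents, the helper `MakeTable` and the class name are made up for illustration; on .NET Framework the `AsEnumerable` and `Field<T>` extension methods additionally require a reference to System.Data.DataSetExtensions.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class DataSetJoinDemo
{
    // Hypothetical helper: builds a two-column integer DataTable.
    public static DataTable MakeTable(string name, string col1, string col2,
                                      params (int, int)[] rows)
    {
        var t = new DataTable(name);
        t.Columns.Add(col1, typeof(int));
        t.Columns.Add(col2, typeof(int));
        foreach (var (v1, v2) in rows) t.Rows.Add(v1, v2);
        return t;
    }

    // The join from the slide: R.x = S.y, projecting R.a and S.b.
    public static List<(int a, int b)> Join(DataTable R, DataTable S)
    {
        var query =
            from r in R.AsEnumerable()
            join s in S.AsEnumerable()
              on r.Field<int>("x") equals s.Field<int>("y")
            select new { a = r.Field<int>("a"), b = s.Field<int>("b") };
        return query.Select(p => (p.a, p.b)).ToList();
    }

    public static void Main()
    {
        // Sample data, invented for this sketch.
        DataTable R = MakeTable("R", "x", "a", (1, 10), (2, 20), (3, 30));
        DataTable S = MakeTable("S", "y", "b", (2, 200), (3, 300), (4, 400));
        foreach (var row in Join(R, S))
            Console.WriteLine(row.a + " " + row.b);
    }
}
```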

[Figure: compile-time and run-time phases in an implementation of our prototype. Boxes shown: LINQ on DataSet, Standard C# Code, Interm. Language, Expression Tree, Optimized Expression Tree, Interm. Language, DataSet Self-tuning State. The phases are split into Compile Time and Run Time.]

Our solution will be built according to the following architecture

[Figure: Expression Tree Optimizer architecture. Components shown: Query Cost Estimator, Cost Model, Statistics Manager, Self Tuning Organizer, Query Analyzer, Index Reorganizer, Oscillation Manager.]

Query Cost Estimator


Query Estimation - Cost Model

Follow the traditional database approach:
• COST: {execution plans} -> [expected execution time]
• Relies on:
  • a set of statistics maintained in DataTables for some of their columns
  • formulas to estimate the selectivity of predicates and the cardinality of sub-plans
  • formulas to estimate the expected cost of query execution for every operator


Cardinality Estimation
• Returns an approximate number of rows that each operator in a query plan would output
• To reduce the overhead, we will use only these statistical estimators:
  • maxVal: maximum value in a column
  • minVal: minimum value in a column
  • dVal: number of distinct values in a column
• If statistics are unavailable, rely on "magic numbers" until statistics are created automatically


Predicate Selectivity Estimation

Let σp(T) be an arbitrary selection expression with predicate p over table T.
The cardinality of its result is defined as:

    $\mathrm{Card}(\sigma_p(T)) = \mathrm{sel}(p) \cdot \mathrm{Card}(T)$

Under this definition we define:

    $\mathrm{COST}_T(\text{Execution Plan}) = \sum_{p} \mathrm{COST}(p)$, for each p in {operators of T}

EXAMPLE: Consider a full table scan of table T:

    $\mathrm{COST}(T) = \mathrm{Card}(T) \cdot \mathrm{MEM\_ACCESS\_COST}$

where MEM_ACCESS_COST is the average cost of a memory access.

Predicate           Selectivity Estimation
p1 ∧ p2             sel(p1) · sel(p2)
p1 ∨ p2             sel(p1) + sel(p2) − sel(p1 ∧ p2)
c = c0              1 / dVal(c)
c0 <= c <= c1       (c1 − c0) / (maxVal(c) − minVal(c))

Intuition:
We model sel(c0 <= c <= c1) as the probability of getting a "c" value in the interval [c0, c1] among all possible "c" values.

[Figure: axis of c values from minVal(c) to maxVal(c), with the interval [c0, c1] marked]
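The selectivity rules and the scan cost can be sketched in a few lines. This is an illustration under stated assumptions: the type and member names are hypothetical, the range formula is the uniform-distribution estimate suggested by the intuition on the slide, and clamping to [0, 1] is an extra safeguard not mentioned in the source.

```csharp
using System;

// Hypothetical container for the three per-column statistics.
public sealed class ColumnStats
{
    public double MinVal;
    public double MaxVal;
    public long DVal;
}

public static class Selectivity
{
    // sel(c = c0) = 1 / dVal(c)
    public static double Eq(ColumnStats c) { return 1.0 / c.DVal; }

    // sel(c0 <= c <= c1) = (c1 - c0) / (maxVal(c) - minVal(c)), clamped to [0, 1]
    public static double Range(ColumnStats c, double c0, double c1)
    {
        double s = (c1 - c0) / (c.MaxVal - c.MinVal);
        return Math.Max(0.0, Math.Min(1.0, s));
    }

    // sel(p1 AND p2) = sel(p1) * sel(p2)
    public static double And(double s1, double s2) { return s1 * s2; }

    // sel(p1 OR p2) = sel(p1) + sel(p2) - sel(p1 AND p2)
    public static double Or(double s1, double s2) { return s1 + s2 - And(s1, s2); }

    // Card(sigma_p(T)) = sel(p) * Card(T)
    public static double Card(double sel, long cardT) { return sel * cardT; }

    // COST(T) = Card(T) * MEM_ACCESS_COST for a full table scan
    public static double ScanCost(long cardT, double memAccessCost)
    {
        return cardT * memAccessCost;
    }

    public static void Main()
    {
        // Invented statistics: values between 0 and 100, 50 distinct values.
        var age = new ColumnStats { MinVal = 0, MaxVal = 100, DVal = 50 };
        double both = And(Eq(age), Range(age, 20, 30));
        Console.WriteLine(Card(both, 10000));
    }
}
```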



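Consider now a join predicate: T1 ⋈(c1=c2) T2

$$\mathrm{Card}(T_1 \bowtie_{c_1=c_2} T_2) = \min\bigl(\mathrm{dVal}(c_1), \mathrm{dVal}(c_2)\bigr) \cdot \frac{\mathrm{Card}(T_1)}{\mathrm{dVal}(c_1)} \cdot \frac{\mathrm{Card}(T_2)}{\mathrm{dVal}(c_2)}$$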
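A quick worked instance of the join estimate may help. The numbers are invented for illustration: assume Card(T1) = 1000 with dVal(c1) = 100, and Card(T2) = 500 with dVal(c2) = 50.

```latex
\begin{align*}
\mathrm{Card}(T_1 \bowtie_{c_1=c_2} T_2)
  &= \min(100, 50) \cdot \frac{1000}{100} \cdot \frac{500}{50} \\
  &= 50 \cdot 10 \cdot 10 \\
  &= 5000
\end{align*}
```

Read this way, each of the 50 join values that can match is assumed to occur 10 times in T1 and 10 times in T2, giving 100 result rows per value.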

Query Analyzer






Execution Alternatives

Rely on indexes on DataColumns when possible

Example: σ(a=7 ∧ (b+c)<20)

Alternative 1: Full table scan, evaluating a=7 and b+c<20 on every row
Alternative 2: Use the index on column "a" to find the rows with a=7, then evaluate b+c<20 on those rows

[Figure: a sample table with columns c, b, a, repeated for each step of the two alternatives]
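The two alternatives can be sketched on a plain in-memory list. This is not the DataSet/DataView machinery from the talk: the `Row` type, the sample rows and the use of `ILookup` as a stand-in for an index on column "a" are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class Row
{
    public int A, B, C;
    public Row(int a, int b, int c) { A = a; B = b; C = c; }
}

public static class ScanVsIndexDemo
{
    // Invented sample data.
    public static List<Row> SampleRows()
    {
        return new List<Row>
        {
            new Row(7, 1, 3), new Row(2, 3, 76), new Row(7, 14, 8),
            new Row(4, 9, 9), new Row(7, 4, 23), new Row(3, 1, 3),
            new Row(7, 5, 6),
        };
    }

    // Alternative 1: touch every row and evaluate the whole predicate.
    public static List<Row> FullScan(List<Row> rows)
    {
        return rows.Where(r => r.A == 7 && r.B + r.C < 20).ToList();
    }

    // Stand-in for an index on column "a" (built once, reused by many queries).
    public static ILookup<int, Row> BuildIndexOnA(List<Row> rows)
    {
        return rows.ToLookup(r => r.A);
    }

    // Alternative 2: use the index for a = 7, then filter the remaining predicate.
    public static List<Row> IndexThenFilter(ILookup<int, Row> indexOnA)
    {
        return indexOnA[7].Where(r => r.B + r.C < 20).ToList();
    }

    public static void Main()
    {
        List<Row> rows = SampleRows();
        Console.WriteLine(FullScan(rows).Count);
        Console.WriteLine(IndexThenFilter(BuildIndexOnA(rows)).Count);
    }
}
```

Both alternatives return the same rows; the second only evaluates `b + c < 20` on the rows the index returns, which pays off when a=7 is selective and the index already exists.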

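Analyzing Execution Plans
Global vs. Local Execution Plan

EXAMPLE:
[Figure: execution plan tree. A Join over Products and a second Join, which in turn joins Carts with a Filter over Customers. The whole tree is labeled "Global Execution Plan"; the choice at a single Join node (HashJoin? IndexJoin? MergeJoin?) is labeled "Local Execution Plan".]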

Enumeration Architecture
Two phases:
• First phase: join reordering based on estimated cardinalities
• Second phase: choose the best physical implementation for each operator

EXAMPLE: Suppose we analyze a JOIN operator. We evaluate the following JOIN implementations:
• Hash Join
• Merge Join (inputs must be sorted on the join columns)
• Index Join (an index on the inner join column must be available)
• Other possible calculation options

Choose the alternative with the smallest cost
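The second phase can be sketched as "filter the applicable implementations, then take the cheapest". The cost formulas below are placeholders invented for this sketch, not the cost model from the article, and all type and member names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum JoinKind { HashJoin, MergeJoin, IndexJoin }

// Hypothetical description of one join operator's inputs.
public sealed class JoinInput
{
    public long OuterCard;
    public long InnerCard;
    public bool InputsSorted;         // precondition for Merge Join
    public bool InnerIndexAvailable;  // precondition for Index Join
}

public static class JoinChooser
{
    // PLACEHOLDER cost formulas (assumptions, not taken from the source).
    public static double Cost(JoinKind kind, JoinInput j)
    {
        switch (kind)
        {
            case JoinKind.HashJoin:  return j.OuterCard + 2.0 * j.InnerCard;
            case JoinKind.MergeJoin: return j.OuterCard + j.InnerCard;
            case JoinKind.IndexJoin: return j.OuterCard * Math.Log(j.InnerCard, 2);
            default: throw new ArgumentOutOfRangeException(nameof(kind));
        }
    }

    // Keep only the implementations whose preconditions hold, then pick the cheapest.
    public static JoinKind Choose(JoinInput j)
    {
        var candidates = new List<JoinKind> { JoinKind.HashJoin };
        if (j.InputsSorted) candidates.Add(JoinKind.MergeJoin);
        if (j.InnerIndexAvailable) candidates.Add(JoinKind.IndexJoin);
        return candidates.OrderBy(k => Cost(k, j)).First();
    }

    public static void Main()
    {
        var input = new JoinInput
        {
            OuterCard = 1000, InnerCard = 100000,
            InputsSorted = false, InnerIndexAvailable = true,
        };
        Console.WriteLine(Choose(input));
    }
}
```

With these placeholder formulas, a small outer input and an available inner index favor the index join; removing the index leaves the hash join as the only applicable option.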


Query Analysis






Self Tuning Organization
• We want to reach the smallest query execution time
• Indexes can be used to speed up query execution

PROBLEM:
• It is hard to forecast in advance which indexes to build for optimum performance

SOLUTION:
• A continuous monitoring/tuning component that addresses the challenge of choosing and building adequate indexes and statistics automatically


Self Tuning Organization - Example

Consider the following execution plan:

[Figure: execution plan with three sub-plans enclosed in dotted lines]

• The three sub-plans enclosed in dotted lines might be improved if suitable indexes were present
• The selection predicate Name="Pam" over the Customers DataTable can be improved if an index on Customers(Name) is built
• Both hash joins can be improved if indexes I2 and I3 are available, since we can transform a hash join into an index join


Algorithm for automatic index tuning


Index Tuning
High-Level Description:
• Identify a good set of candidate indexes that would improve performance if they were available.
• Later, when the optimized queries are evaluated, we aggregate the relative benefits of both candidate and existing indexes.
• Based on this information, we periodically trigger index creations or deletions, taking into account storage constraints, the overall utility of the resulting indexes, and the cost of creating and maintaining them.



Index tuning algorithm
Notation:
• H: a set of candidate indexes to materialize
• T: task set for query qi
• Ii: either a candidate or an existing index
• δIi: amount by which Ii would speed up query qi

[Figure: the task set holds the pairs (I1, δI1), (I2, δI2), …, (In, δIn); H is initially empty]



Index tuning algorithm
Notation:
• ΔI: value maintained for each index I
• Materialized index: an index that has already been created
• SELECT query: ΔI = ΔI + δI
• UPDATE query: ΔI = ΔI − δI

[Figure: the task set (I1, δI1), (I2, δI2), …, (In, δIn) next to the candidate set H, which now contains I1]

Index Tuning Algorithm

The purpose of ΔI:
• We maintain ΔI on every query evaluation
• If the potential aggregated benefit of materializing a candidate index exceeds its creation cost, we should create it, since we have gathered enough evidence that the index is useful



Index tuning algorithm

Remove "bad" indexes phase
Notation:
• Δmin: minimum Δ value for index I
• Δmax: maximum Δ value for index I
• BI: the cost of creating index I
• Residual(I) = BI − (Δmax − Δ)
  (the "slack" an index has before being deemed "droppable")
  IF (Residual(I) <= 0) THEN Drop(I)
• Net-Benefit(I) = (Δ − Δmin) − BI
  (the benefit from creating the index)
  IF (Net-Benefit(I) >= 0) THEN Add(I)
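The bookkeeping behind these rules can be sketched as a small per-index record. This is a simplified illustration: the class and member names are hypothetical, Δ, Δmin and Δmax all start at 0, and nothing is reset when an index is created or dropped, which the slides do not specify.

```csharp
using System;

// Hypothetical per-index record for the tuning bookkeeping.
public sealed class IndexStats
{
    public double BuildCost;      // BI: cost of creating the index
    public double Delta;          // Δ: running aggregated benefit
    public double DeltaMin;       // smallest Δ seen so far
    public double DeltaMax;       // largest Δ seen so far
    public bool Materialized;     // true once the index has been created

    // SELECT query: Δ = Δ + δ
    public void OnSelect(double benefit) { Delta += benefit; Track(); }

    // UPDATE query: Δ = Δ − δ
    public void OnUpdate(double penalty) { Delta -= penalty; Track(); }

    private void Track()
    {
        DeltaMin = Math.Min(DeltaMin, Delta);
        DeltaMax = Math.Max(DeltaMax, Delta);
    }

    // Residual(I) = BI − (Δmax − Δ)
    public double Residual { get { return BuildCost - (DeltaMax - Delta); } }

    // Net-Benefit(I) = (Δ − Δmin) − BI
    public double NetBenefit { get { return (Delta - DeltaMin) - BuildCost; } }

    public bool ShouldDrop { get { return Materialized && Residual <= 0; } }
    public bool ShouldCreate { get { return !Materialized && NetBenefit >= 0; } }

    public static void Main()
    {
        // Invented workload: three SELECTs that would each gain 4 from the index.
        var idx = new IndexStats { BuildCost = 10 };
        for (int i = 0; i < 3; i++) idx.OnSelect(4);
        Console.WriteLine(idx.ShouldCreate);

        // Once created, two UPDATEs that each cost 5 erase the accumulated benefit.
        idx.Materialized = true;
        idx.OnUpdate(5);
        idx.OnUpdate(5);
        Console.WriteLine(idx.ShouldDrop);
    }
}
```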





Index tuning algorithm

Notation:
• ITM: all the indexes from H whose creation is cost effective
• ITD: a subset of the existing indexes such that:
  • ITD fits in existing memory
  • it is still cost effective to create the new index I after possibly dropping members of ITD

If creating index I is more effective than maintaining the existing indexes in ITD: DROP(ITD) && CREATE(I)

Remove I from H (the set of candidate indexes to materialize)


Experimental Evaluation


Consider the following schema:

[Figure: schema of the tables used by the evaluation queries: Products, Carts, Customers, Categories]

Generated:
• 200,000 products
• 50,000 customers
• 1,000 categories
• 5,000 items in the shopping carts

Possible Indexes
I1 Categories(par_id)
I2 Products(c_id)
I3 Carts(cu_id)
I4 Products(ca_id)
I5 Customers(name)

    checkCarts($1) =
        from p in Products.AsEnumerable()
        join cart in Carts.AsEnumerable()
          on p.Field<int>("id") equals cart.Field<int>("p_id")
        join c in Customers.AsEnumerable()
          on cart.Field<int>("cu_id") equals c.Field<int>("id")
        where c.name == $1
        select new { cart, p }

    browseProducts($1) =
        from p in Products.AsEnumerable()
        join c in Categories.AsEnumerable()
          on p.Field<int>("ca_id") equals c.Field<int>("id")
        where c.par_id == $1
        select p

[Figure: execution plans for evaluation queries]

Experimental Evaluation – Cont.


Generated schedule when tuning was disabled

Experimental Evaluation – Cont.


Generated schedule when tuning was enabled

Summary

We've discussed:
• LINQ: for declarative query formulation
• DataSet: a uniform way of representing in-memory data
• A lightweight optimizer for automatically adjusting query execution strategies

Article's main contribution:
• NOT a new query processing technique
• BUT: careful engineering of traditional database concepts in a new context



LINQ Execution Model


Steps shown in the diagram:
• Compiler merges LINQ extension methods
• Query syntax is converted to function calls and lambda expressions
• Lambda expressions are converted to expression trees
• Compiler finds a query pattern
• Query is executed lazily
• Compiler infers types produced by queries

Annotations shown in the diagram:
• Adds query operations to IEnumerable<T>
• At compile time
• Expressions are evaluated at run-time
• Parsed and type checked at compile-time
• Datasets are strongly typed
• Operations on data sets are strongly typed
• Specialized or base
• Can optimize and re-write query
• Expressions and operations can execute remotely
• At run-time, when results are used
• We can force evaluations (ToArray())
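Lazy execution and forced evaluation can be observed directly in LINQ to Objects. The sketch below uses hypothetical names and invented data: a query defined over a list sees an element added later, while a `ToArray()` snapshot taken earlier does not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyExecutionDemo
{
    // Returns (rows seen by the lazy query, rows in the forced snapshot).
    public static (int lazyCount, int snapshotCount) Run()
    {
        var data = new List<int> { 1, 2, 3, 10 };

        // Defining the query does not run it.
        IEnumerable<int> small = data.Where(x => x < 5);

        // ToArray() forces evaluation right now.
        int[] snapshot = small.ToArray();

        // Change the source after the query was defined.
        data.Add(4);

        // Enumerating the lazy query now re-reads the source.
        return (small.Count(), snapshot.Length);
    }

    public static void Main()
    {
        var result = Run();
        Console.WriteLine(result.lazyCount);
        Console.WriteLine(result.snapshotCount);
    }
}
```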
