Sergei Petrunia, Vicentiu Ciorbaru
Window functions in MariaDB
2
Plan
• What are window functions
– Basic window functions
– Frames
– Window functions and other parts of SQL
• Computing window functions
• Optimizations
4
Scalar functions
select concat(Name, ' in ', Country) from Cities

+------+---------+---------+
| ID   | Name    | Country |
+------+---------+---------+
| 1891 | Peking  | CHN     |
| 3068 | Berlin  | DEU     |
| 3580 | Moscow  | RUS     |
| 3795 | Chicago | USA     |
+------+---------+---------+

Peking in CHN
Berlin in DEU
Moscow in RUS
Chicago in USA
• Compute values based on the current row
5
Aggregate functions
• Compute summary for the group
• Group is collapsed into summary row

select country, sum(Population) as total
from Cities
group by country

+-----------+---------+------------+
| name      | country | population |
+-----------+---------+------------+
| Berlin    | DEU     |    3386667 |
| Frankfurt | DEU     |     643821 |
| Moscow    | RUS     |    8389200 |
| New York  | USA     |    8008278 |
| Chicago   | USA     |    2896016 |
| Seattle   | USA     |     563374 |
+-----------+---------+------------+

+---------+----------+
| country | total    |
+---------+----------+
| DEU     |  4030488 |
| RUS     |  8389200 |
| USA     | 11467668 |
+---------+----------+
6
Window functions
• Function is computed over an ordered partition (=group)
• Groups are not collapsed
select name,
       rank() over (partition by country order by population desc)
from cities

+-----------+---------+------------+
| name      | country | population |
+-----------+---------+------------+
| Berlin    | DEU     |    3386667 |
| Frankfurt | DEU     |     643821 |
| Moscow    | RUS     |    8389200 |
| New York  | USA     |    8008278 |
| Chicago   | USA     |    2896016 |
| Seattle   | USA     |     563374 |
+-----------+---------+------------+

+-----------+------+
| name      | rank |
+-----------+------+
| Berlin    |    1 |
| Frankfurt |    2 |
| Moscow    |    1 |
| New York  |    1 |
| Chicago   |    2 |
| Seattle   |    3 |
+-----------+------+
9
Basic Window Functions
select name, incidents
from support_staff

+----------+-----------+
| name     | incidents |
+----------+-----------+
| Claudio  |        10 |
| Valeriy  |         9 |
| Daniel   |         9 |
| Geoff    |         9 |
| Stephane |         8 |
+----------+-----------+
10
row_number()
select name, incidents,
       row_number() over (order by incidents desc) as ROW_NUM
from support_staff

+----------+-----------+---------+
| name     | incidents | ROW_NUM |
+----------+-----------+---------+
| Claudio  |        10 |       1 |
| Valeriy  |         9 |       2 |
| Daniel   |         9 |       3 |
| Geoff    |         9 |       4 |
| Stephane |         8 |       5 |
+----------+-----------+---------+
11
rank()
select name, incidents,
       row_number() over (order by incidents desc) as ROW_NUM,
       rank() over (order by incidents desc) as RANK
from support_staff

+----------+-----------+---------+------+
| name     | incidents | ROW_NUM | RANK |
+----------+-----------+---------+------+
| Claudio  |        10 |       1 |    1 |
| Valeriy  |         9 |       2 |    2 |
| Daniel   |         9 |       3 |    2 |
| Geoff    |         9 |       4 |    2 |
| Stephane |         8 |       5 |    5 |
+----------+-----------+---------+------+
12
dense_rank()
select name, incidents,
       row_number() over (order by incidents desc) as ROW_NUM,
       rank() over (order by incidents desc) as RANK,
       dense_rank() over (order by incidents desc) as DENSE_R
from support_staff

+----------+-----------+---------+------+---------+
| name     | incidents | ROW_NUM | RANK | DENSE_R |
+----------+-----------+---------+------+---------+
| Claudio  |        10 |       1 |    1 |       1 |
| Valeriy  |         9 |       3 |    2 |       2 |
| Daniel   |         9 |       4 |    2 |       2 |
| Geoff    |         9 |       2 |    2 |       2 |
| Stephane |         8 |       5 |    5 |       3 |
+----------+-----------+---------+------+---------+
13
ntile(n)
select name, incidents,
       row_number() over (order by incidents desc) as ROW_NUM,
       rank() over (order by incidents desc) as RANK,
       dense_rank() over (order by incidents desc) as DENSE_R,
       ntile(4) over (order by incidents desc) as QUARTILE
from support_staff

+----------+-----------+---------+------+---------+----------+
| name     | incidents | ROW_NUM | RANK | DENSE_R | QUARTILE |
+----------+-----------+---------+------+---------+----------+
| Claudio  |        10 |       1 |    1 |       1 |        1 |
| Valeriy  |         9 |       2 |    2 |       2 |        1 |
| Daniel   |         9 |       3 |    2 |       2 |        2 |
| Geoff    |         9 |       4 |    2 |       2 |        3 |
| Stephane |         8 |       5 |    5 |       3 |        4 |
+----------+-----------+---------+------+---------+----------+
14
Conclusions so far
• Window functions are similar to aggregates
• Computed on (current_row, ordered_list(window_rows))
• Can compute the relative standing of a row with respect to other rows
• RANK, DENSE_RANK, ...
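To make the semantics of the ranking functions concrete, here is a minimal Python sketch that computes ROW_NUMBER, RANK, DENSE_RANK and NTILE over one already-ordered partition. The function name `ranking_functions` is hypothetical and this is not how MariaDB implements them; it only mirrors the values in the support_staff tables, where ties share a rank and NTILE gives the extra row to the first bucket.

```python
def ranking_functions(values, buckets):
    """Return (value, row_number, rank, dense_rank, ntile) per row.

    `values` must already be sorted in the window's ORDER BY order.
    Hypothetical helper for illustration only.
    """
    n = len(values)
    # NTILE: the first (n % buckets) buckets receive one extra row.
    base, extra = divmod(n, buckets)
    big_rows = extra * (base + 1)  # rows that live in the larger buckets
    out = []
    rank = dense = 0
    prev = object()  # sentinel that compares unequal to every real value
    for i, v in enumerate(values):
        row_number = i + 1
        if v != prev:
            rank = row_number   # rank jumps to the row number on a new value
            dense += 1          # dense rank grows by one, leaving no gaps
            prev = v
        if i < big_rows:
            ntile = i // (base + 1) + 1
        else:
            ntile = extra + (i - big_rows) // base + 1
        out.append((v, row_number, rank, dense, ntile))
    return out


incidents = [10, 9, 9, 9, 8]  # support_staff, ordered by incidents desc
rows = ranking_functions(incidents, 4)
for row in rows:
    print(row)
```

Because the three rows with 9 incidents are peers, their ROW_NUMBER values could be assigned in any order; RANK and DENSE_RANK do not depend on that order.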
16
Framed window functions
• Some Window Functions use FRAMES
– e.g. Aggregates that are used as window functions
• Window function is computed on rows in the frame.
• Frame is inside PARTITION BY
• Frame moves with the current row
• There are various frame types
17
Smoothing Noisy Data
• Noisy data acquisition solution
SELECT time, raw_data FROM sensor_data;
18
Smoothing Noisy Data
• Noisy data acquisition solution
SELECT time, raw_data,
       AVG(raw_data) OVER ( )
FROM sensor_data;
19
Smoothing Noisy Data
• Noisy data acquisition solution
SELECT time, raw_data,
       AVG(raw_data) OVER ( ORDER BY time )
FROM sensor_data;
20
Smoothing Noisy Data
• Noisy data acquisition solution
SELECT time, raw_data,
       AVG(raw_data) OVER ( ORDER BY time
                            ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING )
FROM sensor_data;
21
Smoothing Noisy Data
• Noisy data acquisition solution
SELECT time, raw_data,
       AVG(raw_data) OVER ( ORDER BY time
                            ROWS BETWEEN 6 PRECEDING AND 6 FOLLOWING )
FROM sensor_data;
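What a ROWS BETWEEN k PRECEDING AND k FOLLOWING frame does to AVG can be sketched in a few lines of Python. The function name `moving_average` and the sample readings are made up for illustration; the point is that the frame is clipped at the partition edges, so the first and last rows average over fewer values.

```python
def moving_average(values, k):
    """Centered moving average, like
    AVG(x) OVER (ORDER BY t ROWS BETWEEN k PRECEDING AND k FOLLOWING).

    `values` must already be in ORDER BY order. Illustrative sketch only.
    """
    out = []
    for i in range(len(values)):
        # The frame is clipped at the start and end of the partition.
        frame = values[max(0, i - k): i + k + 1]
        out.append(sum(frame) / len(frame))
    return out


raw = [1.0, 5.0, 3.0, 7.0, 4.0]   # made-up sensor readings
smoothed = moving_average(raw, 1)
print(smoothed)
```

A larger k (6 instead of 3 in the queries above) averages over more neighbours and so smooths more aggressively, at the cost of flattening genuine short spikes.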
22
Account balance statement
• Generate balance sheet for bank account.
• Incoming transactions.
• Outgoing transactions.
select tid, date, amount
from transactions
where account_id = 12345;

+-----+----------+--------+
| tid | date     | amount |
+-----+----------+--------+
|   1 | 20160401 |   2000 |
|   2 | 20160402 |  -30.5 |
|   3 | 20160404 |  -45.5 |
|   4 | 20160405 | -125.5 |
|   5 | 20160406 |  100.3 |
+-----+----------+--------+
24
Account balance statement
SELECT tid, date, amount,
       ( SELECT SUM(amount)
         FROM transactions t
         WHERE t.date <= cur.date
           AND t.account_id = 12345 ) AS balance
FROM transactions cur
WHERE cur.account_id = 12345;

+-----+----------+--------+----------+
| tid | date     | amount | balance  |
+-----+----------+--------+----------+
|   1 | 20160401 |   2000 |     2000 |
|   2 | 20160402 |  -30.5 |   1969.5 |
|   3 | 20160404 |  -45.5 |     1924 |
|   4 | 20160405 | -125.5 |   1798.5 |
|   5 | 20160406 |  100.3 |   1898.8 |
+-----+----------+--------+----------+
25
Account balance statement
SELECT tid, date, amount,
       SUM(amount) OVER (ORDER BY date
                         ROWS BETWEEN UNBOUNDED PRECEDING
                                  AND CURRENT ROW) AS balance
FROM transactions
WHERE account_id = 12345;

+-----+----------+--------+----------+
| tid | date     | amount | balance  |
+-----+----------+--------+----------+
|   1 | 20160401 |   2000 |     2000 |
|   2 | 20160402 |  -30.5 |   1969.5 |
|   3 | 20160404 |  -45.5 |     1924 |
|   4 | 20160405 | -125.5 |   1798.5 |
|   5 | 20160406 |  100.3 |   1898.8 |
+-----+----------+--------+----------+
26
Account balance statement
• How do queries compare?
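+--------+----------------+------------------+
| # Rows | Regular SQL    | Window Functions |
+--------+----------------+------------------+
|    100 | 3.72 sec       | 0.01 sec         |
|    500 | 30.04 sec      | 0.01 sec         |
|   1000 | 59.6 sec       | 0.02 sec         |
|   2000 | 1 min 59 sec   | 0.03 sec         |
|   4000 | 4 min 1 sec    | 0.04 sec         |
|  16000 | 18 min 26 sec  | 0.18 sec         |
+--------+----------------+------------------+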
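The two query shapes can be mimicked in Python to show that they agree on the result while doing very different amounts of work. This is only a sketch: the function names are hypothetical, the data is the five-row transactions table from the slides, and it does not attempt to reproduce the measured timings above. The subquery style re-sums all earlier rows for every row, while the window style keeps one running total.

```python
from decimal import Decimal
from itertools import accumulate

# (date, amount) rows from the transactions table in the slides
transactions = [
    (20160401, Decimal("2000")),
    (20160402, Decimal("-30.5")),
    (20160404, Decimal("-45.5")),
    (20160405, Decimal("-125.5")),
    (20160406, Decimal("100.3")),
]


def balance_subquery_style(rows):
    """Like the correlated subquery: for each row, re-scan every row."""
    return [sum(a for d, a in rows if d <= date) for date, _ in rows]


def balance_window_style(rows):
    """Like SUM(...) OVER (ORDER BY date ...): sort once, then one pass."""
    ordered = sorted(rows)
    return list(accumulate(a for _, a in ordered))


print(balance_subquery_style(transactions))
print(balance_window_style(transactions))
```

Both functions assume one row per date, as in the sample data; with several transactions on the same date the subquery form would give all of them the same balance.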
27
RANGE-type frames
• Useful when interval of interest has multiple/missing rows
• ORDER BY column -- one numeric column
• RANGE n PRECEDING
– rows with R.column >= (current_row.column - n)
• RANGE n FOLLOWING
– rows with R.column <= (current_row.column + n)
• CURRENT ROW
– current row and rows with R.column = current_row.column
32
RANGE-type frames
• Expenses from today and yesterday:
select *,
       sum(amount) over (order by exp_date
                         range between 1 preceding and current row) as sum
from expenses

+----------+-------+--------+------+
| exp_date | name  | amount | sum  |
+----------+-------+--------+------+
| 20160407 | bus   |      4 |    4 |
| 20160409 | beer  |      2 |    2 |
| 20160410 | wine  |      4 |   18 |
| 20160410 | snack |     12 |   18 |
+----------+-------+--------+------+
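The sums in the expenses example follow from comparing ORDER BY values rather than counting rows. A small Python sketch, with the hypothetical helper name `range_sum`, reproduces them: for each row the frame holds every row whose key lies between (current key - n) and the current key, which includes peers with the same key.

```python
def range_sum(rows, preceding):
    """SUM(amount) OVER (ORDER BY key
                         RANGE BETWEEN <preceding> PRECEDING AND CURRENT ROW).

    `rows` is a list of (key, amount) pairs. Illustrative sketch only;
    it re-scans all rows per row instead of moving a frame.
    """
    out = []
    for key, _ in rows:
        # CURRENT ROW in a RANGE frame also covers peers with an equal key.
        out.append(sum(a for k, a in rows if key - preceding <= k <= key))
    return out


# exp_date is stored as a plain integer, exactly as in the slides, so
# "1 preceding" means "integer value minus one", not calendar arithmetic.
expenses = [(20160407, 4), (20160409, 2), (20160410, 4), (20160410, 12)]
sums = range_sum(expenses, 1)
print(sums)
```

With a ROWS frame the wine and snack rows would get different sums, because ROWS counts physical rows and ignores the fact that both share the date 20160410.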
33
Date columns with RANGE-type frames
• Date columns and temporal intervals (MDEV-9727)
AVG(value) OVER (ORDER BY date_col RANGE BETWEEN INTERVAL 1 MONTH PRECEDING AND INTERVAL 1 MONTH FOLLOWING)
• SQL Standard allows this
• Not supported by PostgreSQL or MS SQL Server
• Intend to support in MariaDB.
34
FRAME syntax
• ROWS|RANGE PRECEDING|FOLLOWING:
35
Frames summary
• Some window functions use frames
– e.g. Aggregate functions used as window functions
• Frame moves with the current row
• RANGE/ROWS-type frames
– MariaDB supports all kinds
• Useful for
– Cumulative sums
– Running averages
– Getting aggregates without doing GROUP BY
36
The Island problem
• Given a set of ordered integers, find the start and end of sequences that have no missing numbers.
Ex: 2, 3, 10, 11, 12, 15, 16, 17
• A common problem, with plenty of use cases:
– Used in sales to identify activity periods.
– Detecting outages.
– Stock market analysis.
37
The Island problem
SELECT value
FROM islands
ORDER BY value;

+-------+
| value |
+-------+
|     2 |
|     3 |
|    10 |
|    11 |
|    12 |
|    15 |
|    16 |
|    17 |
+-------+

+-------------+-----------+
| start_range | end_range |
+-------------+-----------+
|           2 |         3 |
|          10 |        12 |
|          15 |        17 |
+-------------+-----------+
39
The Island problem
SELECT value,
       (SELECT ??? ) AS grp
FROM islands
ORDER BY value;

+-------+------+
| value | grp  |
+-------+------+
|     2 | a    |
|     3 | a    |
|    10 | b    |
|    11 | b    |
|    12 | b    |
|    15 | c    |
|    16 | c    |
|    17 | c    |
+-------+------+

SELECT MIN(value) AS start_range,
       MAX(value) AS end_range
FROM islands
GROUP BY grp;

+-------------+-----------+
| start_range | end_range |
+-------------+-----------+
|           2 |         3 |
|          10 |        12 |
|          15 |        17 |
+-------------+-----------+
40
The Island problem – generating the groups
SELECT value,
       (SELECT ??? ) AS grp
FROM islands
ORDER BY value;

+-------+------+
| value | grp  |
+-------+------+
|     2 |    3 |
|     3 |    3 |
|    10 |   12 |
|    11 |   12 |
|    12 |   12 |
|    15 |   17 |
|    16 |   17 |
|    17 |   17 |
+-------+------+
42
The Island problem – generating the groups
SELECT value,
       ( SELECT MIN(B.value)
         FROM islands AS B
         WHERE B.value >= A.value
           AND NOT EXISTS ( SELECT *
                            FROM islands AS C
                            WHERE C.value = B.value + 1 ) ) AS grp
FROM islands AS A
ORDER BY value;

+-------+------+
| value | grp  |
+-------+------+
|     2 |    3 |
|     3 |    3 |
|    10 |   12 |
|    11 |   12 |
|    12 |   12 |
|    15 |   17 |
|    16 |   17 |
|    17 |   17 |
+-------+------+
43
The Island problem – generating the groups
SELECT value,
       ROW_NUMBER() OVER (ORDER BY value) AS grp
FROM islands AS A
ORDER BY value;

+-------+------+
| value | grp  |
+-------+------+
|     2 |    1 |
|     3 |    2 |
|    10 |    3 |
|    11 |    4 |
|    12 |    5 |
|    15 |    6 |
|    16 |    7 |
|    17 |    8 |
+-------+------+
44
The Island problem – generating the groups
SELECT value,
       value - ROW_NUMBER() OVER (ORDER BY value) AS grp
FROM islands AS A
ORDER BY value;

+-------+------+
| value | grp  |
+-------+------+
|     2 |    1 |
|     3 |    1 |
|    10 |    7 |
|    11 |    7 |
|    12 |    7 |
|    15 |    9 |
|    16 |    9 |
|    17 |    9 |
+-------+------+
45
The Island problem – generating the groups
SELECT value,
       value - ROW_NUMBER() OVER (ORDER BY value) AS grp
FROM islands AS A
ORDER BY value;

SELECT value,
       ( SELECT MIN(B.value)
         FROM islands AS B
         WHERE B.value >= A.value
           AND NOT EXISTS ( SELECT *
                            FROM islands AS C
                            WHERE C.value = B.value + 1 ) ) AS grp
FROM islands AS A
ORDER BY value;
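The reason value - ROW_NUMBER() works as a group key is that inside a run of consecutive integers both the value and the row number grow by one per row, so their difference stays constant; every gap makes the difference jump. The Python sketch below walks through the same idea end to end. The function name `islands` is hypothetical and the sketch assumes the values are distinct integers.

```python
from itertools import groupby


def islands(values):
    """Return (start, end) for each run of consecutive integers.

    Mirrors GROUP BY (value - ROW_NUMBER() OVER (ORDER BY value)).
    Assumes distinct integers. Illustrative sketch only.
    """
    ordered = sorted(values)
    # Pair each value with its group key: value minus its 1-based row number.
    keyed = [(v - rn, v) for rn, v in enumerate(ordered, start=1)]
    result = []
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        members = [v for _, v in group]
        result.append((members[0], members[-1]))  # MIN(value), MAX(value)
    return result


ranges = islands([2, 3, 10, 11, 12, 15, 16, 17])
print(ranges)
```

For the sample data the group keys are 1, 1, 7, 7, 7, 9, 9, 9, matching the grp column produced by the window-function query.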
47
Window Functions and other SQL constructs
• Can have WIN_FUNC(AGG_FUNC)
[Diagram: query processing order: Join -> Group -> Check HAVING -> Compute Window Functions -> DISTINCT -> Sort + Limit]
• Window functions can appear in
– SELECT list
– ORDER BY clause
48
Filtering on window function value
• How to filter for e.g. RANK() < 3 ? Use a subquery.
select name, incidents, row_number() over (order by incidents desc) as ROW_NUM from support_staff
49
Filtering on window function value
• How to filter for e.g. RANK() < 3 ? Use a subquery.
select *
from ( select name, incidents,
              row_number() over (order by incidents desc) as ROW_NUM
       from support_staff ) as TBL
where TBL.ROW_NUM < 3
51
Computing Window functions
[Diagram: three tables -> join -> group]
• Join, grouping
• All partitions are mixed together
52
Computing Window functions
• Put join output into a temporary table
• Sort it by (PARTITION BY clause, ORDER BY clause)
[Diagram: three tables -> join -> group -> sort (by PARTITION BY clause, ORDER BY clause)]
53
Computing window function for a row
• Can look at
– Current row
– Rows in the partition, ordered
• Can compute the window function
• Computing values individually would be expensive
– O(#rows_in_partition ^ 2)
54
“Streamable” window functions
• ROW_NUMBER, RANK, DENSE_RANK, ...
– Can walk down and compute values on the fly
• NTILE, CUME_DIST, PERCENT_RANK
– Get #rows in the partition
– Then walk down and compute values on the fly.
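The "count first, then walk down" idea for PERCENT_RANK and CUME_DIST can be sketched as follows. This is an assumption-laden illustration, not MariaDB's code: the function name is hypothetical, the input is one partition already sorted in ORDER BY order, and peers (equal values) are handled as a block because CUME_DIST needs to know where the current peer group ends.

```python
from itertools import groupby


def percent_rank_and_cume_dist(sorted_values):
    """Return (value, percent_rank, cume_dist) per row of one partition.

    Pass 1 gets the partition size; pass 2 walks the rows once.
    Illustrative sketch only.
    """
    n = len(sorted_values)          # pass 1: number of rows in the partition
    out = []
    seen = 0                        # rows consumed so far
    for value, peers in groupby(sorted_values):   # pass 2: one walk down
        size = len(list(peers))
        rank = seen + 1             # RANK of every row in this peer group
        seen += size
        percent_rank = (rank - 1) / (n - 1) if n > 1 else 0.0
        cume_dist = seen / n        # rows up to and including the peers
        out.extend([(value, percent_rank, cume_dist)] * size)
    return out


result = percent_rank_and_cume_dist([10, 9, 9, 9, 8])
for row in result:
    print(row)
```

ROW_NUMBER, RANK and DENSE_RANK need only the second loop, since none of them depends on the partition size.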
56
Computing framed window functions
[Diagram: the frame slides down one row; the value 20 leaves the frame, the value 10 enters it, and $total becomes $total + 10 - 20]
• window_func(rows_in_the_frame)
• Frame moves with the current row
• Some functions allow adding and removing rows
– SUM, COUNT, AVG, BIT_OR, BIT_*
• Can compute efficiently
– Done in MariaDB 10.2.0.
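The add-and-remove idea can be sketched in Python for SUM over a ROWS frame. The function name `sliding_sum` is hypothetical and this is not MariaDB's implementation; it only shows that each row enters the running total once and leaves it once, so the whole partition costs linear time regardless of frame size.

```python
def sliding_sum(values, preceding, following):
    """SUM(x) OVER (ROWS BETWEEN <preceding> PRECEDING
                             AND <following> FOLLOWING).

    `values` is one partition in ORDER BY order. Illustrative sketch only.
    """
    n = len(values)
    out = []
    total = 0
    lo = hi = 0                     # the frame is values[lo:hi]
    for i in range(n):
        new_lo = max(0, i - preceding)
        new_hi = min(n, i + following + 1)
        while hi < new_hi:          # rows entering at the bottom of the frame
            total += values[hi]
            hi += 1
        while lo < new_lo:          # rows leaving at the top of the frame
            total -= values[lo]
            lo += 1
        out.append(total)
    return out


sums = sliding_sum([20, 21, 19, 10, 5], 1, 1)
print(sums)
```

COUNT and AVG fit the same pattern by tracking a row count next to the total.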
57
Some aggregates make streaming hard
[Diagram: a frame sliding over the values 20, 21, 19, 10; MAX=21 at one frame position, MAX=? after the frame moves]
• MIN, MAX
• Need to track the whole window
– Doable for small frames
● Can also re-calculate
– Hard for bigger frames
• Are big frames used?
• Not implemented yet.
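One well-known way to track a sliding MAX without re-scanning the frame is a monotonic deque. The slides do not propose this technique and nothing here reflects MariaDB's plans; the sketch below, with the hypothetical name `sliding_max` and a frame of ROWS BETWEEN n PRECEDING AND CURRENT ROW, is only meant to show why MAX cannot simply "subtract" a departing row and what extra state is needed instead.

```python
from collections import deque


def sliding_max(values, preceding):
    """MAX(x) OVER (ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW).

    Keeps indices of candidate maxima in a deque whose values are
    strictly decreasing. Illustrative sketch only.
    """
    out = []
    candidates = deque()
    for i, v in enumerate(values):
        # Older rows that are not larger than v can never be the max again.
        while candidates and values[candidates[-1]] <= v:
            candidates.pop()
        candidates.append(i)
        # The frame start advances by at most one row per step, so at most
        # one candidate at the front can fall out of the frame.
        if candidates[0] < i - preceding:
            candidates.popleft()
        out.append(values[candidates[0]])
    return out


maxima = sliding_max([20, 21, 19, 10, 5], 2)
print(maxima)
```

The deque can still grow to the size of the frame (for example on strictly decreasing input), which is in line with the point above that the whole window may need to be tracked.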
58
LEAD and LAG issues
• LAG(expr, N) – “expr N rows before”
– LAG(expr,1) - previous
• Non-constant N?
• Lookups to arbitrary rows
– Expensive
– Worth doing at all?
[Diagram: LAG(..., 2) pointing from the current row to the row two positions before it]
59
Summary for computing window functions
• Sort by (partition_by, order_by)
• Then walk through and compute window functions
• Most functions can be computed on-the-fly
• Framed window functions require moving the frame
– SUM, COUNT, AVG, ...: can update the value as the frame moves
– MIN, MAX: more complex
• LEAD, LAG may require random reads
61
Optimizations
Do window functions optimizations matter?
62
A query with window functions

select
   'web' as channel
  ,web.item
  ,web.return_ratio
  ,web.return_rank
  ,web.currency_rank
from (
  select
     item
    ,return_ratio
    ,currency_ratio
    ,rank() over (order by return_ratio) as return_rank
    ,rank() over (order by currency_ratio) as currency_rank
  from (
    select
       ws.ws_item_sk as item
      ,(cast(sum(coalesce(wr.wr_return_quantity,0)) as decimal(15,4)) /
        cast(sum(coalesce(ws.ws_quantity,0)) as decimal(15,4))) as return_ratio
      ,(cast(sum(coalesce(wr.wr_return_amt,0)) as decimal(15,4)) /
        cast(sum(coalesce(ws.ws_net_paid,0)) as decimal(15,4))) as currency_ratio
    from
       web_sales ws left outer join web_returns wr
         on (ws.ws_order_number = wr.wr_order_number and
             ws.ws_item_sk = wr.wr_item_sk)
      ,date_dim
    where
          wr.wr_return_amt > 10000
      and ws.ws_net_profit > 1
      and ws.ws_net_paid > 0
      and ws.ws_quantity > 0
      and ws_sold_date_sk = d_date_sk
      and ws_sold_date_sk between 2452245 and 2452275
      and d_year = 2001
      and d_moy = 12
    group by ws.ws_item_sk
  ) in_web
) web
where
     web.return_rank <= 10
  or web.currency_rank <= 10

[Diagram: query plan: three tables -> join -> group -> sort -> window functions]
63
Still, there are optimizations
• Doing fewer sorts
• Condition pushdown through PARTITION BY
64
Doing fewer sorts
[Diagram: three tables -> join -> sort]

select rank() over (order by incidents),
       ntile(4) over (order by incidents),
       rank() over (order by incidents, join_date)
from support_staff

• Each window function requires a sort
• Can avoid sorting if using an index (MariaDB: not yet)
• Identical PARTITION/ORDER BY must share the sort step
• Compatible PARTITION/ORDER BY may share the sort step
– MariaDB: yes (but has bugs at the moment)
– PostgreSQL: yes, limited
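One simple notion of "compatible" is that the sort key of one window is a prefix of the sort key of another, so a single sort by the longer key serves both. The Python sketch below groups window specifications that way. It is a simplification under stated assumptions, not MariaDB's or PostgreSQL's planner logic: the function name `plan_sorts` is hypothetical, each specification is flattened to a tuple of column names, and ASC/DESC and the freedom to reorder PARTITION BY columns are ignored.

```python
def plan_sorts(window_specs):
    """Group window specifications so that each group needs one sort.

    `window_specs` is a list of tuples of sort columns (PARTITION BY
    columns followed by ORDER BY columns). A spec joins an existing sort
    when its columns are a prefix of that sort's key.
    Illustrative sketch only.
    """
    sorts = []  # each entry: [sort_key, [specs served by that sort]]
    # Visit the longest keys first so shorter ones can attach to them.
    for spec in sorted(window_specs, key=len, reverse=True):
        for entry in sorts:
            if entry[0][:len(spec)] == spec:
                entry[1].append(spec)
                break
        else:
            sorts.append([spec, [spec]])
    return sorts


specs = [
    ("incidents",),               # rank() over (order by incidents)
    ("incidents",),               # ntile(4) over (order by incidents)
    ("incidents", "join_date"),   # rank() over (order by incidents, join_date)
]
plan = plan_sorts(specs)
print(len(plan), "sort(s):", plan)
```

For the three window functions in the query above this yields a single sort by (incidents, join_date); rows that are peers on incidents stay adjacent, so the functions ordered by incidents alone can still detect their peer groups.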
65
Condition pushdown through PARTITION BY
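select *
from ( select name, dept,
              rank() over (partition by dept order by incidents desc) as R
       from staff ) as TBL
where dept = 'Support'

[Diagram: without pushdown, all of staff (Development, Consulting, Support) is sorted; with the condition pushed down, only the Support partition is sorted]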
66
Condition pushdown into PARTITION BY
• Other databases have this
• In MariaDB, requires:
– MDEV-9197: Pushdown conditions into non-mergeable views/derived tables
– MDEV-7486: Condition pushdown from HAVING into WHERE
• These are 10.2 tasks too
• Considering it
67
Optimizations summary
• Not much need/room for optimizations in many cases
– Window function is a small part of the query
• Optimizations to have
– Share the sort across window functions (have [bugs])
– Condition pushdown through PARTITION BY
● Depends on another 10.2 task
● Want to have it
68
Conclusions
• Window functions coming in MariaDB 10.2!
• Already have ~SQL:2003 level features
• Intend to have ~SQL:2011 features
– Comparable with “big 3” databases
• Work on optimizations is in progress
– Send us your cases.
69
Thanks
Q & A