Turbocharge your Data Warehouse Queries with Columnstore Indexes
Len Wyatt
Program Manager
Microsoft Corporation
DBI313
Agenda
• Motivation
• How columnstores speed up queries
• Loading columnstores
• Optimizing database and index design
• Optimizing queries
demo
Columnstores speed up queries
Overview of Columnstore Index
How ColumnStore Indexes Speed Up Queries
ColumnStore indexes store data column-wise
• Each page stores data from a single column
• Highly compressed
  • About 2x better than PAGE compression
  • More data fits in memory
• Each column can be accessed independently
  • Fetch only columns needed
  • Can dramatically decrease IO
Heaps, B-trees store data row-wise
[Figure: a table's columns C1 to C6, each stored on its own pages]
Columnstore Index Structure
Column Segment
• Segment contains values from one column for a set of rows
• Segments for the same set of rows comprise a row group
• Segments are compressed
• Each segment stored in a separate LOB
• Segment is unit of transfer between disk and memory
[Figure: columns C1 to C6 divided into segments; the segments covering the same rows form a row group]
Columnstore Index Example
OrderDateKey ProductKey StoreKey RegionKey Quantity SalesAmount
20101107 106 01 1 6 30.00
20101107 103 04 2 1 17.00
20101107 109 04 2 2 20.00
20101107 103 03 2 1 17.00
20101107 106 05 3 4 20.00
20101108 106 02 1 5 25.00
20101108 102 02 1 1 14.00
20101108 106 03 2 5 25.00
20101108 109 01 1 1 10.00
20101109 106 04 2 4 20.00
20101109 106 04 2 5 25.00
20101109 103 01 1 1 17.00
Horizontally Partition (Row Groups)
Row group 1:
OrderDateKey ProductKey StoreKey RegionKey Quantity SalesAmount
20101107 106 01 1 6 30.00
20101107 103 04 2 1 17.00
20101107 109 04 2 2 20.00
20101107 103 03 2 1 17.00
20101107 106 05 3 4 20.00
20101108 106 02 1 5 25.00
Row group 2:
OrderDateKey ProductKey StoreKey RegionKey Quantity SalesAmount
20101108 102 02 1 1 14.00
20101108 106 03 2 5 25.00
20101108 109 01 1 1 10.00
20101109 106 04 2 4 20.00
20101109 106 04 2 5 25.00
20101109 103 01 1 1 17.00
Vertically Partition (Segments)
Row group 1:
OrderDateKey: 20101107, 20101107, 20101107, 20101107, 20101107, 20101108
ProductKey: 106, 103, 109, 103, 106, 106
StoreKey: 01, 04, 04, 03, 05, 02
RegionKey: 1, 2, 2, 2, 3, 1
Quantity: 6, 1, 2, 1, 4, 5
SalesAmount: 30.00, 17.00, 20.00, 17.00, 20.00, 25.00
Row group 2:
OrderDateKey: 20101108, 20101108, 20101108, 20101109, 20101109, 20101109
ProductKey: 102, 106, 109, 106, 106, 103
StoreKey: 02, 03, 01, 04, 04, 01
RegionKey: 1, 2, 1, 2, 2, 1
Quantity: 1, 5, 1, 4, 5, 1
SalesAmount: 14.00, 25.00, 10.00, 20.00, 25.00, 17.00
Compress Each Segment*
Some segments will compress more than others
[Figure: the twelve segments of the two row groups after compression]
*Encoding and reordering not shown
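The three steps above (row groups, segments, per-segment compression) can be sketched as a toy Python model. This is only an illustration, not SQL Server's storage format: the helper names are hypothetical, StoreKey is stored as an integer, and run-length encoding is an assumed stand-in for the real encoding, which the slides do not show.

```python
# Toy model: split rows into row groups, then into one segment per column.
COLUMNS = ["OrderDateKey", "ProductKey", "StoreKey",
           "RegionKey", "Quantity", "SalesAmount"]
ROWS = [
    (20101107, 106, 1, 1, 6, 30.00), (20101107, 103, 4, 2, 1, 17.00),
    (20101107, 109, 4, 2, 2, 20.00), (20101107, 103, 3, 2, 1, 17.00),
    (20101107, 106, 5, 3, 4, 20.00), (20101108, 106, 2, 1, 5, 25.00),
    (20101108, 102, 2, 1, 1, 14.00), (20101108, 106, 3, 2, 5, 25.00),
    (20101108, 109, 1, 1, 1, 10.00), (20101109, 106, 4, 2, 4, 20.00),
    (20101109, 106, 4, 2, 5, 25.00), (20101109, 103, 1, 1, 1, 17.00),
]

def to_row_groups(rows, group_size):
    """Horizontal partitioning: consecutive runs of rows."""
    return [rows[i:i + group_size] for i in range(0, len(rows), group_size)]

def to_segments(row_group):
    """Vertical partitioning: one list of values per column."""
    return {name: list(values)
            for name, values in zip(COLUMNS, zip(*row_group))}

def run_length_encode(values):
    """Assumed stand-in for compression: [value, run length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

row_groups = [to_segments(group) for group in to_row_groups(ROWS, 6)]
for number, segments in enumerate(row_groups, start=1):
    for name, segment in segments.items():
        runs = run_length_encode(segment)
        print(f"row group {number} {name}: {len(segment)} values, {len(runs)} runs")
```

Segments with long runs of repeated values (OrderDateKey here) shrink far more than segments whose values change on almost every row, which is the sense in which some segments compress more than others.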
Fetch Only Needed Columns
    SELECT ProductKey, SUM(SalesAmount)
    FROM SalesTable
    WHERE OrderDateKey < 20101108
    GROUP BY ProductKey
[Figure: the twelve segments of the two row groups; the query reads only the OrderDateKey, ProductKey, and SalesAmount segments and skips StoreKey, RegionKey, and Quantity]
Fetch Only Needed Segments
    SELECT ProductKey, SUM(SalesAmount)
    FROM SalesTable
    WHERE OrderDateKey < 20101108
    GROUP BY ProductKey
[Figure: the twelve segments of the two row groups; among the needed columns, only the first row group's segments are read, because every OrderDateKey value in the second row group (20101108 to 20101109) fails the filter]
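A minimal sketch of column and segment elimination for the query above, assuming each segment's minimum value is available as metadata (computed on the fly here for brevity). The function name and data layout are hypothetical and only model the idea.

```python
from collections import defaultdict

# Only the three columns the query touches are modelled (column elimination).
ROW_GROUPS = [
    {"OrderDateKey": [20101107] * 5 + [20101108],
     "ProductKey": [106, 103, 109, 103, 106, 106],
     "SalesAmount": [30.0, 17.0, 20.0, 17.0, 20.0, 25.0]},
    {"OrderDateKey": [20101108] * 3 + [20101109] * 3,
     "ProductKey": [102, 106, 109, 106, 106, 103],
     "SalesAmount": [14.0, 25.0, 10.0, 20.0, 25.0, 17.0]},
]

def sales_by_product_before(row_groups, cutoff):
    """SUM(SalesAmount) per ProductKey where OrderDateKey < cutoff."""
    totals = defaultdict(float)
    groups_scanned = 0
    for group in row_groups:
        dates = group["OrderDateKey"]
        # Segment elimination: if even the smallest date fails the filter,
        # no row in this row group can qualify, so skip it entirely.
        if min(dates) >= cutoff:
            continue
        groups_scanned += 1
        for date, product, amount in zip(dates, group["ProductKey"],
                                         group["SalesAmount"]):
            if date < cutoff:
                totals[product] += amount
    return dict(totals), groups_scanned

totals, groups_scanned = sales_by_product_before(ROW_GROUPS, 20101108)
print(totals, groups_scanned)
```

With the cutoff 20101108 the second row group is never scanned, since its smallest OrderDateKey already equals the cutoff.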
Batch Mode Speeds Up Queries
Biggest advancement in SQL Server query processing in years…
• Data moves as a batch through query plan operators
• Minimizes instructions per row
• Takes advantage of cache structures
• Highly efficient algorithms
• Better parallelism
Batch mode processing
• Process ~1000 rows at a time
• Batch stored in vector form
• Optimized to fit in L1 cache
• Vector operators implemented
  • Filter, hash join, hash aggregation
• Greatly reduced CPU time (7 to 40X)
[Figure: a batch object made of column vectors plus a bitmap of qualifying rows]
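The batch object in the figure can be modelled roughly as follows. This is a hedged illustration of the idea only: the dictionary layout, function names, and the tiny batch size are assumptions, and it says nothing about how SQL Server implements its operators.

```python
BATCH_SIZE = 4  # the slides say ~1000 rows; 4 keeps the example readable

def make_batches(columns, batch_size=BATCH_SIZE):
    """Split column vectors into batches, each with a qualifying-row bitmap."""
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, batch_size):
        vectors = {name: values[start:start + batch_size]
                   for name, values in columns.items()}
        size = len(next(iter(vectors.values())))
        yield {"columns": vectors, "qualifying": [True] * size}

def filter_batch(batch, column, predicate):
    """Filter operator: clears bitmap bits instead of copying rows."""
    values = batch["columns"][column]
    batch["qualifying"] = [keep and predicate(value)
                           for keep, value in zip(batch["qualifying"], values)]
    return batch

def sum_batch(batch, column):
    """Aggregate operator: sums only the rows still marked as qualifying."""
    return sum(value
               for keep, value in zip(batch["qualifying"], batch["columns"][column])
               if keep)

columns = {
    "OrderDateKey": [20101107, 20101107, 20101108, 20101108, 20101109, 20101109],
    "SalesAmount": [30.0, 17.0, 25.0, 14.0, 20.0, 17.0],
}
total = 0.0
for batch in make_batches(columns):
    filter_batch(batch, "OrderDateKey", lambda date: date < 20101109)
    total += sum_batch(batch, "SalesAmount")
print(total)
```

Each operator is invoked once per batch rather than once per row, and the filter only flips bits in the bitmap, which is the intuition behind "minimizes instructions per row".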
#1 Takeaway!
Make sure most of the work of the query happens in batch mode
Loading Columnstores Effectively
Loading new data into a columnstore index
Tables with columnstores can be read, not updated
• Partition switching allowed
• INSERT, UPDATE, DELETE, and MERGE not allowed
Recommended methods for loading data
• Disable, update, rebuild
• Partition switching
• UNION ALL
Adding Data Using Disable, Update, Rebuild
• Disable (or drop) the columnstore index
    ALTER INDEX my_index ON MyTable DISABLE
• Update the table
• Rebuild the columnstore index
    ALTER INDEX my_index ON MyTable REBUILD
Adding Data Using Partition Switching
• Columnstores must be partition-aligned
• Partition switching fully supported
• To add data daily
  • Partition by day
  • Every day
    • Split last partition
    • Load data into staging table and columnstore index it
    • Switch it in
• Avoids costly drop/rebuild
Adding Data Using UNION ALL (trickle load)
• Master table (columnstore)
• Delta table (rowstore)
• Query using UNION ALL (local-global aggregation workaround)
• Add Delta to Master nightly
Achieving Fast Columnstore Index Builds
• Memory intensive
  • Memory requirement related to # of columns, data, DOP
• Index build is parallel only if table has > 1 million rows
  • One thread per segment
• Low memory throttles parallelism
• Consider
  • High min server memory setting
  • Set REQUEST_MAX_MEMORY_GRANT_PERCENT to 50
  • Add memory
  • Omit columns
  • Reduce parallelism
    create columnstore index <name> on <table>(<columns>) with (maxdop = 1);
Optimizing database and index design
Eliminating Unsupported Data Types
Current unsupported types for columnstores:
• decimal > 18 digits
• Binary
• BLOB
• (n)varchar(max)
• Uniqueidentifier
• Date/time types > 8 bytes
• CLR
Omit column from columnstore, or
Modify column type to supported type
• Reduce precision of numerics to 18 digits or less
• Convert GUIDs to ints
• Reduce precision of datetimeoffset to 2 or less
• Convert hierarchyid to int or string
Reduce Nonclustered B-trees
• Covering B-trees are no longer needed on source table
• Extra B-trees can cause optimizer to choose poor plan
• Save space
• Reduce ETL time
Ensuring segment elimination by date
• Use clustered B-tree on date in source table
  • Columnstore inherits order
• Or, partition by date
• Ordering by load date, ship date, order date etc. can all work
  • Dates are naturally correlated
Design out strings from columnstores
• String filters don’t get pushed to storage engine
  • more batches to process
  • defeats segment elimination
• Joining on string columns is slow
• Factor strings out to dimensions

Fact table with a string column:
Date LicenseNum Measure
20120301 XYZ123 100
20120302 ABC777 200

Fact table with an integer key:
Date LicenseId Measure
20120301 1 100
20120302 2 200

Dimension table:
LicenseId LicenseNum
1 XYZ123
2 ABC777
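Factoring a string column out to a dimension amounts to assigning each distinct string an integer surrogate key. A minimal sketch, assuming keys are handed out in order of first appearance (the function name and row layout are hypothetical):

```python
def factor_out_strings(fact_rows, string_col, id_col):
    """Replace a string column with an integer key and build the dimension."""
    ids = {}          # string value -> surrogate key
    new_rows = []
    for row in fact_rows:
        value = row[string_col]
        # Assign the next integer key the first time a string is seen.
        key = ids.setdefault(value, len(ids) + 1)
        new_row = {name: v for name, v in row.items() if name != string_col}
        new_row[id_col] = key
        new_rows.append(new_row)
    dimension = [{id_col: key, string_col: value} for value, key in ids.items()]
    return new_rows, dimension

fact = [
    {"Date": 20120301, "LicenseNum": "XYZ123", "Measure": 100},
    {"Date": 20120302, "LicenseNum": "ABC777", "Measure": 200},
]
new_fact, license_dim = factor_out_strings(fact, "LicenseNum", "LicenseId")
print(new_fact)
print(license_dim)
```

The fact table then joins and filters on the integer LicenseId, and the string is looked up in the small dimension table only when it is needed.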
Optimizing queries
Best Practices
• Use star schema
• Put columnstores on large tables only
• Include every column of table in columnstore index
• Use integer surrogate keys for joins
Forcing use or non-use of Columnstores
• Query hint
    OPTION(IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX)
• Index hint
    … FROM F WITH(index=MyColumnStore) …
    … FROM F WITH(index=MyClusteredBtree) …
Things to Avoid
• Join/filter on string columns
• Join pairs of very large tables if you don’t have to
• NOT IN <subquery> on columnstore table
• OUTER JOIN on columnstore table
• UNION ALL to combine columnstore tables with other tables
Common workarounds
demo
Example need for a workaround
The common theme
Since there are some queries that the optimizer won’t be able to run in batch mode…
• Check execution plan to verify batch mode
• Find the subset that can run in batch mode
• Rewrite query to run mostly in batch mode
• Join to the rest of the data
#1 Takeaway!
Make sure most of the work of the query happens in batch mode
Outer Join Example & Workaround
Outer join prevents batch processing
Rewrite query
• Inner join in batch mode
• Left join to complete the data set

Original query (6.4 sec elapsed, 55 CPU-seconds):
    select m.Title, COUNT(p.IP) PurchaseCount
    from Media m
    left outer join Purchase p on p.MediaId = m.MediaId
    group by m.Title
    order by COUNT(p.IP) desc

Rewritten query (0.2 sec elapsed, 1.9 CPU-seconds):
    with T (Title, PurchaseCount) as (
        select m.Title, COUNT(p.IP) PurchaseCount
        from Media m
        join Purchase p on p.MediaId = m.MediaId
        group by m.Title
    )
    select distinct m.Title, ISNULL(T.PurchaseCount, 0) as PurchaseCount
    from Media m
    left outer join T on m.Title = T.Title
    order by ISNULL(T.PurchaseCount, 0) desc;
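A quick way to convince yourself that the rewrite returns the same rows is to run both forms against a small data set. The sketch below uses Python's built-in sqlite3 module purely as a convenient SQL engine, so it checks logical equivalence only and says nothing about batch mode or timing. The sample rows are made up, COALESCE stands in for T-SQL's ISNULL, and a Title tie-breaker is added to the ORDER BY so the comparison is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Media (MediaId integer primary key, Title text);
    create table Purchase (MediaId integer, IP text);
    insert into Media values (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
    insert into Purchase values
        (1, '10.0.0.1'), (1, '10.0.0.2'), (2, '10.0.0.3');
""")

# Original form: outer join directly against the large Purchase table.
original = conn.execute("""
    select m.Title, count(p.IP) as PurchaseCount
    from Media m
    left outer join Purchase p on p.MediaId = m.MediaId
    group by m.Title
    order by PurchaseCount desc, m.Title
""").fetchall()

# Rewritten form: inner join and aggregate first, then outer join the
# small aggregated result back to Media to restore the zero-count titles.
rewritten = conn.execute("""
    with T (Title, PurchaseCount) as (
        select m.Title, count(p.IP)
        from Media m
        join Purchase p on p.MediaId = m.MediaId
        group by m.Title
    )
    select distinct m.Title, coalesce(T.PurchaseCount, 0) as PurchaseCount
    from Media m
    left outer join T on m.Title = T.Title
    order by PurchaseCount desc, m.Title
""").fetchall()

print(original)
print(rewritten)
```

The title with no purchases ('Gamma') appears with a count of 0 in both results, which is the part the final left join is responsible for.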
IN and EXISTS Example & Workaround
Using IN and EXISTS with subqueries can prevent batch mode execution
IN ( <constants list> ) typically works fine
• Example: MediaId IN (23263, 29637, 27208)

Original queries (3.0 sec elapsed, 32 CPU-seconds):
    select p.Date, count(*)
    from Purchase p
    where p.MediaId in (select MediaId from MediaStudyGroup)
    group by p.Date
    order by p.Date;
    --or--
    select p.Date, count(*)
    from Purchase p
    where exists (select m.MediaId from MediaStudyGroup m where m.MediaId = p.MediaId)
    group by p.Date
    order by p.Date;

Workaround (0.05 sec elapsed, 0.3 CPU-seconds):
    select p.Date, count(*)
    from Purchase p
    join MediaStudyGroup m on p.MediaId = m.MediaId
    group by p.Date
    order by p.Date;
Union All Example
UNION ALL can prevent batch mode execution

    create view vPurchase as
    select * from Purchase
    union all
    select * from DeltaPurchase;

Batch mode, 0.1 sec elapsed:
    select p.date, d.DayNumOfMonth, count(*)
    from vPurchase as p, Date d
    where p.Date = d.DateId
    group by p.date, d.DayNumOfMonth;

Row mode, 19 sec elapsed:
    select p.date, d.DayNumOfMonth, m.Genre, count(*)
    from vPurchase p, Date d, Media m
    where p.Date = d.DateId and m.MediaId = p.MediaId
    group by p.date, d.DayNumOfMonth, m.Genre
Union All Workaround
• Push GROUP BY and aggregation over UNION ALL
• Do final GROUP BY and aggregation of results
• Called “local-global aggregation”

Batch mode, 0.3 sec elapsed:
    with MainSummary (date, DayNumOfmonth, Genre, c) as (
        select p.date, d.DayNumOfMonth, m.Genre, count(*) c
        from Purchase p, Date d, Media m
        where p.Date = d.DateId and m.MediaId = p.MediaId
        group by p.date, d.DayNumOfMonth, m.Genre
    ),
    DeltaSummary (date, DayNumOfmonth, Genre, c) as (
        select p.date, d.DayNumOfMonth, m.Genre, count(*) c
        from DeltaPurchase p, Date d, Media m
        where p.Date = d.DateId and m.MediaId = p.MediaId
        group by p.date, d.DayNumOfMonth, m.Genre
    ),
    CombinedSummary (date, DayNumOfMonth, Genre, c) as (
        --union all across the output of the two queries
        select * from MainSummary
        UNION ALL
        select * from DeltaSummary
    )
    --group by to aggregate the data.
    select t.date, t.DayNumOfmonth, t.Genre, sum(c) as c
    from CombinedSummary as t
    group by t.date, t.DayNumOfmonth, t.Genre;
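Local-global aggregation relies on counts being additive: counting each table separately and summing the partial counts gives the same answer as counting over the UNION ALL. The sketch below checks that on a simplified two-table version of the example, using Python's sqlite3 module only as a handy SQL engine. The column name PurchaseDate and the sample rows are assumptions, and the check covers logical equivalence, not batch mode.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Purchase (PurchaseDate integer, MediaId integer);
    create table DeltaPurchase (PurchaseDate integer, MediaId integer);
    insert into Purchase values (20120301, 1), (20120301, 2), (20120302, 1);
    insert into DeltaPurchase values (20120302, 3), (20120303, 1);
""")

# Direct form: aggregate over the UNION ALL of both tables.
direct = conn.execute("""
    select PurchaseDate, count(*) as c
    from (select * from Purchase union all select * from DeltaPurchase)
    group by PurchaseDate
    order by PurchaseDate
""").fetchall()

# Local-global form: aggregate each table on its own (local), then
# combine the partial counts with SUM (global).
local_global = conn.execute("""
    with MainSummary (PurchaseDate, c) as (
        select PurchaseDate, count(*) from Purchase group by PurchaseDate
    ),
    DeltaSummary (PurchaseDate, c) as (
        select PurchaseDate, count(*) from DeltaPurchase group by PurchaseDate
    )
    select PurchaseDate, sum(c) as c
    from (select * from MainSummary union all select * from DeltaSummary)
    group by PurchaseDate
    order by PurchaseDate
""").fetchall()

print(direct)
print(local_global)
```

The same pattern works for SUM, MIN, and MAX; an average needs the sum and the count carried separately through the local step.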
Scalar Aggregates Example & Workaround
Aggregate without group by doesn’t get batch processing
Workaround: Add a group by!

Original query (1.0 sec elapsed, 15 CPU-seconds):
    select count(*) from Purchase

Workaround (0.06 sec elapsed, 0.3 CPU-seconds):
    with CountByDate (Date, c) as (
        select Date, count(*) from Purchase group by Date
    )
    select sum(c) from CountByDate;
Multiple DISTINCT aggregates example
• Generates a table spool
• Spool write/read is single threaded
• SQL Server 2012 runs queries with 1 DISTINCT agg and 1 or more non-distinct aggs in batch mode without any spool!

26 sec elapsed, 31 CPU-seconds:
    select p.Date,
           count(distinct p.UserId) as UserIdCount,
           count(distinct p.MediaId) as MediaIdCount
    from Purchase p, Media m
    where p.MediaId = m.MediaId and m.Category in ('Horror')
    group by p.Date;
Multiple DISTINCT aggregates workaround
• Form each DISTINCT aggregate in a separate subquery
• Join results on grouping keys

0.5 sec elapsed, 6 CPU-seconds:
    with DistinctMediaIds (Date, MediaIdCount) as (
        select p.Date, count(distinct p.MediaId) as MediaIdCount
        from Purchase p, Media m
        where p.MediaId = m.MediaId and m.Category in ('Horror')
        group by p.Date
    ),
    DistinctUserIds (Date, UserIdCount) as (
        select p.Date, count(distinct p.UserId) as UserIdCount
        from Purchase p, Media m
        where p.MediaId = m.MediaId and m.Category in ('Horror')
        group by p.Date
    )
    select m.Date, m.MediaIdCount, u.UserIdCount
    from DistinctMediaIds m
    join DistinctUserIds u on m.Date = u.Date
Summary
Keys to fast query processing
• Columnstore Index + Batch mode = amazing performance
• Column and segment elimination greatly reduce data demand
Working with the read-only property of columnstores:
• Drop, Update, Rebuild
• Partition Switching
• UNION ALL method
Future work will reduce need for query tuning
• For now, make sure most work happens in batch mode
More Information
Columnstore Tuning Guide http://social.technet.microsoft.com/wiki/contents/articles/sql-server-columnstore-performance-tuning.aspx
Columnstore FAQ and links to Customer Case Studies
http://social.technet.microsoft.com/wiki/contents/articles/sql-server-columnstore-index-faq.aspx
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.