Data Warehouse in the Cloud –
Marketing or Reality?
Alexei Khalyako
Sr. Program Manager
Windows Azure Customer Advisory Team
Data Warehouse we used to know
• High-End workload
• High-End hardware
• Special know-how
*BeyeNetwork Big Data research
Reality is
• Thousands of departmental level DW
• Relatively low performance SLA
New BI demands
• Utilize external data sources
• Non Structured Data
• Origin is in the Cloud
*BeyeNetwork Big Data research
New opportunity
• Platform is there
– IaaS: SQL Server VM
– PaaS: SQL Azure DB
• “Closer” to data
• Less administrative overhead
• Lower initial cost and TCO
SQL Server Data Warehousing in Windows Azure Virtual Machines
• Inspired by the Fast Track Reference Architecture guide
• Based on the High Memory images
• Up to 1TB
• MSDN: SQL Server Data Warehousing in Windows Azure Virtual Machines
High Memory VM in Azure
How to deploy
• PowerShell script
• Windows Azure Gallery
The Azure Data Warehouse under the hood
Data Warehouse Lifecycle
• Thoughts on the architecture
• Creating DB
• Connectivity
• Populating Database
– Initial data loading OR
– Backup/Restore
– Incremental data loading
– Compression
• Query performance
Thoughts on the architecture
• Data Loading
– Minimize Log impact
– Scale loading streams
– Do not reinvent the wheel: follow the Data Loading Performance Guide
• Query Performance
Windows Azure VM Architecture
• Disks implemented as a shared multi-tenant service
• Built-in triple redundancy, optional geo-redundancy
• Performance less predictable than on-prem
– Host machines, storage services, network bandwidth shared between subscribers
– Perf can depend on where and when VM is provisioned
– Subject to maintenance operations
– Granular control & configurability vs. cost, simplicity, out of box redundancy
[Diagram: two Storage Stamps, each with LB, Front-ends, Partition Layer, and Stream Layer with intra-stamp replication; geo-replication between the stamps; Storage Location Service]
Tweaks to improve IO Subsystem
• Database file initialization
– GPEdit.msc
• Data file placement
– SQL Striping for User Data and TempDB
– Aggregated throughput
– Set the size and data growth options wisely
[Diagram labels: Primary, Log]
*You may do it differently. Then:
• Creating a 350GB DB took ~3 hours
Scaling IO Options
Windows Storage Spaces
• Log drive
• Support story is not clear
SQL Data Files
• Spread File Group over all drives
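Spreading a file group over all data drives can be sketched as a generated CREATE DATABASE statement. This is only an illustration: the database name, drive letters, file group name, and sizes are hypothetical and do not come from the slides.

```python
from typing import List


def file_spec(name: str, path: str, size_gb: int, growth_gb: int) -> str:
    """Format one T-SQL file specification."""
    return (f"  (NAME = N'{name}', FILENAME = N'{path}', "
            f"SIZE = {size_gb}GB, FILEGROWTH = {growth_gb}GB)")


def create_database_sql(db: str, data_drives: List[str], log_drive: str,
                        file_size_gb: int, growth_gb: int) -> str:
    """Build a CREATE DATABASE statement with one data file per data drive.

    All names, drive letters, and sizes are hypothetical placeholders.
    """
    # Small primary file on the first data drive; user data goes to its own file group.
    primary = file_spec(f"{db}_primary", f"{data_drives[0]}:\\{db}_primary.mdf", 1, 1)
    # One equally sized file per drive so SQL Server stripes writes across them.
    data = ",\n".join(
        file_spec(f"{db}_data{i}", f"{drive}:\\{db}_data{i}.ndf", file_size_gb, growth_gb)
        for i, drive in enumerate(data_drives, start=1)
    )
    log = file_spec(f"{db}_log", f"{log_drive}:\\{db}_log.ldf", file_size_gb, growth_gb)
    return (f"CREATE DATABASE [{db}]\nON PRIMARY\n{primary},\n"
            f"FILEGROUP [UserData]\n{data}\nLOG ON\n{log};")


if __name__ == "__main__":
    print(create_database_sql("DW", ["F", "G", "H"], "L", 100, 10))
```

Pre-sizing the files equally keeps the proportional fill even across drives, which is the point of setting size and growth options up front.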
Scaling IO Options
Data disk (read), LOG (write)

SQLIO cumulative throughput metrics:

Configuration                       Block size   IOs/sec    MBs/sec
Single Data Disk                    64K          1215.13     75.94
Windows Storage Spaces x3 Disks     64K          2677.69    167.35
SQL Striping x3 Disks *             64K          2742.22    171.38
Single Data Disk                    256K          288.00     71.98
Windows Storage Spaces x3 Disks     256K          640.87    160.21
SQL Striping x3 Disks               256K          599.91    149.97

* But we can access one file at a time!
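The SQLIO figures can be cross-checked with simple arithmetic: MB/sec is IOs/sec multiplied by the block size. A minimal sketch, assuming 1 MB = 1024 KB and that the results appear in the same order as the test names on the slide:

```python
def throughput_mb_per_sec(iops: float, block_kb: int) -> float:
    """Convert IOs/sec at a fixed block size (in KB) to MB/sec."""
    return iops * block_kb / 1024.0


# (configuration, block size in KB, reported IOs/sec)
results = [
    ("Single Data Disk", 64, 1215.13),
    ("Windows Storage Spaces x3", 64, 2677.69),
    ("SQL Striping x3", 64, 2742.22),
    ("Single Data Disk", 256, 288.0),
    ("Windows Storage Spaces x3", 256, 640.87),
    ("SQL Striping x3", 256, 599.91),
]

if __name__ == "__main__":
    for name, block_kb, iops in results:
        mb = throughput_mb_per_sec(iops, block_kb)
        print(f"{name:<28} {block_kb:>4}K {mb:8.2f} MB/sec")
```

Both three-disk options deliver a bit more than twice the single-disk throughput at either block size.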
Connectivity Options
• Windows Azure VM End Points
• Point-to-Site / Site-to-Site
*Other options are also available (FTP)
What and how we tested
Getting initial data
• Copy backup to the Data Disks
• Backup/Restore to/from URL
• ETL to the new DB
URL is fast!

Backup target        DB Size   Time      Speed
Local Data Disk      244GB     3 hours   22.978 MB/sec
URL                  244GB     46 min    90.667 MB/sec
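As a sanity check, the backup speeds follow from database size divided by elapsed time, which gives roughly 23 MB/sec to the local data disk and roughly 90 MB/sec to the URL. A minimal sketch, assuming 1 GB = 1024 MB and the rounded durations shown on the slide:

```python
def backup_rate_mb_per_sec(size_gb: float, minutes: float) -> float:
    """Average backup throughput in MB/sec for a given size and duration."""
    return size_gb * 1024.0 / (minutes * 60.0)


if __name__ == "__main__":
    local_disk = backup_rate_mb_per_sec(244, 180)  # 3 hours
    to_url = backup_rate_mb_per_sec(244, 46)       # 46 minutes
    print(f"Local data disk: {local_disk:.1f} MB/sec")
    print(f"URL:             {to_url:.1f} MB/sec")
    print(f"Speedup:         {to_url / local_disk:.1f}x")
```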
DB and Data Loading
? Data loading
? Tools (BCP, SSIS..)
? Time SLA
? Query Performance
? Indexing strategy
? Sizing
? Compression
Loading Data in Azure
• Smaller batches (10K-15K rows)
• Retry logic
• Network latency is high
• Parallel loading!!
Start with:
• SSIS for Hybrid Data Movement
• SSIS Performance and Operational guide
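The small-batch and retry advice can be sketched in a few lines. This is not the SSIS approach from the guides, just an illustration: `send_batch` is a hypothetical callable that writes one batch to the database, and the retry count and back-off delays are assumptions.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batches(rows: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_with_retry(rows: Iterable,
                    send_batch: Callable[[List], None],  # hypothetical writer
                    batch_size: int = 10_000,
                    max_attempts: int = 5,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Send rows in small batches, retrying each batch on transient errors."""
    loaded = 0
    for chunk in batches(rows, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                send_batch(chunk)
                break
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
                # Exponential back-off: 1s, 2s, 4s, ...
                sleep(base_delay * 2 ** (attempt - 1))
        loaded += len(chunk)
    return loaded
```

Retrying a whole batch is only safe if a failed batch is rolled back as a unit; small batches keep the cost of each retry low on a high-latency link.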
Baseline
• Understand Data Sources performance
– Flat File in Azure VM: ~60 MB/sec reads
– SQLIO shows the max throughput of the IO subsystem on the DB side
– App performance can be different
Parallel Loading
• Flat file sources: max 60 MB/sec each
• 8 destinations to keep all CPU busy on the DW site
• Mod(7) function
Begin to load
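Splitting rows across parallel loading streams with a modulo function can be sketched as follows. The stream count is left as a parameter because the slide mentions both 8 destinations and a Mod(7) function; the row shape and the `load_stream` callable are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

Row = Tuple[int, str]  # hypothetical row: (integer key, payload)


def partition_by_mod(rows: Iterable[Row], n_streams: int) -> Dict[int, List[Row]]:
    """Assign each row to a stream using key modulo the stream count."""
    parts: Dict[int, List[Row]] = {i: [] for i in range(n_streams)}
    for row in rows:
        parts[row[0] % n_streams].append(row)
    return parts


def load_parallel(rows: Iterable[Row], n_streams: int,
                  load_stream: Callable[[int, List[Row]], int]) -> int:
    """Run one loader per stream concurrently and return total rows loaded.

    `load_stream(stream_id, rows)` is a hypothetical loader that returns
    the number of rows it wrote.
    """
    parts = partition_by_mod(rows, n_streams)
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        counts = pool.map(lambda item: load_stream(item[0], item[1]), parts.items())
        return sum(counts)
```

With evenly distributed keys each stream receives a similar share of rows, so no single destination becomes the bottleneck.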
Monitoring Loading Performance
Expect these top waits:
• ASYNC_NETWORK_IO
• PAGEIOLATCH_EX
• WRITELOG
• PAGEIOLATCH_UP
• SOS_SCHEDULER_YIELD
• PAGEIOLATCH_SH
• PAGELATCH_UP
• PREEMPTIVE_OLEDBOPS
Network IO
Disk IO
CPU
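One way to read the top waits is to group them by the resource they point at. The grouping below is an assumption based on the general meaning of these wait types, not a mapping taken from the slide, and the sample wait times are made up; in practice the input would come from something like the wait_type and wait_time_ms columns of sys.dm_os_wait_stats.

```python
from typing import Dict

# Assumed grouping of wait types into the three resource categories.
WAIT_CATEGORY: Dict[str, str] = {
    "ASYNC_NETWORK_IO": "Network IO",
    "PAGEIOLATCH_EX": "Disk IO",
    "PAGEIOLATCH_UP": "Disk IO",
    "PAGEIOLATCH_SH": "Disk IO",
    "WRITELOG": "Disk IO",
    "SOS_SCHEDULER_YIELD": "CPU",
}


def summarize_waits(wait_ms: Dict[str, float]) -> Dict[str, float]:
    """Sum wait time per category; unmapped wait types go to 'Other'."""
    totals: Dict[str, float] = {}
    for wait_type, ms in wait_ms.items():
        category = WAIT_CATEGORY.get(wait_type, "Other")
        totals[category] = totals.get(category, 0.0) + ms
    return totals


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = {
        "ASYNC_NETWORK_IO": 5000.0,
        "WRITELOG": 3000.0,
        "PAGEIOLATCH_EX": 2000.0,
        "SOS_SCHEDULER_YIELD": 1000.0,
        "PAGELATCH_UP": 500.0,
    }
    for category, ms in sorted(summarize_waits(sample).items(), key=lambda kv: -kv[1]):
        print(f"{category:<11} {ms:>8.0f} ms")
```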
Loading: table options
Heap
• 780 772 573 rows
• Elapsed time: 01:06:15.313
Heap compressed
• 780 772 573 rows
• Elapsed time: 05:12:06.094
Loading: table options
Heap
• 780 772 573 rows
• Elapsed time: 01:06:15.313
Clustered Index
• Sort!
• Elapsed time: 01:20:12.547
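The elapsed times are easier to compare as load rates in rows per second. A minimal sketch, assuming the clustered index load used the same 780 772 573 rows as the heap loads:

```python
def parse_elapsed(hms: str) -> float:
    """Convert 'HH:MM:SS.mmm' to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def rows_per_sec(rows: int, elapsed: str) -> float:
    """Average load rate for a row count and an elapsed-time string."""
    return rows / parse_elapsed(elapsed)


ROWS = 780_772_573  # assumed identical for all three loads

timings = {
    "Heap": "01:06:15.313",
    "Heap compressed": "05:12:06.094",
    "Clustered Index": "01:20:12.547",
}

if __name__ == "__main__":
    for option, elapsed in timings.items():
        print(f"{option:<16} {rows_per_sec(ROWS, elapsed):>10,.0f} rows/sec")
```

On these numbers the compressed heap loads several times slower than the plain heap, while the clustered index load is only moderately slower.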
Query Performance
• Heap
• Primary Key/Clustered Index
• Compression
Query performance: results
[Chart: query execution time by query type for Heap, Clustered Index, and Clustered Index Compressed]
Please welcome on stage SQL 2014
What’s new?
• Data files to BLOBs
• Updateable Clustered Column Store index
Loading data: Clustered Columnstore Index (SQL 2014)
• Load test: 2 hours 16 min
• Heap: 1 hour 1 min
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
CI 708 92. 385 695 834 413 407 508 971 791 251 529 209 393 345 108 729 725 742 111 221 214
CIComp 388 64. 207 382 483 162 253 485 653 451 241 279 158 139 147 85. 387 405 412 108 120 108
CSI 18. 9.9 20. 16. 20. 7.4 11. 24. 70. 63. 41. 14. 96. 12. 6.9 64. 11. 100 9.7 31. 51. 18.
[Chart: query execution time by query type for CI, CIComp, and CSI; speedup labels on the chart: x51, x39x2, x55]
Query 19
Estimates vs Actual
And the winner is…
SQL Server 2014!!
Summary
• Easy and fast deployment through the Gallery or PowerShell scripts
• Azure Data Warehouse is consistent with most of the best practices
– Query
– Loading
• Low initial investment and TCO
THANK YOU!
• For attending this session and PASS SQLRally Nordic 2013, Stockholm