DESCRIPTION
This is my presentation for SQL Saturday Philly 2012. The topic is managing SQL Server data warehouses, with a look at the SQL Server data warehouse landscape and the challenges a DBA must prepare for in large DW workloads and BI solutions.
Microsoft SQL Server Data Warehouses for SQL DBAs
SQL Saturday Philly, June 9, 2012

MSSQLDUDE
Twitter: @mssqldude
http://mssqldude.wordpress.com
SQL Server Magazine BI Blog: http://www.sqlmag.com/blog/sql-server-bi-blog-17
mkromer@microsoft.com

Twitter: @jdanton
http://joedantoni.wordpress.com
jdanton1@yahoo.com
Agenda
• The SQL Server data warehouse landscape
• Managing and monitoring the data warehouse
  − Managing fragmentation
  − Managing reporting users (SSRS)
• Tuning the DW databases
  − Compression considerations
  − TempDB & user database layout
• Data loading
  − SSIS (ETL)
• SAN & data file recommendations
• End-user BI solutions
  − Semantic models (SSAS)
  − In-memory analytics
Microsoft Data Warehousing Offerings (Tier 1)

SQL Server Enterprise (software only)
• Scalable and reliable platform for data warehousing on any hardware
• Ideal for data marts or small to mid-sized enterprise data warehouses (EDWs)
• Offers flexibility in hardware and architecture
• Scale-up data warehousing; 10s of terabytes

Fast Track Data Warehouse (reference architectures: software and hardware)
• Reference architectures offering the best price/performance for data warehousing
• Ideal for data marts or small to mid-sized DWs with scan-centric workloads
• Scale-up data warehousing; 4-80 terabytes

HP Business DW Appliance (integrated appliance: software and hardware)
• An affordable SMP solution for data warehousing on optimized hardware
• Ideal for small data marts or DWs with scan-centric workloads
• Scale-up data warehousing; up to 5 terabytes

Parallel Data Warehouse (fully integrated DW appliance)
• Appliance for high-end data warehousing requiring the highest scalability, performance, or complexity
• Scale-out data warehousing with massively parallel processing (MPP); 10s-100s of terabytes
Some Data Warehouses Today
• Big SAN + big SMP server, connected together
• What's wrong with this picture? Answer: the system is out of balance
  − The server can consume 12 GB/s of I/O, but the SAN can only deliver 2 GB/s - even when the SAN is dedicated to the SQL data warehouse, which it often isn't
  − Queries are slow despite significant investment in both server and storage
• Result: significant investment, not delivering performance
Challenges of traditional Data Warehouse
The Alternative: A Balanced System
• Design a server + storage configuration that can deliver all the I/O bandwidth the CPUs can consume when executing a SQL relational DW workload
• Avoid sharing storage devices among servers
• Avoid overinvesting in disk drives
Key to Fast Track Data Warehouse Architecture
Data Warehouse appliances for SQL Server
SQL Server Fast Track Data Warehouse
• A method for designing a cost-effective, balanced system for data warehouse workloads
• Reference hardware configurations developed in conjunction with hardware partners using this method
• Best practices for data layout, loading, and management
• A solution to help customers and partners accelerate their data warehouse deployments
Fast Track Data Warehouse Components
Software:
• SQL Server 2008 R2 Enterprise
• Windows Server 2008 R2
Hardware:
• Tight specifications for servers, storage, and networking
• 'Per core' building block
Configuration guidelines:
• Physical table structures
• Indexes
• Compression
• SQL Server settings
• Windows Server settings
• Loading
Core Fast Track Metrics
• These metrics are used to validate Fast Track:
  − Maximum Consumption Rate (MCR): the ability of SQL Server to process data from memory for a specific CPU and server combination and a standard SQL query.
  − Benchmark Consumption Rate (BCR): the ability of SQL Server to process data from disk for a specific CPU and server combination and a user workload or query.
System Benchmarking - MCR
• MCR measures the rate at which SQL Server can process data from memory for a given CPU and server
  − Measured per physical core
  − Page-compressed data
• Similar in concept to the "miles per gallon" rating for a new car - not necessarily what you will see when you drive the car, but a good starting point
• Current value for published Fast Track RAs: 200 MB/s per core
Establishing Fast Track MCR
• Create a dataset that contains at least one table that will fit in memory
  − Enable PAGE-level compression
  − Use data representative of the target workload
• Load the table into memory by executing your chosen query against the table
  − After loading into memory, execute a query that scans the table and check that there is no disk activity
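The steps above can be sketched in T-SQL; the table name and MAXDOP value are placeholders of my own, not part of the Fast Track specification.

```sql
-- Hypothetical MCR measurement sketch: table name and MAXDOP are assumptions.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- First run warms the buffer pool with the page-compressed table.
SELECT COUNT_BIG(*) FROM dbo.LineItem_MCR OPTION (MAXDOP 4);

-- Second run should report zero physical reads; estimate MCR as
-- (logical reads x 8 KB) / CPU time, divided by the number of cores in use.
SELECT COUNT_BIG(*) FROM dbo.LineItem_MCR OPTION (MAXDOP 4);
```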
System Benchmarking - BCR
• BCR measures the rate at which SQL Server can process data from disk for a given CPU and server for a unique customer workload
  − Measured per physical core
  − Page-compressed data
• Similar in concept to the "actual miles per gallon" you get with your current driving habits
• A BCR that is 30% below the MCR rating for the system indicates that a higher-MCR-rated system should be chosen
Establishing Fast Track BCR
• To determine BCR, create a dataset of at least one table
  − The table should be large enough not to fit entirely in the buffer pool or the SAN array cache
  − You can use a synthetic dataset if customer data is not available
  − If using a synthetic dataset, it is important to approximate the expected characteristics of the targeted data
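Since BCR must be measured from disk, a cold-cache run is needed; a minimal sketch (object names are hypothetical, and flushing the buffer pool belongs on test systems only):

```sql
-- Hypothetical BCR measurement sketch: flush the buffer pool so the
-- representative query reads from disk (do this on test systems only).
CHECKPOINT;
DBCC DROPCLEANBUFFERS;

SET STATISTICS TIME ON;

-- A representative customer workload query against the benchmark dataset.
SELECT OrderDateKey, SUM(SalesAmount) AS Revenue
FROM dbo.FactSales_BCR
GROUP BY OrderDateKey
OPTION (MAXDOP 8);
```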
DEMO: Managing Large DWs on SQL Server EE - MCR & BCR
Fast Track Reference Configurations
• 2-processor configurations (5-20 TB, 2-3.7 GB/s): HP ProLiant DL380 G7, HP ProLiant DL385 G7, IBM System x3650 M3, Bull Novascale R460 E2
• 4-processor configurations (20-40 TB, 3.5-7.5 GB/s): HP ProLiant DL580 G7, HP ProLiant DL585 G7, IBM System x3850 X5, Bull Novascale R480 E1
• 8-processor configurations (40-80 TB, 7.5-14 GB/s): HP ProLiant DL980 G7
Capacities represent a storage array fully populated with 300 GB 15k SAS drives and a 3:1 compression ratio, including the addition of one storage expansion tray per enclosure.
Data Warehouse Workload Characteristics
• Scan intensive
• Hash joins
• Aggregations

Example query:

SELECT L_RETURNFLAG, L_LINESTATUS,
       SUM(L_QUANTITY) AS SUM_QTY,
       SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE,
       SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)) AS SUM_DISC_PRICE,
       SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)) AS SUM_CHARGE,
       AVG(L_QUANTITY) AS AVG_QTY,
       AVG(L_EXTENDEDPRICE) AS AVG_PRICE,
       AVG(L_DISCOUNT) AS AVG_DISC,
       COUNT(*) AS COUNT_ORDER
FROM LINEITEM
GROUP BY L_RETURNFLAG, L_LINESTATUS
ORDER BY L_RETURNFLAG, L_LINESTATUS
Software Configuration: SQL Server Startup
• -E: allocate 64 extents at a time (4 MB)
  − This is not a guarantee of logically contiguous extent allocation
• -T1117: autogrow all files in a filegroup in even increments
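-E can only be set as a startup parameter (for example via SQL Server Configuration Manager), but the even-growth behavior of trace flag 1117 can also be turned on globally at runtime, as a quick sketch:

```sql
-- Enable TF1117 globally (equivalent to -T1117 at startup; lasts until restart).
DBCC TRACEON (1117, -1);

-- Confirm which trace flags are currently active.
DBCC TRACESTATUS (-1);
```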
Software Configuration: TempDB
• Follow standard tempdb best practices
  − Autogrow should be enabled for tempdb
  − Use a large growth increment (10% of initial size)
• Create one tempdb data file per LUN
  − Make all files the same size
• Allocate the transaction log to a dedicated log LUN
• Sizing tempdb
  − Typically 20-30% of primary data space
  − Tempdb is not compressed
Software Configuration: TempDB & TLOG
• TempDB
  − Preallocate space, and add a single data file per LUN
  − Make all files the same size
  − Assign tempdb log files to one of the LUNs dedicated to log files
  − Enable autogrow
• TLOG
  − Create a single transaction log file per database on one of the LUNs assigned to transaction log space
  − Spread log files for different databases across the available LUNs, or use multiple log files for log growth as required
  − Enable the autogrow option for log files
• User DB data files
  − Remember: do NOT enable autogrow
  − Avoid fragmentation!
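The tempdb layout above might look something like this in T-SQL; the paths, sizes, and file names are illustrative assumptions, not Fast Track mandates.

```sql
-- Sketch: equally sized tempdb files, one per data LUN (paths are assumptions).
ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, SIZE = 25GB);

ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev02,
              FILENAME = 'H:\tempdb\tempdev02.ndf',
              SIZE = 25GB);
-- Repeat ADD FILE for each remaining LUN, keeping every file the same size.
```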
DW Server Baseline Configs
• Configuration: per Fast Track validated RAs - base requirements
  − 2-socket server
  − At least 32 GB of main-board memory
  − At least one open 8x PCIe slot for HBA use
  − Blade servers are not supported
• Memory: 4 GB per core minimum
  − More memory may be warranted for a given customer workload
DEMO: Managing Large DWs on SQL Server EE - Resource Governor, Compression, and Partitioning
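As a rough sketch of the three demo topics, something along these lines could be shown; all object names and thresholds here are hypothetical, not taken from the deck.

```sql
-- Resource Governor: cap ad hoc report users at 30% CPU (a classifier
-- function would still be needed to route sessions into this group).
CREATE RESOURCE POOL ReportPool WITH (MAX_CPU_PERCENT = 30);
CREATE WORKLOAD GROUP ReportGroup USING ReportPool;
ALTER RESOURCE GOVERNOR RECONFIGURE;

-- Compression: rebuild a large fact table with PAGE compression.
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Partitioning: monthly ranges on an integer date key.
CREATE PARTITION FUNCTION pfMonth (int)
    AS RANGE RIGHT FOR VALUES (20120101, 20120201, 20120301);
CREATE PARTITION SCHEME psMonth
    AS PARTITION pfMonth ALL TO ([PRIMARY]);
```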
Fast Track Data Striping
• Fast Track evenly spreads SQL data files across physical RAID-1 disk arrays
[Diagram: a Fast Track storage enclosure of RAID-1 disk pairs. Data files DB1-1.ndf through DB1-8.ndf are placed two per array across four data arrays (ARY01-ARY04, volumes v01-v08); the log file DB1.ldf sits on a separate array (ARY05, volume v09), keeping primary data and log apart.]
User Databases
• Create at least one filegroup containing one data file per LUN
  − FT targets 1:1 LUN-to-CPU-core affinity
  − Make all files the same size
  − This effectively stripes database files across the data LUNs
• Multiple filegroups are necessary
• Best practice: pre-allocate all databases and do not use autogrow
• When autogrow is used, trace flag 1117 will enforce even file growth and maintain the 4 MB extent allocation (stripe width)
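A minimal sketch of such a pre-allocated user database, with one equally sized file per data LUN (two shown) and autogrow disabled on the data files; the database name, paths, and sizes are assumptions.

```sql
-- Pre-allocated user database: equal file sizes, FILEGROWTH = 0 on data files.
CREATE DATABASE FT_DW
ON PRIMARY
    (NAME = FT_DW_01, FILENAME = 'G:\data\FT_DW_01.mdf',
     SIZE = 100GB, FILEGROWTH = 0),
    (NAME = FT_DW_02, FILENAME = 'H:\data\FT_DW_02.ndf',
     SIZE = 100GB, FILEGROWTH = 0)
LOG ON
    (NAME = FT_DW_log, FILENAME = 'L:\log\FT_DW_log.ldf',
     SIZE = 50GB, FILEGROWTH = 5GB);
```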
Transaction Log
• Create a single transaction log file per database and place it on a dedicated log LUN
• Enable autogrow for log files
• The transaction log size for each database should be at least twice the size of the largest DML operation
SQL Server File Layout
[Diagram: tempdb data files (TempDB.mdf, TempDB_02.ndf ... TempDB_16.ndf, 25 GB each), permanent database filegroup files (Permanent_1.ndf ... Permanent_16.ndf), and stage database filegroup files (Stage_1.ndf ... Stage_16.ndf) are spread one per data LUN across LUNs 1-16. The permanent DB log and stage DB log each sit on a dedicated log LUN, and a local drive is reserved for the system.]
SQL Server Parallel Data Warehouse
[Diagram: the control rack contains the control nodes (active/passive), management nodes, landing zone node, and backup node; the data rack contains the SQL Server compute nodes with their storage nodes, plus a spare compute node. The nodes communicate over a dual InfiniBand private network, with dual Fibre Channel connections to storage.]
Linear Scalability
• 1 data rack: 17 servers, 22 procs, 132 cores
• 4 data racks: 47 servers, 82 procs, 492 cores
• Expand to 4 data racks and quadruple your performance and capacity!
PDW POC Query Results (query speed in seconds)

Query   PDW Time   Orig. Time   Speedup
Q1      16         4200         263x
Q2      6          1200         200x
Q3      2          120          60x
Q4      2          120          60x
Q5      2          120          60x
Q6      4          1200         300x

PDW ran 60x-300x faster than the original query speeds.
Parallel Data Warehouse Appliance: Hardware Architecture
[Diagram: control nodes (active/passive), management nodes, landing zone node, and backup node in the control rack; SQL Server compute nodes, storage nodes, and a spare compute node in the data rack, connected by a dual InfiniBand private network and dual Fibre Channel to storage. The corporate network connects client drivers, the ETL load interface, the corporate backup solution, and data center monitoring to the appliance.]
Parallel Data Warehouse Benefits: Massively Parallel Processing
[Diagram: Query 1 is submitted to SQL Server on the control node, executed in parallel on all 10 compute nodes, and the results are sent back to the client.]
Parallel Data Warehouse Benefits: Massively Parallel Processing (continued)
[Diagram: blazing-fast performance comes from parallelizing queries across highly optimized shared-nothing nodes. Multiple queries are simultaneously executed across all nodes, and PDW supports querying while data is loading.]
Data Layout Approaches
• Replicated: a table structure that exists as a full copy within each discrete PDW node.
• Distributed: a table structure that is hashed on a single column and uniformly distributed across all nodes on the appliance. Each distribution is a separate physical table in the DBMS.
• Ultra shared nothing: the ability to design a schema of both distributed and replicated tables to minimize data movement between nodes.
  − Small sets of data can be stored more efficiently in full (replicated).
  − Certain set operations (i.e., single-node operations) are more efficient against full sets of data.
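In PDW's SQL dialect, the two layouts above are declared on the table itself; a sketch with hypothetical table and column names:

```sql
-- A large fact table distributed by hashing a single column across all nodes.
CREATE TABLE dbo.FactSales
(   OrderDateKey int,
    ProductKey   int,
    SalesAmount  money )
WITH (DISTRIBUTION = HASH(ProductKey));

-- A small dimension replicated in full on every node, so joins to it
-- never require data movement.
CREATE TABLE dbo.DimRegion
(   RegionKey  int,
    RegionName varchar(50) )
WITH (DISTRIBUTION = REPLICATE);
```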
Software Architecture
[Diagram: clients (query tools, MS BI - AS and RS, Internet Explorer, and other third-party tools) connect over OLE DB, ODBC, ADO.NET, and JDBC. The control node runs SQL Server (hosting DW authentication, DW configuration, the DW schema, and TempDB) alongside the SQL parser, the DMS manager, the core engine services, IIS for the admin console, and the MPP Engine Coordinator. Compute nodes run SQL Server holding the user data; the landing zone and backup nodes also run the Data Movement Service.]

MPP Engine Coordinator:
• Provides a single system image
• SQL compilation
• Global metadata and appliance configuration
• Global query optimization and plan generation
• Global query execution coordination
• Global transaction coordination
• Authentication and authorization
• Supportability (hardware and software status)

Data Movement Service (runs on every node):
• Data movement across the appliance
• Distributed query execution operators
SQL Server 2012 DW: Columnstore Indexes
Blazing-Fast Performance
• 57,000 transactions per second¹
• 100,000,000 transactions per day²
• Now, up to 10x faster³
• "400 percent improvement in performance." - First American Title Insurance Company

¹ Source: Microsoft customer evidence, Choice Hotels International
² Source: Microsoft customer evidence, KAS Bank
³ Source: Microsoft customer testing; common data warehousing queries
Columnstore Indexes: Fetch Only Needed Columns

SELECT ProductKey, SUM(SalesAmount)
FROM SalesTable
WHERE OrderDateKey < 20101108

[Diagram: SalesTable is stored as per-column segments (OrderDateKey, StoreKey, RegionKey, ProductKey, Quantity, SalesAmount) in two row groups. The query touches only the ProductKey, SalesAmount, and OrderDateKey segments; the StoreKey, RegionKey, and Quantity segments are never read.]
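A sketch of how such an index could be created on the SalesTable used above; the index name is hypothetical, and note that in SQL Server 2012 the table becomes read-only while a nonclustered columnstore index exists.

```sql
-- SQL Server 2012 nonclustered columnstore index covering the table's columns.
CREATE NONCLUSTERED COLUMNSTORE INDEX csi_SalesTable
ON dbo.SalesTable
    (OrderDateKey, StoreKey, RegionKey, ProductKey, Quantity, SalesAmount);
```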
Columnstore Indexes: Processing Data
• Columnstore vs. rowstore
• New batch processing in the engine
• A columnstore index scan can produce batches or rows
  − Batch-enabled operators get batches
  − Non-batch operators get rows
• The query optimizer decides
[Diagram: a batch object holds column vectors plus a list of qualifying rows.]
SSRS Scale Out: Microsoft SQL Server BI
SSRS Recommended Scale-Out 1 of 3
In a standard scale-out server deployment, multiple report servers share a single report server database. The report server database should be installed on a remote SQL Server instance. The following diagram is an example of a standard scale-out server deployment configuration with the report server database on a remote SQL Server instance.
SSRS Recommended Scale-Out 2 of 3
As another option, you might decide to host the report server database on a SQL Server instance that is part of a failover cluster. The following diagram is an example of a scale-out server deployment configuration where the report server databases are on an instance that is part of a failover cluster.
SSRS Recommended Scale-Out 3 of 3
In addition to the standard scale-out deployment, you might determine that your reporting environment would benefit from a more advanced scale-out deployment configuration. For example, you might decide to use the load-balanced report servers for interactive report processing and add a separate report server computer to process only scheduled reports. The following diagram is an example of this advanced scale-out server deployment configuration.
SSRS Logs
Report Server Execution Log: the report server execution log contains data about specific reports, including when a report was run, who ran it, where it was delivered, and which rendering format was used. The execution log is stored in the report server database.
Report Server Service Trace Log The service trace log contains very detailed information that is useful if you are debugging an application or investigating an issue or event. The file is located at \Microsoft SQL Server\<SQL Server Instance>\Reporting Services\LogFiles.
Report Server HTTP Log
The HTTP log file contains a record of all HTTP requests and responses handled by the Report Server Web service and Report Manager. HTTP logging is not enabled by default. You must modify the ReportingServicesService.exe configuration file to use this feature in your installation. The file is located at \Microsoft SQL Server\<SQL Server Instance>\Reporting Services\LogFiles.
SSAS Backup & Recovery: Microsoft SQL Server BI
SSAS Backups
• Take full backups on as regular a basis as possible
  − Creating a backup file is an I/O-intensive operation. You can improve performance by writing the backup to a high-speed disk drive, or by putting the backup file on a separate drive so that read operations (from users) do not collide with the backup's write operations
• Synchronize to a secondary AS server
• Reprocess the cube from source
• Detach/attach
• In SSMS, right-click an SSAS database and click Script Database to create an XMLA script that can be executed later to re-create the SSAS database. This approach also requires that you reprocess the SSAS database.
• ASCMD
SSAS Deployment Best Practices
• SSAS LOVES RAM
  − 64-bit hardware is recommended
  − Monitor your SSAS instances' memory usage with these perfmon counters (make sure they are SSAS, not MSSQL):
    − MSAS2008:Memory\Memory Usage Kb
    − MSAS2008:Memory\Memory Limit Low Kb
    − MSAS2008:Memory\Memory Limit High Kb
• Schema rowsets (DMVs) useful for exploring SSAS metadata:
  − $SYSTEM.MDSCHEMA_CUBES
  − $SYSTEM.MDSCHEMA_DIMENSIONS
  − $SYSTEM.MDSCHEMA_FUNCTIONS
  − $SYSTEM.MDSCHEMA_HIERARCHIES
  − $SYSTEM.MDSCHEMA_INPUT_DATASOURCES
  − $SYSTEM.MDSCHEMA_KPIS
  − $SYSTEM.MDSCHEMA_LEVELS
  − $SYSTEM.MDSCHEMA_MEASUREGROUP_DIMENSIONS
  − $SYSTEM.MDSCHEMA_MEASUREGROUPS
  − $SYSTEM.MDSCHEMA_MEASURES
  − $SYSTEM.MDSCHEMA_MEMBERS
  − $SYSTEM.MDSCHEMA_PROPERTIES
  − $SYSTEM.MDSCHEMA_SETS
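These schema rowsets can be queried with plain SELECT statements; a small sketch (run it in an MDX query window connected to the SSAS instance, not the relational engine):

```sql
-- Lists each cube and when its data was last refreshed.
SELECT CUBE_NAME, LAST_DATA_UPDATE
FROM $SYSTEM.MDSCHEMA_CUBES;
```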
SSAS General Best Practices
• Turn off Flight Recorder in production
• Align SSAS partitioning with source DB partitions
• Consider the "Lock Pages in Memory" Windows group policy setting for SSAS (and SQL Server, too)
• Partitioning
• Incremental processing
• Synchronize the processing server with the query server via backup/restore or detach/attach
• http://msftasprodsamples.codeplex.com
Network Packet Size Helps for SSIS & SSAS
Under the properties of your data source, increasing the network packet size for SQL Server minimizes the protocol overhead required to build many small packets. The default value for SQL Server 2008 is 4096. With a data warehouse load, a packet size of 32K (in SQL Server, this means assigning the value 32767) can benefit processing. Don't change the value in SQL Server using sp_configure; instead, override it in your data source. This can be set whether you are using TCP/IP or shared memory.
SQL Server Integration Services
Microsoft SQL Server BI
SSIS Performance Counters
SQLServer:SSIS Service
• SSIS Package Instances - total number of simultaneous SSIS packages running
SQLServer:SSIS Pipeline
• BLOB bytes read - total bytes read from binary large objects during the monitoring period
• BLOB bytes written - total bytes written to binary large objects during the monitoring period
• BLOB files in use - number of binary large object files used by the data flow task during the monitoring period
• Buffer memory - the amount of physical or virtual memory used by the data flow task during the monitoring period
• Buffers in use - the number of buffers in use during the data flow task during the monitoring period
• Buffers spooled - the number of buffers written to disk during the data flow task during the monitoring period
• Flat buffer memory - the total number of blocks of memory in use by the data flow task during the monitoring period
• Flat buffers in use - the number of blocks of memory in use by the data flow task at a point in time
• Private buffer memory - the total amount of physical or virtual memory used by data transformation tasks in the data flow engine during the monitoring period
• Private buffers in use - the number of blocks of memory in use by the transformations in the data flow task at a point in time
• Rows read - total number of input rows in use by the data flow task at a point in time
• Rows written - total number of output rows in use by the data flow task at a point in time
"Buffers in use," "Flat buffers in use," and "Private buffers in use" are useful for discovering leaks. During package execution you will see these counters fluctuate, but once the package finishes executing, their values should return to what they were before execution. Otherwise, buffers have leaked.
SSIS Best Practices
• Measure CPU usage for DTExec
  − Process / % Processor Time (Total)
• Measure network throughput
  − Network Interface / Current Bandwidth: provides an estimate of current bandwidth
  − Network Interface / Bytes Total/sec: the rate at which bytes are sent and received over each network adapter
  − Network Interface / Transfers/sec: how many network transfers per second are occurring; if it is approaching 40,000, get another NIC card and use teaming between the NIC cards
• In SSIS, use NOLOCK or TABLOCK hints to remove locking overhead
• Be careful of SSIS package scheduling conflicts
• Move data in bulk instead of row by row
• Consider ELT (Extract, Load & Transform) using staging tables
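The NOLOCK/TABLOCK advice above can be sketched in T-SQL; the table names are hypothetical, and note that NOLOCK trades read consistency for reduced locking overhead.

```sql
-- Bulk-style load into a staging heap with TABLOCK (enables minimal logging
-- under the right recovery model), reading the source with NOLOCK.
INSERT INTO dbo.Stage_FactSales WITH (TABLOCK)
SELECT OrderDateKey, ProductKey, SalesAmount
FROM dbo.FactSales_Source WITH (NOLOCK);
```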
Business Intelligence Architectures
In-memory models & reporting
© 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.