Featherman’s T-SQL Analytics: Comparing GROUP BY ...

More Examples of GROUP BY Querying and introducing T-SQL Pivoting Featherman©

This document provides more examples of GROUP BY() querying, along with examples of different Excel pivot charts, Bing Maps, and Tableau visualizations. Tables, charts, and maps are the common ways to visualize the results of a GROUP BY() query. At a conceptual level, GROUP BY() queries create a wide table of different calculated fields, all about one entity (grouping data on StoreID, EmployeeID, etc.). Other documents will demonstrate how to integrate Excel and SQL data, but for the moment here are some of the different strategies for getting the data from SQL Server into an Excel visualization. There are similar methodologies for Tableau and Power BI Desktop.

Write the query in SSMS and copy the results generated into a sheet in Excel. Next, highlight the data and select Insert | Pivot Chart.

Write the query in SSMS, then in Excel select PowerQuery, paste the T-SQL into PowerQuery, and run it. The dataset will appear and be placed into a table in an Excel sheet.

In Excel, connect to the database and select as many tables as you need, adding the tables to a data model. Then highlight the data and select Insert Pivot Chart, Recommended Chart, or Power Map.

In Excel, click PowerPivot and Get External Data, and either run a custom query or select tables to import to PowerPivot (in the process, you have the option to add/remove columns and filter out rows).

In Excel, add a VBA module and run the SQL statement from the VBA module.

After reviewing GROUP BY() functionality, we compare GROUP BY() and PIVOT() queries within T-SQL. GROUP BY() resultsets and the demonstrated charting are compared with the simpler concept of pivoting data into tables.

When pivoting data you in effect perform a great many GROUP BY() queries, one for every combination of two different dimensions. In the example shown here we are grouping sales by two related dimensions, year and month, so in effect we are running 24 GROUP BY() queries, one for each month shown. You can also pivot data on two unrelated dimensions, such as geography and product sub-category.

We start by reviewing and comparing GROUP BY() and PIVOT() then show examples of each.

GROUP BY () Queries

Used to provide grouped, aggregated, summary calculations based on whatever group is specified. The grouping can be high level (e.g., country), lower level (e.g., state within country, or city within state within country), or more detailed still (e.g., reseller store within state).

GROUP BY queries can provide many columns of aggregated results (such as counts, totals, averages, moving averages, ranks, etc.). The aggregations can be calculated for different grouping levels, where the chosen grouping outputs one row of aggregated data per grouping of dimension attributes (such as geographic region, product line, employee, project, etc.).

If you have one level of grouping (say country) then the totals will be at a high level (CEO level). This minimal grouping may be visualized on an executive’s dashboard. If you have 2 or 3 levels of grouping then you have more rows of more detailed data (e.g., within each country you have customers grouped within cities, where one record per customer is calculated and provided in the dataset). Each customer can have many columns of metrics added.

If you are creating a stored procedure or view that will provide a dataset of query results (aka resultset) for subsequent reporting and pivoting, then you can also include lower levels of grouping (lower granular, more detailed data) in your query to flatten out any hierarchy built into a dimension, showing for example all levels of geographic data (continent, country, region, city) as additional columns in the same query.

Advantages: Can have many different aggregated analytics for any dimension that is being evaluated. Provides an easy solution for commonly needed analytics. Good to build resultsets for reports & spreadsheets.

Can include many columns of aggregated values, which is useful for creating multiple KPI columns that are the datasource for a KPI dashboard.

Disadvantages: GROUP BY will not solve all aggregation needs. Different aggregations may be needed, and the requirement that ALL the non-aggregate columns in the SELECT statement must also be in the GROUP BY clause can restrict analysis, requiring many separate queries (until you learn table variables).
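The points above can be tried without connecting to any database. The following is a minimal sketch, assuming a small made-up table variable (@Sales and its rows are hypothetical and are not part of AdventureWorksDW2012). It shows several aggregate columns per group, and the rule that every non-aggregated column in the SELECT must also appear in the GROUP BY.

```sql
-- Hypothetical sample data; the table and column names are illustrative only.
DECLARE @Sales TABLE (Country varchar(20), City varchar(20), SalesAmount money);

INSERT INTO @Sales (Country, City, SalesAmount) VALUES
    ('United States', 'Seattle',  100),
    ('United States', 'Seattle',  300),
    ('United States', 'Portland', 200),
    ('Canada',        'Toronto',  400);

-- Two levels of grouping: one row per Country/City, with several aggregate columns.
-- Country and City are not aggregated, so both must appear in the GROUP BY.
SELECT Country, City,
       COUNT(*)         AS [# Sales],
       SUM(SalesAmount) AS [Total Sales],
       AVG(SalesAmount) AS [Avg. Sale]
FROM @Sales
GROUP BY Country, City
ORDER BY Country, City;

-- One level of grouping: dropping City from both the SELECT and the GROUP BY
-- gives the higher-level (country) totals.
SELECT Country, SUM(SalesAmount) AS [Total Sales]
FROM @Sales
GROUP BY Country
ORDER BY Country;
```

Run the whole script as a single batch, since a table variable only exists within the batch that declares it. In the first resultset the Seattle row combines two sales into one row; in the second resultset each country collapses to a single total.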

PIVOT ()

Used to aggregate large amounts of data into a cross-tabulated table format. Pivot queries perform A LOT of functionality and are well worth the effort to learn. While a programmer may be tempted to use nested loops to perform calculations for every pair of row and column dimension values, the PIVOT function is very compact and powerful. The measures can be sums, counts, minimums, maximums, standard deviations, and other custom calculations. Pivoting data in general, with Excel and other programs, is very helpful.

Pivot queries provide a static, compacted table of results displaying ONE aggregated value in a tabular format. The aggregation for the ONE calculated value (the fact) is based on all the combinations of two other dimensions. For example, production units can be shown by month and machine, or total units sold can be totaled by city and product line.
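One way to see that a pivot is a GROUP BY performed for each cell is to write the same crosstab both ways. The sketch below assumes a small hypothetical table variable (@Sales and its rows are invented for illustration). The first query uses PIVOT; the second produces the same table with GROUP BY plus one CASE expression per column.

```sql
-- Hypothetical sample data; not part of AdventureWorksDW2012.
DECLARE @Sales TABLE (OrderYear int, [MonthName] varchar(10), SalesAmount money);

INSERT INTO @Sales (OrderYear, [MonthName], SalesAmount) VALUES
    (2006, 'January',  100),
    (2006, 'January',   50),
    (2006, 'February', 200),
    (2007, 'January',  300),
    (2007, 'February', 400);

-- PIVOT: rows = OrderYear, columns = month names, cell = SUM(SalesAmount).
SELECT OrderYear, [January], [February]
FROM (SELECT OrderYear, [MonthName], SalesAmount FROM @Sales) AS BaseData
PIVOT (SUM(SalesAmount) FOR [MonthName] IN ([January], [February])) AS PivotTable
ORDER BY OrderYear;

-- Same crosstab with GROUP BY: each CASE keeps only the rows
-- that belong to one column, and SUM totals them per year.
SELECT OrderYear,
       SUM(CASE WHEN [MonthName] = 'January'  THEN SalesAmount END) AS [January],
       SUM(CASE WHEN [MonthName] = 'February' THEN SalesAmount END) AS [February]
FROM @Sales
GROUP BY OrderYear
ORDER BY OrderYear;
```

Both queries return one row per year with one total per month. The GROUP BY form is longer, but it makes the cell-by-cell aggregation visible.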

Pivot queries are commonly used by DBAs (database administrators) in ETL (Extract, Transform, and Load) routines to pre-compile data before merging. They are very useful for aggregating more detailed (lower granular) transaction data so that it can be merged with higher-grain data (e.g., state level). SSIS has a PIVOT transformation.

Advantages: Compacts a ton of data down into a small footprint table, which is the perfect datasource for a column or line chart. Pivoted data is very commonly the data format needed. You can also compact data from different sources with PIVOT() then you can combine them.

Can aggregate massive amounts of data, which a pivot table in Excel, a Tableau visualization, or an SSRS crosstab report working on local data often cannot handle unless these data visualization tools are connected to a database.

Disadvantages:

Unlike GROUP BY queries, which are pretty intuitive, the PIVOT() code takes a bit of time to get used to. PIVOT queries currently require advanced dynamic SQL to parameterize. Typically you can have only one aggregated measure in the crosstab (i.e., pivot table).
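To give a feel for what "dynamic SQL to parameterize" involves, here is a minimal sketch, not a definitive pattern. It assumes a hypothetical temp table #Sales with invented rows (a temp table is used because dynamic SQL cannot see a table variable declared in the calling batch). The column list for the IN clause is built from the data and the PIVOT statement is then executed as a string.

```sql
-- Hypothetical sample data in a temp table; names and values are illustrative only.
CREATE TABLE #Sales (OrderYear int, [MonthName] varchar(10), SalesAmount money);

INSERT INTO #Sales (OrderYear, [MonthName], SalesAmount) VALUES
    (2006, 'January',  100),
    (2006, 'February', 200),
    (2007, 'January',  300);

DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- Build a comma-separated list of bracketed column names from the data itself.
-- QUOTENAME adds the [ ] delimiters; STUFF removes the leading comma.
SELECT @cols = STUFF((
        SELECT DISTINCT ',' + QUOTENAME([MonthName])
        FROM #Sales
        FOR XML PATH('')
    ), 1, 1, '');

-- Assemble the PIVOT statement as text, using the column list in both places.
SET @sql = N'SELECT OrderYear, ' + @cols + N'
FROM (SELECT OrderYear, [MonthName], SalesAmount FROM #Sales) AS BaseData
PIVOT (SUM(SalesAmount) FOR [MonthName] IN (' + @cols + N')) AS PivotTable
ORDER BY OrderYear;';

EXEC sp_executesql @sql;

DROP TABLE #Sales;
```

The pivoted columns appear in whatever order the column list was built, so a production version would normally sort the months explicitly. The point of the sketch is only that the column list cannot be a parameter of PIVOT itself, which is why the statement has to be built as a string.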

FAQ:

Q1: Why not just use an Excel Pivot table - they are easier to deal with.

Answer: Right, many times an Excel pivot table is the preferred way to go. The analyst can even receive a 1 gigabyte .csv database extract file from the DBA and save it onto their development machine as an Excel 2016 workbook IF THEY USE POWER PIVOT (Power Pivot is a data-warehouse-like tabular storage engine). The analyst can import millions and millions of rows inside the Excel file. As the size of the data to be analyzed grows, however, you may experience Excel out-of-memory errors and crashes. Then it’s time to leave the data in the corporate database where it belongs (and where it is secure, providing the single version of the financial and operations data that all departments write their budgets and reports with). A common managerial malady in US corporate businesses is that different managers and analysts get their own database extract and massage the numbers, and more commonly the data is not kept current, so even day-old reports can contribute to management errors.

If you connect a pivot table to a cloud-based SQL Server (or Oracle) database, the pivot processing should be fast. Of course you have to ask your boss or the DBA to secure user rights to the cloud database, which may be problematic (just ask the DBA for read-only access, and give them a dozen donuts). If you are an analyst more interested in using the results and then deriving more metrics or meaning from the data, then use Excel pivot tables. You can complement the pivot tables and pivot charts with maps, VBA functions, DAX formulas, and more.

The analyst may be able to provide the dashboards, analysis, and managerial recommendations by giving the requester a copy of an Excel workbook that has an OLE DB connection to the cloud-based database, along with an Excel pivot table and pivot charts (you can even use Bing Maps inside Power View in Excel 2013 and later). The database connections can be set to refresh the data whenever the workbook is opened. Distributing Excel spreadsheets that have READ ONLY database connections to corporate data warehouses (intranet or cloud-based) is common in corporate America.

The benefit of adding the PIVOT() SQL programming tool to your toolbox is that you will use it over and over in unforeseen job tasks. PIVOT() is available in both Oracle and SQL Server, and other relational databases typically offer some way to produce the same cross-tabulated result. An analyst or DBA often has to pre-process large amounts of data (especially at corporate HQ), so pivoting data may just be an intermediate step to providing a more complex dashboard and set of reports. You can use PIVOT() inside longer data transformation query processes.

Other times the PIVOT query is used in the SSIS process of building a data warehouse...wherein you implement several data transformation steps to aggregate the data for different levels of management.

WHY NOT JUST USE A GROUP BY QUERY?

Group by queries produce tall tables with many rows that you have to scroll down. The metrics are usually all at the same level of detail (ie state level or store level). A benefit is that you can create 100 different types of calculated fields, adding column after column of calculations all stored in the resultset array. Pivots give a more compact, spreadsheet-like response that managers are accustomed to seeing.

GROUP BY queries do have an advantage, though: they allow you to aggregate different calculations for each group (here a COUNT, SUM, and AVERAGE). Pivots just aggregate one number.

Because so many columns and aggregations can be provided, GROUP BY queries are often used in a first step of data transformation, where data is literally pulled from different fact and dimension tables. GROUP BY queries create very orderly columns of whatever calculation you need, so their results can be a first data transformation that yields good base aggregated data. Next you can often use a pivot table to enable further processing. Data has to be harmonized, meaning it is aggregated at the same layer of business detail; for example, transaction data can be summarized into store, city, and state level calculations, which make great pivot chart reports and Bing Maps or Tableau maps.

Featherman has provided you public access to the abovementioned data warehouse. WSU students (outside Todd Hall) need to open an Internet connection, then a VPN connection (you will have to download and install this software from wsu.edu), then open SQL Server Management Studio and specify server name = cb-ot-devst03.ad.wsu.edu, login = mfstudent, password = BIanalyst. If this methodology does not work, then try to connect to Featherman’s SQL Azure database, which has the same tables: login is mbastudents, password is MbAStud@!1, and server name is mbastudents.database.windows.net. If you use the SQL Azure database, skip the USE [AdventureWorksDW2012] line of code; do not run that line.

Look at another group of great SQL GROUP BY () queries and the graphs and maps that are just a pivot chart away. While reading this document be sure to copy each query into SSMS and run it to verify results. Use the login info above.

Quick Review of the Ease of Data Management using a One Table GROUP BY() query

USE [AdventureWorksDW2012];

SELECT YEAR(OrderDate) AS [OrderYear], MONTH(OrderDate) AS [Month#], DATENAME(MONTH, OrderDate) AS [Month]
, COUNT([SalesOrderNumber]) AS [# Orders]
, FORMAT(AVG([SalesAmount]), 'N0') AS [Avg. Sale for Month]
, FORMAT(SUM([SalesAmount]), 'N0') AS [Monthly Total]
FROM [dbo].[FactResellerSales] AS s
GROUP BY YEAR(OrderDate), MONTH(OrderDate), DATENAME(MONTH, OrderDate)
ORDER BY OrderYear, [Month#]

The query above groups the data from this sample AdventureWorksDW2012 dataset by year and month. The company had sales starting in July 2005.

This is the same data in an Excel pivot chart. The blue template just popped up and was easy to choose. Look how easy it is to drag the Order Year and Month onto the pivot chart and see a line chart that shows three years of data.

Here are more pivot charts that were created from the exact same T-SQL resultset. This viz looks at the seasonality of the data over the three years.

This first query from the same data set shows the seasonality of the data. The analyst must be careful to compare apples to apples, so the data set was pared down to the two years where there was data for all the 12 months of the year. Another spreadsheet improvement is to add a three month moving average.
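The three-month moving average mentioned above is added in the spreadsheet, but it can also be calculated in the query itself. The following is a minimal sketch of that alternative, assuming SQL Server 2012 or later (for the ROWS window frame) and a hypothetical table variable of monthly totals; @Monthly and its values are invented for illustration.

```sql
-- Hypothetical monthly totals; in practice these rows would come from a GROUP BY query.
DECLARE @Monthly TABLE (OrderYear int, MonthNum int, MonthlyTotal money);

INSERT INTO @Monthly (OrderYear, MonthNum, MonthlyTotal) VALUES
    (2006, 1, 100),
    (2006, 2, 200),
    (2006, 3, 300),
    (2006, 4, 400);

-- For each month, average that month with the two months before it.
-- The first two rows average over fewer than three months because
-- no earlier rows exist.
SELECT OrderYear, MonthNum, MonthlyTotal,
       AVG(MonthlyTotal) OVER (ORDER BY OrderYear, MonthNum
                               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS [3-Month Moving Avg]
FROM @Monthly
ORDER BY OrderYear, MonthNum;
```

With these sample rows the moving average for month 3 is 200 (the average of 100, 200, and 300) and for month 4 is 300 (the average of 200, 300, and 400).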

A final addition shown here is the use of a slicer to see the trend in # of sales per month for any single year or selection of years (hold down the control key and select more than one year in the slicer control).

USE [AdventureWorksDW2012];

SELECT YEAR(OrderDate) AS [OrderYear], DATEPART(WEEK, OrderDate) AS [Week#]
, SUM([SalesAmount]) AS [Weekly Sales]
, COUNT(DISTINCT([SalesOrderNumber])) AS [# Sales per Week]
, COUNT([SalesOrderLineNumber]) AS [# Items per Week]
-- Multiply by 1.0 so the average is not truncated by integer division
, 1.0 * COUNT([SalesOrderLineNumber]) / COUNT(DISTINCT([SalesOrderNumber])) AS [Avg. # Items per Reseller Order]
, SUM([SalesAmount]) / COUNT(DISTINCT([SalesOrderNumber])) AS [Avg. Sale For Week]
FROM [dbo].[FactResellerSales]
GROUP BY YEAR(OrderDate), DATEPART(WEEK, OrderDate)
ORDER BY [OrderYear], [Week#]

It is very helpful to look at sales or other metrics by week rather than by month. Also look at the code it takes to calculate the number of orders per week. Look at the sample of pivot charts you can run from this simple query.

Notice the COUNT(DISTINCT(MONTH([OrderDate]))) code. The fact table has >60,000 rows of data. One instance of each month that a sales order was made is counted yielding proof that the data for 2008 is only 6 months. So Sales are not going down, we just have only 6 months of data.

Also notice the COUNT(DISTINCT([SalesOrderNumber])) as [# Sales per Week] code. The resellers in this channel are bike shops that buy many different products on periodic orders, so each sales order spans many rows of the fact table; counting distinct order numbers counts each order only once.

USE [AdventureWorksDW2012];

SELECT [SalesOrderNumber], COUNT([SalesOrderLineNumber]) AS [# Items on Ticket]
FROM [dbo].[FactResellerSales]
GROUP BY [SalesOrderNumber]

This query totals the number of line items on each sales order. This data can be aggregated in many different ways. Counting this way is easy if you want the SalesOrderNumber in the GROUP BY line; if you want to group on something else (such as week), then use the COUNT(DISTINCT()) code from the weekly query above.

USE [AdventureWorksDW2012];

SELECT [StateProvinceName] AS [State], [City], SUM([SalesAmount]) AS [Sales Amount]
FROM [dbo].[FactResellerSales] AS s
INNER JOIN [dbo].[DimReseller] AS r ON r.[ResellerKey] = s.[ResellerKey]
INNER JOIN [dbo].[DimGeography] AS g ON g.[GeographyKey] = r.[GeographyKey]
WHERE [EnglishCountryRegionName] = 'United States'
GROUP BY [EnglishCountryRegionName], [StateProvinceName], [City]
ORDER BY [State]

Here is one of the best reasons to remember how to use GROUP BY() SQL queries. Look how easily you can total sales by city, and use a map inside Excel.

If you copy the resultset from SSMS into Excel and Select Insert Power Map you can create this map very easily.

OK. So by now do you get the picture that analysts use GROUP BY queries all the time? Ok now it’s time to pivot to another topic.

Compare the format of the above GROUP BY query and the results of a T-SQL PIVOT query, below. The main difference is that GROUP BY queries can give back hundreds of rows and many columns, but pivots give back more compacted data. HOWEVER, the drawback is that only one number can be calculated for each cell (like a count or a sum). You typically use only three fields in the SELECT statement: 1) the row heading from a dimension table, 2) the column heading from a different dimension, and 3) the measure that is being aggregated. Take a look at the SELECT statement below, and also notice there is no GROUP BY statement. The pivot means that the group by is being performed for each cell, using the intersection of the row and column attributes to calculate each cell.

USE [AdventureWorksDW2012];

SELECT * FROM
(SELECT YEAR(OrderDate) AS OrderYear, DATENAME(MONTH, OrderDate) AS [MonthName], [SalesAmount]
FROM [dbo].[FactResellerSales]) AS BaseDataTable
PIVOT(SUM([SalesAmount])
FOR [MonthName] IN (January, February, March, April, May, June, July, August, September, October, November, December)
) AS PivotTable

Results are again shown in this pivot table, but only sales are totaled by month and year; there are no counts or average sales for the month, and there aren’t 2 or 3 columns for each month. Move to the next document to learn about PIVOT queries.
