Data Warehouses and Multi-Dimensional Data Analysis Raimonds Simanovskis @rsim



Vampires live here

500 km long beach (310.686 miles)

Other vampires live here


Sales app example

class Customer < ActiveRecord::Base
  has_many :orders
end

class Order < ActiveRecord::Base
  belongs_to :customer
  has_many :order_items
end

class OrderItem < ActiveRecord::Base
  belongs_to :order
  belongs_to :product
end

class Product < ActiveRecord::Base
  belongs_to :product_class
  has_many :order_items
end

class ProductClass < ActiveRecord::Base
  has_many :products
end

Database schema

One day the CEO asks a question…

What were the total sales amounts in California in Q1 2014 by product families?

Total sales amount …

OrderItem.sum("amount")

… in California …

OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  sum("order_items.amount")

… in Q1 2014 …

OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  sum("order_items.amount")

… by product families

OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  sum("order_items.amount")

Generated SQL

OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  sum("order_items.amount")

SELECT SUM(order_items.amount) AS sum_order_items_amount,
       product_classes.product_family AS product_classes_product_family
FROM "order_items"
INNER JOIN "orders" ON "orders"."id" = "order_items"."order_id"
INNER JOIN "customers" ON "customers"."id" = "orders"."customer_id"
INNER JOIN "products" ON "products"."id" = "order_items"."product_id"
INNER JOIN "product_classes" ON "product_classes"."id" = "products"."product_class_id"
WHERE "customers"."country" = 'USA'
  AND "customers"."state_province" = 'CA'
  AND (extract(YEAR FROM orders.order_date) = 2014)
  AND (extract(quarter FROM orders.order_date) = 1)
GROUP BY product_classes.product_family

… and also sales cost?

OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  select("product_classes.product_family," +
    "SUM(order_items.amount) AS sales_amount," +
    "SUM(order_items.cost) AS sales_cost").
  map { |i| i.attributes.compact }

… and unique customers count?

OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  select("product_classes.product_family," +
    "SUM(order_items.amount) AS sales_amount," +
    "SUM(order_items.cost) AS sales_cost," +
    "COUNT(DISTINCT customers.id) AS customers_count").
  map { |i| i.attributes.compact }

Is it clear?

#@%$^&

OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  select("product_classes.product_family," +
    "SUM(order_items.amount) AS sales_amount," +
    "SUM(order_items.cost) AS sales_cost," +
    "COUNT(DISTINCT customers.id) AS customers_count").
  map { |i| i.attributes.compact }

Performance slows down on larger data volumes

$ rails console
>> OrderItem.count
   (677.0ms) SELECT COUNT(*) FROM "order_items"
=> 6218022
>> Order.count
   (126.0ms) SELECT COUNT(*) FROM "orders"
=> 642362
>> OrderItem.joins(:order => :customer).
     joins(:product => :product_class).
     group("product_classes.product_family").
     select("product_classes.product_family," +
       "SUM(order_items.amount) AS sales_amount," +
       "SUM(order_items.cost) AS sales_cost," +
       "COUNT(DISTINCT customers.id) AS customers_count").
     map { |i| i.attributes.compact }

OrderItem Load (25437.0ms) ...

6 million rows

25 seconds

You should use NoSQL !


Dimensional Modeling

Deliver data that’s understandable to the business users

Deliver fast query performance

Dimensional Modeling

What were the total sales amounts

in California in Q1 2014

by product families?

[Annotated question: the total sales amount is the fact or measure; "in California", "in Q1 2014", and "by product families" correspond to the Customer/Region, Time, and Product dimensions]

Data Warehouse: “star schema” with fact and dimension tables

“Snowflake schema”

Data Warehouse Models

class Dwh::SalesFact < Dwh::Fact
  belongs_to :customer, class_name: "Dwh::CustomerDimension"
  belongs_to :product, class_name: "Dwh::ProductDimension"
  belongs_to :time, class_name: "Dwh::TimeDimension"
end

class Dwh::CustomerDimension < Dwh::Dimension
  has_many :sales_facts, class_name: "Dwh::SalesFact", foreign_key: "customer_id"
end

class Dwh::ProductDimension < Dwh::Dimension
  has_many :sales_facts, class_name: "Dwh::SalesFact", foreign_key: "product_id"
  belongs_to :product_class, class_name: "Dwh::ProductClassDimension"
end

class Dwh::ProductClassDimension < Dwh::Dimension
  has_many :products, class_name: "Dwh::ProductDimension", foreign_key: "product_class_id"
end

class Dwh::TimeDimension < Dwh::Dimension
  has_many :sales_facts, class_name: "Dwh::SalesFact", foreign_key: "time_id"
end

Load Dimension

class Dwh::CustomerDimension < Dwh::Dimension
  # ...
  def self.truncate!
    connection.execute "TRUNCATE TABLE #{table_name}"
  end

  def self.load!
    truncate!
    column_names = %w(id full_name city state_province country
                      birth_date gender created_at updated_at)
    connection.insert %[
      INSERT INTO #{table_name} (#{column_names.join(',')})
      SELECT #{column_names.join(',')}
      FROM #{::Customer.table_name}
    ]
  end
end

Generate Time Dimension

class Dwh::TimeDimension < Dwh::Dimension
  def self.load!
    connection.select_values(%[
      SELECT DISTINCT order_date FROM #{Order.table_name}
      WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
    ]).each do |date|
      year, month, day = date.year, date.month, date.day
      quarter = ((month - 1) / 3) + 1
      quarter_name = "Q#{quarter} #{year}"
      month_name = date.strftime("%b %Y")
      day_name = date.strftime("%b %d %Y")
      sql = send :sanitize_sql_array, [
        %[
          INSERT INTO #{table_name}
            (id, date_value, year, quarter, month, day,
             year_name, quarter_name, month_name, day_name)
          VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ],
        date_to_id(date), date, year, quarter, month, day,
        year.to_s, quarter_name, month_name, day_name
      ]
      connection.insert sql
    end
  end
end
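The loader calls a `date_to_id` helper that the slides never show. A plausible sketch (an assumption, chosen so that it matches the fact loader's `CAST(to_char(o.order_date, 'YYYYMMDD') AS INTEGER)` on a later slide) is:

```ruby
require "date"

# Hypothetical implementation of the date_to_id helper (not shown in the
# slides): encode a date as an integer surrogate key in YYYYMMDD form, so
# the fact loader's CAST(to_char(order_date, 'YYYYMMDD') AS INTEGER)
# produces matching time_id values.
def date_to_id(date)
  date.strftime("%Y%m%d").to_i
end

date_to_id(Date.new(2014, 2, 14)) # => 20140214
```

Using the date itself as the key makes fact loading a pure SQL transformation, with no lookup against the time dimension table.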

Load Facts

class Dwh::SalesFact < Dwh::Fact
  def self.load!
    truncate!
    connection.insert %[
      INSERT INTO #{table_name}
        (customer_id, product_id, time_id, sales_quantity, sales_amount, sales_cost)
      SELECT o.customer_id, oi.product_id,
             CAST(to_char(o.order_date, 'YYYYMMDD') AS INTEGER),
             oi.quantity, oi.amount, oi.cost
      FROM #{OrderItem.table_name} oi
      INNER JOIN #{Order.table_name} o ON o.id = oi.order_id
    ]
  end
end

What were the total sales amounts in California in Q1 2014 by product families?

Dwh::SalesFact.
  joins(:customer).joins(:product => :product_class).joins(:time).
  where("d_customers.country" => "USA", "d_customers.state_province" => "CA").
  where("d_time.year" => 2014, "d_time.quarter" => 1).
  group("d_product_classes.product_family").
  sum("sales_amount")

Two-Dimensional Table

[Diagram: rows and columns intersecting in cells]

Multi-Dimensional Data Model

[Diagram: "Data cube" with three dimension axes; measures are stored in the cells]

Multi-Dimensional Data Model

[Diagram: "Sales cube" with Time, Product, and Customer dimensions; measures: Sales quantity, Sales amount, Sales cost, Customers count]

Dimension Hierarchies

Levels: All → Country → State → City

All Customers
  USA
    WA
    CA
      San Francisco
      Los Angeles
    OR
  Canada

Time Dimension

Default hierarchy (levels: All → Year → Quarter → Month → Day):

All Times
  2014, 2015
    Q1, Q2, Q3, Q4
      JUL, AUG, SEP
        AUG 01, AUG 02

Weekly hierarchy (levels: All → Year → Week → Day):

All Times
  2014, 2015
    W1, W2, W3, W4
      JAN 17, JAN 18, JAN 19
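The level attributes of both hierarchies can be derived from a plain date. A minimal sketch (hypothetical helper name, reusing the quarter formula from the TimeDimension loader):

```ruby
require "date"

# Sketch: derive attributes for every level of the default (year, quarter,
# month, day) and weekly (year, week, day) hierarchies from a single date.
def time_attributes(date)
  quarter = ((date.month - 1) / 3) + 1   # same formula as the loader
  {
    year:         date.year,
    quarter_name: "Q#{quarter} #{date.year}",
    month_name:   date.strftime("%b %Y"),
    week:         date.cweek,            # ISO week number, for the weekly hierarchy
    day_name:     date.strftime("%b %d %Y")
  }
end

time_attributes(Date.new(2014, 2, 14))
# includes quarter_name "Q1 2014", month_name "Feb 2014", week 7
```

Because both hierarchies are functions of the date, they can share the same dimension table rows.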

OLAP Technologies (On-Line Analytical Processing)

Mondrian http://community.pentaho.com/projects/mondrian/

https://github.com/rsim/mondrian-olap

mondrian-olap gem

Mondrian::OLAP::Schema.define do
  cube 'Sales' do
    table 'f_sales', schema: 'dwh'

    dimension 'Customer', foreign_key: 'customer_id' do
      hierarchy all_member_name: 'All Customers', primary_key: 'id' do
        table 'd_customers', schema: 'dwh'
        level 'Country', column: 'country'
        level 'State Province', column: 'state_province'
        level 'City', column: 'city'
        level 'Name', column: 'full_name'
      end
    end

    dimension 'Product', foreign_key: 'product_id' do
      hierarchy all_member_name: 'All Products', primary_key: 'id',
                primary_key_table: 'd_products' do
        join left_key: 'product_class_id', right_key: 'id' do
          table 'd_products', schema: 'dwh'
          table 'd_product_classes', schema: 'dwh'
        end
        level 'Product Family', table: 'd_product_classes', column: 'product_family'
        level 'Product Department', table: 'd_product_classes', column: 'product_department'
        level 'Product Category', table: 'd_product_classes', column: 'product_category'
        level 'Product Subcategory', table: 'd_product_classes', column: 'product_subcategory'
        level 'Brand Name', table: 'd_products', column: 'brand_name'
        level 'Product Name', table: 'd_products', column: 'product_name'
      end
    end

    dimension 'Time', foreign_key: 'time_id', type: 'TimeDimension' do
      hierarchy all_member_name: 'All Time', primary_key: 'id' do
        table 'd_time', schema: 'dwh'
        level 'Year', column: 'year', type: 'Numeric',
          name_column: 'year_name', level_type: 'TimeYears'
        level 'Quarter', column: 'quarter', type: 'Numeric',
          name_column: 'quarter_name', level_type: 'TimeQuarters'
        level 'Month', column: 'month', type: 'Numeric',
          name_column: 'month_name', level_type: 'TimeMonths'
        level 'Day', column: 'day', type: 'Numeric',
          name_column: 'day_name', level_type: 'TimeDays'
      end
    end

    measure 'Sales Quantity', column: 'sales_quantity', aggregator: 'sum'
    measure 'Sales Amount', column: 'sales_amount', aggregator: 'sum'
    measure 'Sales Cost', column: 'sales_cost', aggregator: 'sum'
    measure 'Customers Count', column: 'customer_id', aggregator: 'distinct-count'
  end
end

mondrian-olap schema definition

What were the total sales amounts in California in Q1 2014 by product families?

olap.from("Sales").
  columns("[Measures].[Sales Amount]").
  rows("[Product].[Product Family].Members").
  where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")

MDX Query Language

olap.from("Sales").
  columns("[Measures].[Sales Amount]").
  rows("[Product].[Product Family].Members").
  where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")

SELECT {[Measures].[Sales Amount]} ON COLUMNS,
       [Product].[Product Family].Members ON ROWS
FROM [Sales]
WHERE ([Customer].[USA].[CA], [Time].[Quarter].[Q1 2014])

Results Caching

SELECT {[Measures].[Sales Amount], [Measures].[Sales Cost],
        [Measures].[Customers Count]} ON COLUMNS,
       [Product].[Product Family].Members ON ROWS
FROM [Sales]
  (21713.0ms)

SELECT {[Measures].[Sales Amount], [Measures].[Sales Cost],
        [Measures].[Customers Count]} ON COLUMNS,
       [Product].[Product Family].Members ON ROWS
FROM [Sales]
  (10.0ms)

Additional Attribute Dimension

dimension 'Gender', foreign_key: 'customer_id' do
  hierarchy all_member_name: 'All Genders', primary_key: 'id' do
    table 'd_customers', schema: 'dwh'
    level 'Gender', column: 'gender' do
      name_expression do
        sql "CASE d_customers.gender WHEN 'F' THEN 'Female' WHEN 'M' THEN 'Male' END"
      end
    end
  end
end

olap.from("Sales").
  columns("[Measures].[Sales Amount]").
  rows("[Gender].[Gender].Members")

Dynamic Attribute Dimension

dimension 'Age interval', foreign_key: 'customer_id' do
  hierarchy all_member_name: 'All Age', primary_key: 'id' do
    table 'd_customers', schema: 'dwh'
    level 'Age interval' do
      key_expression do
        sql %[
          CASE
            WHEN age(d_customers.birth_date) < interval '20 years' THEN '< 20 years'
            WHEN age(d_customers.birth_date) < interval '30 years' THEN '20-30 years'
            WHEN age(d_customers.birth_date) < interval '40 years' THEN '30-40 years'
            WHEN age(d_customers.birth_date) < interval '50 years' THEN '40-50 years'
            ELSE '50+ years'
          END
        ]
      end
    end
  end
end

[Age interval].[< 20 years]
[Age interval].[20-30 years]
[Age interval].[30-40 years]
[Age interval].[40-50 years]
[Age interval].[50+ years]

Calculation Formulas

calculated_member 'Profit', dimension: 'Measures',
  format_string: '#,##0.00',
  formula: '[Measures].[Sales Amount] - [Measures].[Sales Cost]'

calculated_member 'Margin %', dimension: 'Measures',
  format_string: '#,##0.00%',
  formula: '[Measures].[Profit] / [Measures].[Sales Amount]'

olap.from("Sales").
  columns("[Measures].[Profit]", "[Measures].[Margin %]").
  rows("[Product].[Product Family].Members").
  where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")
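Per cell, these two calculated members reduce to simple arithmetic on the base measures. In plain Ruby terms (illustrative helpers, not the Mondrian API; Mondrian evaluates the formulas per cell at query time and stores nothing):

```ruby
# Profit and Margin % as evaluated for one cell's sales amount and cost.
def profit(sales_amount, sales_cost)
  sales_amount - sales_cost
end

def margin(sales_amount, sales_cost)
  profit(sales_amount, sales_cost) / sales_amount.to_f
end

margin(200.0, 150.0) # => 0.25, which '#,##0.00%' renders as 25.00%
```

Note that `Margin %` references `Profit`, so calculated members can build on each other.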

Enables Ad-hoc Queries by Users

ETL process

[Diagram: data flows from a database and a REST API through Extract, Transform, and Load steps into the data warehouse cube of measures and dimensions]

Ruby Tools for ETL

Kiba http://www.kiba-etl.org/

ETL https://github.com/square/ETL

Kiba example

# declare a ruby method here, for quick reusable logic
def parse_french_date(date)
  Date.strptime(date, '%d/%m/%Y')
end

# or better, include a ruby file which loads reusable assets # eg: commonly used sources / destinations / transforms, under unit-test require_relative 'common'

# declare a source where to take data from (you implement it - see notes below) source MyCsvSource, 'input.csv'

# declare a row transform to process a given field
transform do |row|
  row[:birth_date] = parse_french_date(row[:birth_date])
  # return to keep in the pipeline
  row
end

# declare another row transform, dismissing rows conditionally by returning nil
transform do |row|
  row[:birth_date].year < 2000 ? row : nil
end

# declare a row transform as a class, which can be tested properly transform ComplianceCheckTransform, eula: 2015

# before declaring a definition, maybe you'll want to retrieve credentials
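The pipeline semantics in the example can be mimicked in a few lines of plain Ruby: each transform receives a row and returns it (possibly modified), and returning nil drops the row. A toy approximation, not the actual Kiba API:

```ruby
# Toy version of Kiba's row pipeline (not the real Kiba API): run each
# row through every transform in order; a nil return drops the row.
def run_pipeline(rows, transforms)
  rows.map { |row| transforms.reduce(row) { |r, t| r && t.call(r) } }.compact
end

transforms = [
  ->(row) { row.merge(name: row[:name].strip) },   # normalize a field
  ->(row) { row[:age] >= 18 ? row : nil }          # dismiss rows conditionally
]

run_pipeline([{ name: " Ann ", age: 30 }, { name: "Bob", age: 12 }], transforms)
# the second row is dropped; only the normalized Ann row survives
```

The value of the real library is that sources, transforms, and destinations are declared separately and can be unit-tested in isolation.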

Multithreaded ETL

https://github.com/ruby-concurrency/concurrent-ruby

[Diagram: data source → Extract thread pool → extracted data → Transform thread pool → transformed data → Load thread pool]

Pro-tip: use concurrent-ruby

Single-threaded ETL

class Dwh::TimeDimension < Dwh::Dimension
  def self.load!
    logger.silence do
      connection.select_values(%[
        SELECT DISTINCT order_date FROM #{Order.table_name}
        WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
      ]).each do |date|
        insert_date(date)
      end
    end
  end

  def self.insert_date(date)
    year, month, day = date.year, date.month, date.day
    quarter = ((month - 1) / 3) + 1
    quarter_name = "Q#{quarter} #{year}"
    month_name = date.strftime("%b %Y")
    day_name = date.strftime("%b %d %Y")
    sql = send :sanitize_sql_array, [
      %[
        INSERT INTO #{table_name}
          (id, date_value, year, quarter, month, day,
           year_name, quarter_name, month_name, day_name)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
      ],
      date_to_id(date), date, year, quarter, month, day,
      year.to_s, quarter_name, month_name, day_name
    ]
    connection.insert sql
  end
end

require 'concurrent/executors'

class Dwh::TimeDimension < Dwh::Dimension
  def self.parallel_load!(pool_size = 4)
    logger.silence do
      insert_date_pool = Concurrent::FixedThreadPool.new(pool_size)

      connection.select_values(%[
        SELECT DISTINCT order_date FROM #{Order.table_name}
        WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
      ]).each do |date|
        insert_date_pool.post(date) do |date|
          connection_pool.with_connection do
            insert_date(date)
          end
        end
      end

      insert_date_pool.shutdown
      insert_date_pool.wait_for_termination
    end
  end
end

ETL with Thread Pool
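The same fan-out pattern can be shown without the gem, using only Thread and Queue from the standard library (a simplified sketch of what a fixed thread pool does, not the concurrent-ruby implementation):

```ruby
# Fixed-size worker pool with only the standard library: jobs go onto a
# Queue, pool_size threads drain it, and one :stop sentinel per worker
# shuts the pool down after all jobs are enqueued.
def parallel_each(items, pool_size: 4)
  queue = Queue.new
  items.each { |item| queue << item }
  pool_size.times { queue << :stop }

  Array.new(pool_size) do
    Thread.new do
      while (item = queue.pop) != :stop
        yield item   # e.g. insert_date(item), each worker with its own DB connection
      end
    end
  end.each(&:join)
end

squares = Queue.new
parallel_each(1..10, pool_size: 4) { |i| squares << i * i }
```

As in `parallel_load!`, each worker must check out its own database connection; ActiveRecord connections are not safe to share across threads.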

Benchmark!

Dwh::TimeDimension.load!             (5236.0ms)
Dwh::TimeDimension.parallel_load!(2) (3450.0ms)
Dwh::TimeDimension.parallel_load!(4) (2142.0ms)
Dwh::TimeDimension.parallel_load!(6) (2361.0ms)
Dwh::TimeDimension.parallel_load!(8) (2826.0ms)

optimal pool size in this case

Java Mission Control

Traditional vs Analytical Relational Databases

Optimized for transaction processing

Optimized for analytical queries

Row-based Storage

Columnar Storage

http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html

Analytical Query Performance

SELECT d_product_classes.product_family,
       SUM(f_sales.sales_amount) AS sales_amount,
       SUM(f_sales.sales_cost) AS sales_cost,
       COUNT(DISTINCT f_sales.customer_id) AS customers_count
FROM "dwh"."f_sales"
INNER JOIN "dwh"."d_products"
  ON "dwh"."d_products"."id" = "dwh"."f_sales"."product_id"
INNER JOIN "dwh"."d_product_classes"
  ON "dwh"."d_product_classes"."id" = "dwh"."d_products"."product_class_id"
GROUP BY d_product_classes.product_family

Row-based storage: always ~18 seconds

Columnar storage: first run ~9 seconds, subsequent runs ~1.5 seconds

6 million rows

When to use what?

Fact table size | Traditional transactional databases | Analytical columnar databases
< 1M rows       | OK                                  | No big win
1-10M rows      | Complex queries slower              | OK
10-100M rows    | Slow                                | OK
> 100M rows     | Very slow                           | OK with tuning

What did we cover?

Problems with analytical queries
Dimensional modeling
Star schemas
Mondrian OLAP and MDX
ETL – Extract, Transform, Load
Analytical columnar databases

Questions?

[email protected] @rsim github.com/rsim

https://github.com/rsim/sales_app_demo