rsim
Sales app example

class Customer < ActiveRecord::Base
  has_many :orders
end

class Order < ActiveRecord::Base
  belongs_to :customer
  has_many :order_items
end

class OrderItem < ActiveRecord::Base
  belongs_to :order
  belongs_to :product
end

class Product < ActiveRecord::Base
  belongs_to :product_class
  has_many :order_items
end

class ProductClass < ActiveRecord::Base
  has_many :products
end
One day the CEO asks a question…
What were the total sales amounts
in California in Q1 2014
by product families?
… in California …
OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  sum("order_items.amount")
… in Q1 2014 …
OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  sum("order_items.amount")
… by product families
OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  sum("order_items.amount")
Generated SQL

OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  sum("order_items.amount")

SELECT SUM(order_items.amount) AS sum_order_items_amount,
       product_classes.product_family AS product_classes_product_family
FROM "order_items"
INNER JOIN "orders" ON "orders"."id" = "order_items"."order_id"
INNER JOIN "customers" ON "customers"."id" = "orders"."customer_id"
INNER JOIN "products" ON "products"."id" = "order_items"."product_id"
INNER JOIN "product_classes" ON "product_classes"."id" = "products"."product_class_id"
WHERE "customers"."country" = 'USA'
  AND "customers"."state_province" = 'CA'
  AND (extract(YEAR FROM orders.order_date) = 2014)
  AND (extract(quarter FROM orders.order_date) = 1)
GROUP BY product_classes.product_family
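In plain Ruby terms, the aggregation this SQL performs is a group-by plus a sum. A minimal sketch with made-up in-memory rows (standing in for the joined result set):

```ruby
# Hypothetical rows, standing in for the joined order_items result set.
rows = [
  { product_family: "Food",  amount: 10.0 },
  { product_family: "Food",  amount: 5.0 },
  { product_family: "Drink", amount: 2.5 }
]

# GROUP BY product_family + SUM(amount), as in the generated SQL.
totals = rows.group_by { |r| r[:product_family] }
             .transform_values { |rs| rs.sum { |r| r[:amount] } }
# totals => { "Food" => 15.0, "Drink" => 2.5 }
```

The database does exactly this, only over millions of rows and with indexes and joins involved, which is why the plan and the storage layout matter so much.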
… and also sales cost?

OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  select("product_classes.product_family," +
    "SUM(order_items.amount) AS sales_amount," +
    "SUM(order_items.cost) AS sales_cost").
  map { |i| i.attributes.compact }

… and unique customers count?

OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  select("product_classes.product_family," +
    "SUM(order_items.amount) AS sales_amount," +
    "SUM(order_items.cost) AS sales_cost," +
    "COUNT(DISTINCT customers.id) AS customers_count").
  map { |i| i.attributes.compact }
Is it clear?
#@%$^&OrderItem.joins(:order => :customer).where("customers.country" => "USA","customers.state_province" => "CA").where("extract(year from orders.order_date)= ?", 2014).where("extract(quarter from orders.order_date)= ?", 1).joins(:product => :product_class).group("product_classes.product_family").select("product_classes.product_family,"+
"SUM(order_items.amount) AS sales_amount,"+"SUM(order_items.cost) AS sales_cost,"+"COUNT(DISTINCT customers.id) AS
customers_count").map{|i| i.attributes.compact}
Performance slows down on larger data volumes
$ rails console
>> OrderItem.count
   (677.0ms) SELECT COUNT(*) FROM "order_items"
=> 6218022
>> Order.count
   (126.0ms) SELECT COUNT(*) FROM "orders"
=> 642362
>> OrderItem.joins(:order => :customer).
     joins(:product => :product_class).
     group("product_classes.product_family").
     select("product_classes.product_family," +
       "SUM(order_items.amount) AS sales_amount," +
       "SUM(order_items.cost) AS sales_cost," +
       "COUNT(DISTINCT customers.id) AS customers_count").
     map { |i| i.attributes.compact }
   OrderItem Load (25437.0ms) ...
6 million rows
25 seconds
Dimensional Modeling
Deliver data that’s understandable to the business users
Deliver fast query performance
Dimensional Modeling
What were the total sales amounts
in California in Q1 2014
by product families?
fact or measure: total sales amounts
Customer / Region dimension: in California
Time dimension: in Q1 2014
Product dimension: by product families
Data Warehouse Models
class Dwh::SalesFact < Dwh::Fact
  belongs_to :customer, class_name: "Dwh::CustomerDimension"
  belongs_to :product, class_name: "Dwh::ProductDimension"
  belongs_to :time, class_name: "Dwh::TimeDimension"
end

class Dwh::CustomerDimension < Dwh::Dimension
  has_many :sales_facts, class_name: "Dwh::SalesFact", foreign_key: "customer_id"
end

class Dwh::ProductDimension < Dwh::Dimension
  has_many :sales_facts, class_name: "Dwh::SalesFact", foreign_key: "product_id"
  belongs_to :product_class, class_name: "Dwh::ProductClassDimension"
end

class Dwh::ProductClassDimension < Dwh::Dimension
  has_many :products, class_name: "Dwh::ProductDimension", foreign_key: "product_class_id"
end

class Dwh::TimeDimension < Dwh::Dimension
  has_many :sales_facts, class_name: "Dwh::SalesFact", foreign_key: "time_id"
end
Load Dimension

class Dwh::CustomerDimension < Dwh::Dimension
  # ...
  def self.truncate!
    connection.execute "TRUNCATE TABLE #{table_name}"
  end

  def self.load!
    truncate!
    column_names = %w(id full_name city state_province country
                      birth_date gender created_at updated_at)
    connection.insert %[
      INSERT INTO #{table_name} (#{column_names.join(',')})
      SELECT #{column_names.join(',')}
      FROM #{::Customer.table_name}
    ]
  end
end
Generate Time Dimension
class Dwh::TimeDimension < Dwh::Dimension
  def self.load!
    connection.select_values(%[
      SELECT DISTINCT order_date FROM #{Order.table_name}
      WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
    ]).each do |date|
      year, month, day = date.year, date.month, date.day
      quarter = ((month - 1) / 3) + 1
      quarter_name = "Q#{quarter} #{year}"
      month_name = date.strftime("%b %Y")
      day_name = date.strftime("%b %d %Y")
      sql = send :sanitize_sql_array, [
        %[
          INSERT INTO #{table_name}
            (id, date_value, year, quarter, month, day,
             year_name, quarter_name, month_name, day_name)
          VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ],
        date_to_id(date), date, year, quarter, month, day,
        year.to_s, quarter_name, month_name, day_name
      ]
      connection.insert sql
    end
  end
end
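The derived time attributes can be checked in isolation. The deck never shows `date_to_id`, but the fact loader's `to_char(order_date, 'YYYYMMDD')` suggests the id is the YYYYMMDD integer; the sketch below assumes that, and reuses the same quarter formula and strftime formats:

```ruby
require "date"

# Assumed implementation of date_to_id: YYYYMMDD integer surrogate key,
# matching to_char(order_date, 'YYYYMMDD') in the fact loader.
def date_to_id(date)
  date.strftime("%Y%m%d").to_i
end

# Same derivations as the time dimension loader, as a pure function.
def time_attributes(date)
  quarter = ((date.month - 1) / 3) + 1
  {
    id:           date_to_id(date),                 # e.g. 20140331
    quarter_name: "Q#{quarter} #{date.year}",       # e.g. "Q1 2014"
    month_name:   date.strftime("%b %Y"),           # e.g. "Mar 2014"
    day_name:     date.strftime("%b %d %Y")         # e.g. "Mar 31 2014"
  }
end
```

Keeping these derivations in a pure method also makes them trivially unit-testable, independent of the database.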
Load Facts

class Dwh::SalesFact < Dwh::Fact
  def self.load!
    truncate!
    connection.insert %[
      INSERT INTO #{table_name}
        (customer_id, product_id, time_id, sales_quantity, sales_amount, sales_cost)
      SELECT o.customer_id, oi.product_id,
        CAST(to_char(o.order_date, 'YYYYMMDD') AS INTEGER),
        oi.quantity, oi.amount, oi.cost
      FROM #{OrderItem.table_name} oi
      INNER JOIN #{Order.table_name} o ON o.id = oi.order_id
    ]
  end
end
What were the total sales amounts
in California in Q1 2014
by product families?
Dwh::SalesFact.
  joins(:customer).joins(:product => :product_class).joins(:time).
  where("d_customers.country" => "USA", "d_customers.state_province" => "CA").
  where("d_time.year" => 2014, "d_time.quarter" => 1).
  group("d_product_classes.product_family").
  sum("sales_amount")
Multi-Dimensional Data Model
[Sales cube diagram: dimensions Time, Product, Customer; measures Sales quantity, Sales amount, Sales cost, Customers count]
Dimension Hierarchies
[Customer hierarchy diagram, levels: All (All Customers) → Country (USA, Canada) → State (WA, CA, OR) → City (San Francisco, Los Angeles)]
Time Dimension

[Default hierarchy diagram, levels: All (All Times) → Year (2014, 2015) → Quarter (Q1, Q2, Q3, Q4) → Month (JUL, AUG, SEP) → Day (AUG 01, AUG 02)]
[Weekly hierarchy diagram, levels: All (All Times) → Year (2014, 2015) → Week (W1, W2, W3, W4) → Day (JAN 17, JAN 18, JAN 19)]
OLAP Technologies
On-Line Analytical Processing
Mondrian
http://community.pentaho.com/projects/mondrian/
https://github.com/rsim/mondrian-olap
mondrian-olap gem
Mondrian::OLAP::Schema.define do
  cube 'Sales' do
    table 'f_sales', schema: 'dwh'

    dimension 'Customer', foreign_key: 'customer_id' do
      hierarchy all_member_name: 'All Customers', primary_key: 'id' do
        table 'd_customers', schema: 'dwh'
        level 'Country', column: 'country'
        level 'State Province', column: 'state_province'
        level 'City', column: 'city'
        level 'Name', column: 'full_name'
      end
    end

    dimension 'Product', foreign_key: 'product_id' do
      hierarchy all_member_name: 'All Products', primary_key: 'id',
                primary_key_table: 'd_products' do
        join left_key: 'product_class_id', right_key: 'id' do
          table 'd_products', schema: 'dwh'
          table 'd_product_classes', schema: 'dwh'
        end
        level 'Product Family', table: 'd_product_classes', column: 'product_family'
        level 'Product Department', table: 'd_product_classes', column: 'product_department'
        level 'Product Category', table: 'd_product_classes', column: 'product_category'
        level 'Product Subcategory', table: 'd_product_classes', column: 'product_subcategory'
        level 'Brand Name', table: 'd_products', column: 'brand_name'
        level 'Product Name', table: 'd_products', column: 'product_name'
      end
    end

    dimension 'Time', foreign_key: 'time_id', type: 'TimeDimension' do
      hierarchy all_member_name: 'All Time', primary_key: 'id' do
        table 'd_time', schema: 'dwh'
        level 'Year', column: 'year', type: 'Numeric',
              name_column: 'year_name', level_type: 'TimeYears'
        level 'Quarter', column: 'quarter', type: 'Numeric',
              name_column: 'quarter_name', level_type: 'TimeQuarters'
        level 'Month', column: 'month', type: 'Numeric',
              name_column: 'month_name', level_type: 'TimeMonths'
        level 'Day', column: 'day', type: 'Numeric',
              name_column: 'day_name', level_type: 'TimeDays'
      end
    end

    measure 'Sales Quantity', column: 'sales_quantity', aggregator: 'sum'
    measure 'Sales Amount', column: 'sales_amount', aggregator: 'sum'
    measure 'Sales Cost', column: 'sales_cost', aggregator: 'sum'
    measure 'Customers Count', column: 'customer_id', aggregator: 'distinct-count'
  end
end
mondrian-olap schema definition
What were the total sales amounts
in California in Q1 2014
by product families?
olap.from("Sales").
  columns("[Measures].[Sales Amount]").
  rows("[Product].[Product Family].Members").
  where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")
MDX Query Language

olap.from("Sales").
  columns("[Measures].[Sales Amount]").
  rows("[Product].[Product Family].Members").
  where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")

SELECT {[Measures].[Sales Amount]} ON COLUMNS,
  [Product].[Product Family].Members ON ROWS
FROM [Sales]
WHERE ([Customer].[USA].[CA], [Time].[Quarter].[Q1 2014])
Results Caching
SELECT {[Measures].[Sales Amount], [Measures].[Sales Cost],
        [Measures].[Customers Count]} ON COLUMNS,
  [Product].[Product Family].Members ON ROWS
FROM [Sales]
(21713.0ms)

SELECT {[Measures].[Sales Amount], [Measures].[Sales Cost],
        [Measures].[Customers Count]} ON COLUMNS,
  [Product].[Product Family].Members ON ROWS
FROM [Sales]
(10.0ms)
Additional Attribute Dimension

dimension 'Gender', foreign_key: 'customer_id' do
  hierarchy all_member_name: 'All Genders', primary_key: 'id' do
    table 'd_customers', schema: 'dwh'
    level 'Gender', column: 'gender' do
      name_expression do
        sql "CASE d_customers.gender WHEN 'F' THEN 'Female' WHEN 'M' THEN 'Male' END"
      end
    end
  end
end
olap.from("Sales").
  columns("[Measures].[Sales Amount]").
  rows("[Gender].[Gender].Members")
Dynamic Attribute Dimension

dimension 'Age interval', foreign_key: 'customer_id' do
  hierarchy all_member_name: 'All Age', primary_key: 'id' do
    table 'd_customers', schema: 'dwh'
    level 'Age interval' do
      key_expression do
        sql %[
          CASE
          WHEN age(d_customers.birth_date) < interval '20 years' THEN '< 20 years'
          WHEN age(d_customers.birth_date) < interval '30 years' THEN '20-30 years'
          WHEN age(d_customers.birth_date) < interval '40 years' THEN '30-40 years'
          WHEN age(d_customers.birth_date) < interval '50 years' THEN '40-50 years'
          ELSE '50+ years'
          END
        ]
      end
    end
  end
end
[Age interval].[<20 years] [Age interval].[20-30 years] [Age interval].[30-40 years] [Age interval].[40-50 years] [Age interval].[50+ years]
Calculation Formulas

calculated_member 'Profit', dimension: 'Measures',
  format_string: '#,##0.00',
  formula: '[Measures].[Sales Amount] - [Measures].[Sales Cost]'

calculated_member 'Margin %', dimension: 'Measures',
  format_string: '#,##0.00%',
  formula: '[Measures].[Profit] / [Measures].[Sales Amount]'
olap.from("Sales").
  columns("[Measures].[Profit]", "[Measures].[Margin %]").
  rows("[Product].[Product Family].Members").
  where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")
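The two calculated members are simple arithmetic over the base measures; Mondrian evaluates them per cell at query time. The same formulas in plain Ruby (hypothetical helper names, just to show the math):

```ruby
# Plain-Ruby equivalents of the two calculated member formulas.
# Profit = Sales Amount - Sales Cost
def profit(sales_amount, sales_cost)
  sales_amount - sales_cost
end

# Margin % = Profit / Sales Amount (as a fraction; the format_string
# '#,##0.00%' is what renders it as a percentage)
def margin_pct(sales_amount, sales_cost)
  profit(sales_amount, sales_cost) / sales_amount
end
```

Because calculated members live in the schema, every client querying the cube gets the same Profit and Margin % definitions for free.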
ETL process

[Diagram: data sources (Database, REST API) → Extract → Transform → Load → Data Warehouse star schema (Measures surrounded by Dimension1, Dimension2, Dimension3, Dimension4)]
Kiba example

# declare a ruby method here, for quick reusable logic
def parse_french_date(date)
  Date.strptime(date, '%d/%m/%Y')
end

# or better, include a ruby file which loads reusable assets
# eg: commonly used sources / destinations / transforms, under unit-test
require_relative 'common'

# declare a source where to take data from (you implement it - see notes below)
source MyCsvSource, 'input.csv'

# declare a row transform to process a given field
transform do |row|
  row[:birth_date] = parse_french_date(row[:birth_date])
  # return to keep in the pipeline
  row
end

# declare another row transform, dismissing rows conditionally by returning nil
transform do |row|
  row[:birth_date].year < 2000 ? row : nil
end

# declare a row transform as a class, which can be tested properly
transform ComplianceCheckTransform, eula: 2015
# before declaring a definition, maybe you'll want to retrieve credentials
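Kiba's row-by-row semantics can be sketched in a few lines of plain Ruby. This is a toy stand-in, not Kiba's actual implementation: rows flow through the transforms in order, and any transform returning nil drops the row from the pipeline:

```ruby
# Toy pipeline runner mimicking Kiba's transform chain semantics.
def run_pipeline(rows, transforms)
  rows.filter_map do |row|
    # Apply transforms in order; once a transform returns nil,
    # the row is dismissed (nil short-circuits the rest).
    transforms.reduce(row) { |r, t| r && t.call(r) }
  end
end

transforms = [
  ->(row) { row.merge(name: row[:name].upcase) },  # field transform
  ->(row) { row[:age] >= 18 ? row : nil }          # conditional dismissal
]

result = run_pipeline(
  [{ name: "ann", age: 30 }, { name: "bob", age: 10 }],
  transforms
)
# result => [{ name: "ANN", age: 30 }]
```

The value of the real framework is the declarative DSL plus reusable, unit-testable source/transform/destination classes around this simple core loop.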
Multithreaded ETL
https://github.com/ruby-concurrency/concurrent-ruby
[Diagram: Data source → Extract thread pool → extracted data → Transform thread pool → transformed data → Load thread pool]

Pro-tip: Use

Single threaded ETL
class Dwh::TimeDimension < Dwh::Dimension
  def self.load!
    logger.silence do
      connection.select_values(%[
        SELECT DISTINCT order_date FROM #{Order.table_name}
        WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
      ]).each do |date|
        insert_date(date)
      end
    end
  end

  def self.insert_date(date)
    year, month, day = date.year, date.month, date.day
    quarter = ((month - 1) / 3) + 1
    quarter_name = "Q#{quarter} #{year}"
    month_name = date.strftime("%b %Y")
    day_name = date.strftime("%b %d %Y")
    sql = send :sanitize_sql_array, [
      %[
        INSERT INTO #{table_name}
          (id, date_value, year, quarter, month, day,
           year_name, quarter_name, month_name, day_name)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
      ],
      date_to_id(date), date, year, quarter, month, day,
      year.to_s, quarter_name, month_name, day_name
    ]
    connection.insert sql
  end
end
require 'concurrent/executors'

class Dwh::TimeDimension < Dwh::Dimension
  def self.parallel_load!(pool_size = 4)
    logger.silence do
      insert_date_pool = Concurrent::FixedThreadPool.new(pool_size)

      connection.select_values(%[
        SELECT DISTINCT order_date FROM #{Order.table_name}
        WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
      ]).each do |date|
        insert_date_pool.post(date) do |date|
          connection_pool.with_connection do
            insert_date(date)
          end
        end
      end

      insert_date_pool.shutdown
      insert_date_pool.wait_for_termination
    end
  end
end
ETL with Thread Pool
Benchmark!
Dwh::TimeDimension.load!               (5236.0ms)
Dwh::TimeDimension.parallel_load!(2)   (3450.0ms)
Dwh::TimeDimension.parallel_load!(4)   (2142.0ms)  ← optimal size in this case
Dwh::TimeDimension.parallel_load!(6)   (2361.0ms)
Dwh::TimeDimension.parallel_load!(8)   (2826.0ms)
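The fan-out pattern behind Concurrent::FixedThreadPool can be sketched with only core Ruby threads and queues. This is a toy stand-in, not the gem's implementation: jobs are queued up front, a fixed number of workers drain the queue, and results are collected thread-safely:

```ruby
# Toy fixed-size worker pool using only core Ruby (Thread + Queue),
# as a stand-in for Concurrent::FixedThreadPool.
def parallel_map(items, pool_size)
  jobs    = Queue.new
  items.each { |i| jobs << i }
  results = Queue.new   # Queue is thread-safe, so workers can push freely

  workers = pool_size.times.map do
    Thread.new do
      loop do
        # Non-blocking pop raises ThreadError when empty: worker is done.
        item = jobs.pop(true) rescue break
        results << yield(item)
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end
```

As in the benchmark above, adding workers only helps until some shared resource (here the database connection pool and the disk) saturates, which is why pool size needs measuring rather than guessing.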
Java Mission Control
Traditional vs Analytical Relational Databases
Optimized for transaction processing
vs.
Optimized for analytical queries
Columnar Storage
http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
Analytical Query Performance

SELECT d_product_classes.product_family,
  SUM(f_sales.sales_amount) AS sales_amount,
  SUM(f_sales.sales_cost) AS sales_cost,
  COUNT(DISTINCT f_sales.customer_id) AS customers_count
FROM "dwh"."f_sales"
INNER JOIN "dwh"."d_products"
  ON "dwh"."d_products"."id" = "dwh"."f_sales"."product_id"
INNER JOIN "dwh"."d_product_classes"
  ON "dwh"."d_product_classes"."id" = "dwh"."d_products"."product_class_id"
GROUP BY d_product_classes.product_family

Traditional database: always ~18 seconds
Analytical columnar database: first run ~9 seconds, next runs ~1.5 seconds
6 million rows
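The layout difference is easy to picture in plain Ruby (made-up data, an illustration rather than a benchmark): summing one measure in a columnar layout touches only that column's contiguous values, while a row layout drags every field of every row through memory:

```ruby
# Row-oriented layout: each record keeps all its fields together.
row_store = [
  { product: "A", amount: 10.0, cost: 6.0 },
  { product: "B", amount: 20.0, cost: 11.0 }
]

# Column-oriented layout: each column stored as one contiguous array.
column_store = {
  product: ["A", "B"],
  amount:  [10.0, 20.0],
  cost:    [6.0, 11.0]
}

row_sum    = row_store.sum { |r| r[:amount] }  # scans whole rows
column_sum = column_store[:amount].sum         # scans one array only
# both sums are 30.0, but the columnar scan reads far less data
```

On disk the effect is the same but much larger: a columnar engine reads only the `amount` blocks (often compressed), which is where the analytical speedup comes from.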
When to use what?
Fact table size | Traditional transactional databases | Analytical columnar databases
< 1M rows       | OK                                  | No big win
1-10M rows      | Complex queries slower              | OK
10-100M rows    | Slow                                | OK
>100M rows      | Very slow                           | OK with tuning
What did we cover?
Problems with analytical queries
Dimensional modeling
Star schemas
Mondrian OLAP and MDX
ETL – Extract, Transform, Load
Analytical columnar databases