The Thinking Person’s Guide to Data Warehouse Design

Page 1: The Thinking Person’s Guide to Data Warehouse Design

The Thinking Person’s Guide to Data Warehouse Design

Robin Schumacher
Director of Product Strategy

www.enterprisedb.com

Page 2: The Thinking Person’s Guide to Data Warehouse Design

Building a Logical Design
Transitioning to the Physical
Monitoring for Success

Page 3: The Thinking Person’s Guide to Data Warehouse Design

Building a Logical Design

Page 4: The Thinking Person’s Guide to Data Warehouse Design

Why Care About Design…?

Page 5: The Thinking Person’s Guide to Data Warehouse Design

What is the key component for success?

In other words, what you do with your PostgreSQL server, in terms of physical design, schema design, and performance design, will be the biggest factor in whether a BI system hits the mark…

* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.

Page 6: The Thinking Person’s Guide to Data Warehouse Design

The #1 Cause of Database Downtime…?

Bad Design…

Source: Oracle Corporation

Your Database

Page 7: The Thinking Person’s Guide to Data Warehouse Design

First Recommendation – Use a Modeling Tool

Page 8: The Thinking Person’s Guide to Data Warehouse Design

A Logical Design for OLTP Isn’t a Logical Design for a Data Warehouse

Page 9: The Thinking Person’s Guide to Data Warehouse Design

Simple reporting databases

Just use the same design on a different box…

[Diagram labels: End Users; Application Servers; OLTP Database; Read Shard One; Reporting Database; Replication; ETL]

Page 10: The Thinking Person’s Guide to Data Warehouse Design

Data Warehouse Horror Story Number One…

Page 11: The Thinking Person’s Guide to Data Warehouse Design

The logical design for analytics/data warehousing

Page 12: The Thinking Person’s Guide to Data Warehouse Design

• Datatypes are defined more generally and are not directed toward a specific database engine. Still, choose them carefully
• Entities aren’t necessarily designed for performance
• Redundancy is avoided, but simplicity is still a goal
• Bottom line: you want to make sure your data is correctly represented and easily understood (there is a new class of user today)

Page 13: The Thinking Person’s Guide to Data Warehouse Design

Manual horizontal partitioning

Modeling technique to overcome large data volumes

Page 14: The Thinking Person’s Guide to Data Warehouse Design

Manual Vertical Partitioning
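
As a rough illustration of manual vertical partitioning, the Python sketch below splits wide rows into a frequently read "hot" table and a rarely read "cold" table that share the same key. The column names and the choice of hot columns are hypothetical; in a real model the split would come from observed access patterns.

```python
def split_vertically(rows, key, hot_columns):
    """Split wide rows into a 'hot' and a 'cold' table that share the same key."""
    hot, cold = {}, {}
    for row in rows:
        k = row[key]
        hot[k] = {c: row[c] for c in hot_columns}
        cold[k] = {c: v for c, v in row.items()
                   if c != key and c not in hot_columns}
    return hot, cold


def join_row(hot, cold, k, key="customer_id"):
    """Rebuild the full row; this is the join cost that the split introduces."""
    return {key: k, **hot[k], **cold[k]}


# Hypothetical wide rows: two small columns plus two bulky ones.
customers = [
    {"customer_id": 1, "name": "Ann", "region": "EMEA",
     "notes": "long text...", "photo": b"..."},
    {"customer_id": 2, "name": "Bob", "region": "APAC",
     "notes": "more text...", "photo": b"..."},
]
hot, cold = split_vertically(customers, "customer_id", ["name", "region"])
```

Queries that touch only `name` and `region` read the small table; any query that needs a column from each side has to pay for `join_row`.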

Page 15: The Thinking Person’s Guide to Data Warehouse Design

Pros and cons of manual partitioning

Pros
• Less I/O if the design holds up
• Easy to prune obsolete data
• Possibly less object contention

Cons
• More tables to manage
• More referential integrity to manage
• More indexes to manage
• Joins are often needed to satisfy query requests
• A redesign is often needed because the rows / columns you thought you’d be accessing together change; it’s hard to predict ad-hoc query traffic

Page 16: The Thinking Person’s Guide to Data Warehouse Design

Use a modeling tool
Don’t use 3rd normal form
Manual partition but…
Let the DB do the heavy lifting

Page 17: The Thinking Person’s Guide to Data Warehouse Design

Transitioning to a Physical Design

Page 18: The Thinking Person’s Guide to Data Warehouse Design

SQL or NoSQL…?

Row or Column database…?

How to scale…?

Should I worry about High availability…?

Index or no…?

How should I partition my data…?

Is sharding a good idea…?

Page 19: The Thinking Person’s Guide to Data Warehouse Design

What technologies you should be looking at

* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.

Page 20: The Thinking Person’s Guide to Data Warehouse Design

• Whether you choose to go NoSQL, shard, use MPP databases, or something similar, divide & conquer is your best friend

• You can scale up and divide & conquer to a point, but you will eventually hit disk, memory, or other limitations

• Scaling both up and out is the best future-proof approach

Page 21: The Thinking Person’s Guide to Data Warehouse Design

Divide & Conquer via Sharding
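
A minimal sketch of the divide-and-conquer idea, assuming hash-based sharding on a customer key. The function names, the shard count, and the sample data are all hypothetical; real sharding layers also handle rebalancing, replication, and cross-shard joins, which are left out here.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard number using a stable hash.

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used to keep the routing repeatable.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def scatter(rows, key, shard_count):
    """Distribute rows so that all rows with the same key land on one shard."""
    shards = [[] for _ in range(shard_count)]
    for row in rows:
        shards[shard_for(row[key], shard_count)].append(row)
    return shards


def gather_sum(shards, column):
    """Each shard computes a partial sum; the coordinator adds the partials."""
    return sum(sum(row[column] for row in shard) for shard in shards)


# Hypothetical data: 1000 orders spread over 50 customers and 4 shards.
orders = [{"customer_id": i % 50, "amount": i} for i in range(1000)]
shards = scatter(orders, "customer_id", 4)
total = gather_sum(shards, "amount")
```

Each shard only works on its own slice of the data, which is what lets the work run in parallel across machines.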

Page 22: The Thinking Person’s Guide to Data Warehouse Design

Row or Column-Based Database?

A column-oriented architecture looks the same on the surface, but stores data differently than legacy/row-based databases…
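
To make the storage difference concrete, the sketch below holds the same hypothetical table in a row layout (a list of records) and in a column layout (one list per column). It is an in-memory analogy only; real column stores add compression and on-disk structures that are not modeled here.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "customer": "Ann", "amount": 120.0, "region": "EMEA"},
    {"order_id": 2, "customer": "Bob", "amount": 80.0, "region": "APAC"},
    {"order_id": 3, "customer": "Cy", "amount": 45.5, "region": "EMEA"},
]


def to_columns(rows):
    """Pivot row records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record is visited to total a single column.
row_total = sum(row["amount"] for row in rows)

# Column layout: only the 'amount' values are touched.
column_total = sum(columns["amount"])
```

Both totals are the same; the difference is how much unrelated data has to be read to compute them.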

Page 23: The Thinking Person’s Guide to Data Warehouse Design

Row or Column-Based Database…?

Yes, row-based database!
• Will need most columns in a table for a query
• Will be doing lots of single inserts/deletes
• Small-medium data
• Know exactly what to index; won’t change

Yes, column-based database!
• Only need a subset of columns for a query
• Need very fast loads; little DML
• Medium-very large data
• Very dynamic; query patterns change

Page 24: The Thinking Person’s Guide to Data Warehouse Design

Example: Column DB vs. “Leading” row DB

InfiniDB takes up 22% less space
InfiniDB loaded data 22% faster
InfiniDB total query times were 65% less
InfiniDB average query times were 59% less

Notice that not only are the queries faster, but they are also more predictable

* Tests run on standalone machine: 16 CPU, 16GB RAM, CentOS 5.4 with 2TB of raw data

Page 25: The Thinking Person’s Guide to Data Warehouse Design

Hybrid Row / Column Databases

Some vendors now give you a choice of table format – row or column – based on expected access patterns.

Page 26: The Thinking Person’s Guide to Data Warehouse Design

What about NoSQL options?

• Standard model is not relational
• Typically don’t use SQL to access the data
• Take up more space than column databases
• Lack special optimizers / features to reduce I/O

Page 27: The Thinking Person’s Guide to Data Warehouse Design

What about NoSQL options?

• Really are row-oriented architectures that store data in ‘column families’, which are expected to be accessed together (remember logical vertical partitioning?). Individual columns cannot be accessed independently
• Will be faster with individual insert and delete operations
• Will normally be faster with single-row requests
• Will lag in typical analytic / data warehouse use cases

Page 28: The Thinking Person’s Guide to Data Warehouse Design

PostgreSQL Specific - Partitioning

• Main goal: reduce I/O via partitioning
• Partitioning in PostgreSQL is somewhat more laborious than in other RDBMSs
• Consider it when table size exceeds memory capacity
• The partitioning key is ‘key’ for many reasons
• Have seen > 90% response time reductions when done right
• Partitioning also makes data pruning more efficient than typical DELETE operations
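
The two benefits above (reading less data and pruning old data cheaply) can be sketched with a toy range-partitioned table in Python. This is an analogy rather than PostgreSQL code: the class, the monthly partitioning scheme, and the `sold_on` column are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


class PartitionedTable:
    """Toy range-partitioned table keyed by (year, month) of a date column."""

    def __init__(self, key):
        self.key = key
        self.partitions = defaultdict(list)

    def insert(self, row):
        d = row[self.key]
        self.partitions[(d.year, d.month)].append(row)

    def scan_month(self, year, month):
        # Only the matching partition is read; all others are skipped.
        return list(self.partitions.get((year, month), []))

    def drop_partition(self, year, month):
        # Old data disappears in one step instead of a row-by-row DELETE.
        return len(self.partitions.pop((year, month), []))


# Hypothetical data: 28 days of sales in each of three months.
sales = PartitionedTable("sold_on")
for month in (1, 2, 3):
    for day in range(1, 29):
        sales.insert({"sold_on": date(2011, month, day), "amount": day})

february = sales.scan_month(2011, 2)
dropped = sales.drop_partition(2011, 1)
```

A query for February touches one third of the rows, and retiring January never scans the remaining months, which is why the choice of partitioning key matters so much.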

Page 29: The Thinking Person’s Guide to Data Warehouse Design

PostgreSQL Specific - GridSQL

• One option for a divide-and-conquer strategy
• Does have limitations with respect to PostgreSQL feature and syntax support
• One customer of EnterpriseDB is running well with 8TB

Page 30: The Thinking Person’s Guide to Data Warehouse Design

What About Indexing?

• If query patterns are known and predictable, and data is relatively static, then indexing isn’t that difficult
• In a very ad-hoc environment, indexing becomes more difficult. You must analyze SQL traffic and index the best you can
• Over-indexing a table that is frequently loaded / refreshed / updated can severely impact load and DML performance. Test dropping and re-creating indexes vs. doing in-place loads and DML. Realize, though, that any queries will be impacted while the indexes are dropped
• Remember that a benefit of (most) column databases is that they do not need or use indexes

Page 31: The Thinking Person’s Guide to Data Warehouse Design

Optimizing for Data Loads

• The two biggest killers of load performance are (1) very wide tables, for row-based tables; (2) many indexes or foreign keys on a table

• Column-based tables typically load faster than row-based tables with load utilities; however, they will experience slower insert/delete rates than row-based tables

• Move the data as close to the database as possible; avoid having applications on remote machines do data manipulations and send data across the wire a row at a time, which is perhaps the worst way to load data

• It is often good to create staging tables, then use a procedural language to do data modifications and/or create flat files for high-speed loaders

• PostgreSQL COPY is much faster than INSERT; EnterpriseDB’s EDB*Loader is faster than COPY
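
As one possible shape for the flat-file approach, the Python sketch below renders staged rows as CSV text that a bulk loader could consume. The table name, column names, and file path in the trailing comment are hypothetical, and the script only builds the file contents; it does not connect to a database.

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Render staged rows as CSV text (with a header) for a bulk loader."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([row[c] for c in columns])
    return buffer.getvalue()


# Hypothetical staged data.
staged = [
    {"sale_id": 1, "region": "EMEA", "amount": 120.5},
    {"sale_id": 2, "region": "APAC", "amount": 80.0},
]
flat_file = rows_to_csv(staged, ["sale_id", "region", "amount"])

# Once written to disk, the file could be loaded with a single statement
# instead of one INSERT per row, for example (names are hypothetical):
#   COPY sales FROM '/data/sales.csv' WITH (FORMAT csv, HEADER true);
```

Building the file next to the database and loading it in one pass avoids the per-row network round trips described above.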

Page 32: The Thinking Person’s Guide to Data Warehouse Design

Optimizing for Data Loads

• Increasing maintenance_work_mem to a large value (e.g. > 1GB) helps speed up index and foreign key constraint creation

• Turning autovacuum off can help speed up load operations
• Turning off synchronous commit (synchronous_commit) can help improve load efficiency, but generally use it only for all-or-nothing use cases

• Minimizing checkpoint I/O is a good idea (set checkpoint_segments to 100-200 and checkpoint_timeout to a higher value, such as 1 hour or so).
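
Two of the settings above, maintenance_work_mem and synchronous_commit, can be changed for a single session, so a load script can apply them without touching the server configuration. The helper below is a hypothetical sketch that only builds the SET statements; the function name and defaults are assumptions. The autovacuum and checkpoint settings are server-level and are not covered, and note that checkpoint_segments was replaced by max_wal_size starting with PostgreSQL 9.5.

```python
def load_session_settings(maintenance_work_mem="1GB", synchronous_commit=False):
    """Build SET statements to run at the start of a bulk-load session."""
    commit_value = "on" if synchronous_commit else "off"
    return [
        # More memory for index builds and foreign key validation.
        f"SET maintenance_work_mem = '{maintenance_work_mem}';",
        # Do not wait for the WAL flush on each commit; suited only to
        # loads that can be rerun from scratch after a crash.
        f"SET synchronous_commit = {commit_value};",
    ]


statements = load_session_settings()
```

Because the statements are scoped to the loading session, other sessions keep the server defaults.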

Page 33: The Thinking Person’s Guide to Data Warehouse Design

Monitoring and Tuning the Design

Page 34: The Thinking Person’s Guide to Data Warehouse Design

1. Bottleneck Analysis

2. Workload Analysis

3. Ratio Analysis

Page 35: The Thinking Person’s Guide to Data Warehouse Design

Bottleneck Analysis

• The focus of this methodology is the answer to the question “what am I waiting on?”
• With general PostgreSQL, unfortunately, it can be difficult to determine latency in the database server
• Lock contention is rarely an issue in data warehouses
• Can use EnterpriseDB’s wait interface
• Problems found in bottleneck analysis translate into better lock handling in the app, partitioning improvements, better indexing, or storage engine replacement

Page 36: The Thinking Person’s Guide to Data Warehouse Design

Workload Analysis

• The focus of this methodology is the answer to three questions: (1) Who’s logged on? (2) What are they doing? (3) How is my machine handling it?
• Monitor active and inactive sessions. Keep in mind that idle connections do take up resources
• I/O and ‘hot objects’ are a key area of analysis
• Key focus should be on SQL statement monitoring and collection; something that goes beyond standard pre-production EXPLAIN analysis
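
The first two questions ("who's logged on?" and "what are they doing?") can be answered from a session snapshot. The sketch below summarizes rows shaped like the pg_stat_activity view in recent PostgreSQL versions (columns usename, state, query); the sample rows are made up, and fetching them from a live server is left out.

```python
from collections import Counter


def summarize_sessions(sessions):
    """Count sessions per state, and active sessions per user."""
    by_state = Counter(s["state"] for s in sessions)
    active_by_user = Counter(
        s["usename"] for s in sessions if s["state"] == "active"
    )
    return by_state, active_by_user


# Hypothetical snapshot of five sessions.
sessions = [
    {"usename": "etl", "state": "active", "query": "COPY sales FROM ..."},
    {"usename": "etl", "state": "active", "query": "CREATE INDEX ..."},
    {"usename": "analyst", "state": "active", "query": "SELECT ..."},
    {"usename": "analyst", "state": "idle", "query": ""},
    {"usename": "app", "state": "idle", "query": ""},
]
by_state, active_by_user = summarize_sessions(sessions)
```

Taking this snapshot at intervals and keeping the history gives a simple view of when idle connections pile up and which users drive the active load.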

Page 37: The Thinking Person’s Guide to Data Warehouse Design

The Pain of Slow SQL

* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.

Page 38: The Thinking Person’s Guide to Data Warehouse Design

Data Warehouse Horror Story Number Two…

Page 39: The Thinking Person’s Guide to Data Warehouse Design

Workload Analysis

• SQL analysis basically becomes bottleneck analysis, because you’re asking where your SQL statement is spending its time

• Once you have collected and identified your ‘top SQL’, the next step is to do tracing and interrogation into each SQL statement to understand its execution

• Historical analysis is important too; a query that ran fine with 5 million rows may tank with 50 million or with more concurrent users

• Design changes usually involve data file striping, indexing, partitioning, or parallel processing additions
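
One simple way to identify ‘top SQL’ is to aggregate collected statement timings and rank them by total elapsed time. The sketch below assumes the timings have already been collected as (statement, duration in milliseconds) pairs, for example from a statement log; the sample statements and numbers are invented.

```python
from collections import defaultdict


def top_sql(executions, limit=3):
    """Rank statements by total elapsed time.

    Returns tuples of (statement, calls, total_ms, avg_ms).
    """
    totals = defaultdict(lambda: [0, 0.0])
    for statement, duration_ms in executions:
        totals[statement][0] += 1
        totals[statement][1] += duration_ms
    ranked = sorted(totals.items(), key=lambda item: item[1][1], reverse=True)
    return [
        (statement, calls, total, total / calls)
        for statement, (calls, total) in ranked[:limit]
    ]


# Hypothetical collected timings.
executions = [
    ("SELECT ... FROM sales WHERE region = $1", 40.0),
    ("SELECT ... FROM sales WHERE region = $1", 60.0),
    ("SELECT count(*) FROM orders", 900.0),
    ("UPDATE inventory SET ...", 5.0),
]
worst = top_sql(executions, limit=2)
```

Comparing the same ranking across dates supports the historical analysis mentioned above, since a statement that climbs the list as data grows is a candidate for a design change.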

Page 40: The Thinking Person’s Guide to Data Warehouse Design

Workload Analysis

• The pgstatspack utility, available to the community, can provide some statistics for workload analysis
• The pgfouine log analyzer tool can help you identify bad SQL
• EnterpriseDB packages a utility that duplicates Oracle Automatic Workload Repository (AWR) reports and shows hot objects, top wait events, and much more
• A new SQL Profiler utility will be available from EnterpriseDB in the first half of 2011 that will help in tracing and analyzing SQL statement execution

Page 41: The Thinking Person’s Guide to Data Warehouse Design

Ratio Analysis

• Least useful of all the performance analysis methods
• May be OK to get a general rule of thumb as to how various resources are being used
• Do not be misled by ratios; for example, a high cache hit ratio is sometimes meaningless. Databases can be brought to their knees by excessive logical I/O
• Design changes from ratios typically include the altering of configuration parameters and sometimes indexing
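
The warning about cache hit ratios can be shown with a small calculation. The sketch assumes the usual definition, hits divided by hits plus physical reads, using counters like blks_hit and blks_read in PostgreSQL's pg_stat_database view; the two workloads are hypothetical.

```python
def cache_hit_ratio(blocks_hit, blocks_read):
    """Fraction of block requests served from cache (0.0 if there were none)."""
    total = blocks_hit + blocks_read
    return blocks_hit / total if total else 0.0


# Two hypothetical workloads with an identical ratio.
light = cache_hit_ratio(blocks_hit=9_900, blocks_read=100)
heavy = cache_hit_ratio(blocks_hit=9_900_000_000, blocks_read=100_000_000)

# The second workload performs a million times more logical I/O.
logical_io_factor = (9_900_000_000 + 100_000_000) // (9_900 + 100)
```

Both workloads report the same ratio, yet the second requests a million times as many blocks, so the ratio alone says nothing about how hard the server is working.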

Page 42: The Thinking Person’s Guide to Data Warehouse Design

Conclusions

Page 43: The Thinking Person’s Guide to Data Warehouse Design

Conclusions

• Design is the #1 contributor to the overall performance and availability of a system

• With PostgreSQL, you have greater flexibility and opportunity than ever before to build well-designed data warehouses

• With PostgreSQL, you now have more options and features available than ever before

• The above means you can design data warehouses that are future-proofed: they can run as fast as you’d like (hopefully) and store as much data as you need (ditto)

Page 44: The Thinking Person’s Guide to Data Warehouse Design

The Thinking Person’s Guide to Data Warehouse Design

Thanks…!