Agile Data Warehouse The Final Frontier
@tbunio
bornagainagilist.wordpress.com
www.protegra.com
Terry Bunio
Who Am I?
• Database Administrator
– Oracle, SQL Server, ADABAS
• Data Architect
– Investors Group, LPL Financial, Manitoba
Blue Cross, Assante Financial
• Agilist
– Innovation Gamer, Team Member, Project
Manager, PMO on SAP Implementation
Learning Objectives
• Learn about how a Data Warehouse Project
can be Agile
• Introduce Agile practices that can help to be
DWAgile
• Introduce DW practices that can help to be
DWAgile
What is Agile?
• Deliver as frequently as possible
• Minimize Inventory
– All work that doesn’t directly contribute to
delivering value to the client
– Typically value is realized by code
Enterprise Models
Spock Method
Visualization
Spectre of the Agility
Database/Data Warehouse Architecture
DWAgile Practices
Data Warehouse
• Definition
– “a database used for reporting and data analysis.
It is a central repository of data which is created
by integrating data from multiple disparate
sources. Data warehouses store current as well
as historical data and are commonly used for
creating trending reports for senior management
reporting such as annual and quarterly
comparisons.” – Wikipedia.org
Data Warehouse
• Can refer to:
– Reporting Databases
– Operational Data Stores
– Data Marts
– Enterprise Data Warehouse
– Cubes
– Excel?
– Others
Two sides of Database Design
Two design methods
• Relational – “Database normalization is the process of organizing
the fields and tables of a relational database to
minimize redundancy and dependency. Normalization
usually involves dividing large tables into smaller (and less
redundant) tables and defining relationships between them.
The objective is to isolate data so that additions, deletions,
and modifications of a field can be made in just one table
and then propagated through the rest of the database via
the defined relationships.”
Two design methods
• Dimensional – “Dimensional modeling always uses the concepts of facts
(measures), and dimensions (context). Facts are typically
(but not always) numeric values that can be aggregated,
and dimensions are groups of hierarchies and descriptors
that define the facts.”
Relational
• Relational Analysis
– Database design is usually in Third Normal
Form
– Database is optimized for transaction
processing. (OLTP)
– Normalized tables are optimized for
modification rather than retrieval
Normal forms
• 1st - Under first normal form, all occurrences of a
record type must contain the same number of fields.
• 2nd - Second normal form is violated when a non-
key field is a fact about a subset of a key. It is only
relevant when the key is composite.
• 3rd - Third normal form is violated when a non-key
field is a fact about another non-key field
Source: William Kent - 1982
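Kent's normal forms are easiest to see in data. A small sketch (the part/warehouse tables and their values are illustrative, not from the talk) of a third-normal-form violation and the decomposition that fixes it:

```python
# 3NF violation: "warehouse_address" is a fact about "warehouse"
# (a non-key field), not about the key "part_number", so the
# address is stored redundantly on every part row.
unnormalized = [
    {"part_number": "P1", "warehouse": "W1", "warehouse_address": "12 Main St"},
    {"part_number": "P2", "warehouse": "W1", "warehouse_address": "12 Main St"},
]

# 3NF decomposition: the address now lives in exactly one row, so a
# modification is made in one table and propagates via the relationship.
parts = [{"part_number": "P1", "warehouse": "W1"},
         {"part_number": "P2", "warehouse": "W1"}]
warehouses = {"W1": {"warehouse_address": "12 Main St"}}

warehouses["W1"]["warehouse_address"] = "99 Elm St"  # one update...
address_of = lambda part: warehouses[part["warehouse"]]["warehouse_address"]
assert all(address_of(p) == "99 Elm St" for p in parts)  # ...seen everywhere
```

In the unnormalized shape, the same change would have to touch every part row, which is exactly the redundancy normalization removes.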
Dimensional
• Dimensional Analysis
– Star Schema/Snowflake
– Database is optimized for analytical
processing. (OLAP)
– Facts and Dimensions optimized for retrieval
• Facts – Business events – Transactions
• Dimensions – context for Transactions
– Accounts
– Products
– Date
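The fact/dimension split can be sketched in a few lines. The product and date dimensions below are hypothetical, but the rollup is the typical OLAP-style aggregation a star schema is optimized to serve:

```python
from collections import defaultdict

# Dimension tables: context for the transactions (illustrative values).
dim_product = {1: "Chequing", 2: "Savings"}
dim_date = {20240101: "2024-Q1", 20240401: "2024-Q2"}

# Fact table: business events with numeric, aggregatable measures.
fact_transactions = [
    {"product_key": 1, "date_key": 20240101, "amount": 100.0},
    {"product_key": 1, "date_key": 20240401, "amount": 250.0},
    {"product_key": 2, "date_key": 20240101, "amount": 75.0},
]

# A typical analytical query: sum the measure, grouped by dimensions.
totals = defaultdict(float)
for row in fact_transactions:
    key = (dim_product[row["product_key"]], dim_date[row["date_key"]])
    totals[key] += row["amount"]

assert totals[("Chequing", "2024-Q1")] == 100.0
```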
Relational
Dimensional
Kimball-lytes
• Bottom-up - incremental
– Operational systems feed the Data
Warehouse
– Data Warehouse is a corporate dimensional
model that Data Marts are sourced from
– Data Warehouse is the consolidation of Data
Marts
– Sometimes the Data Warehouse is generated
from Subject area Data Marts
Inmon-ians
• Top-down
– Corporate Information Factory
– Operational systems feed the Data
Warehouse
– Enterprise Data Warehouse is a corporate
relational model that Data Marts are sourced
from
– Enterprise Data Warehouse is the source of
Data Marts
The gist…
• Kimball’s approach is easier to implement as
you are dealing with separate subject areas,
but can be a nightmare to integrate
• Inmon’s approach has more upfront effort to
avoid these consistency problems, but takes
longer to implement.
Spectre of the Agility
Incremental - Kimball
•In Segments •Detailed Analysis
•Development •Deploy •Long Feedback loop
•Considerable changes •Rework •Defects
Waterfall - Inmon •Detailed Analysis •Large Development
•Large Deploy •Long Feedback loop •Extensive changes
•Many Defects
Data Warehouse Project
Popular Agile Data Warehouse Pattern
• Son’a method
– Analyze data requirements department by
department
– Create Reports and Facts and Dimensions for
each
– Integrate when you do subsequent
departments
The two problems
• Conforming Dimensions
– A Dimension conforms when it is in
equivalent structure and content
– Is a client defined by Marketing the same as
Finance?
• Probably not
– If the Dimensions do not conform, this
severely hampers the Data Warehouse
The two problems
• Modeling the use of the data versus the data
– By using reporting needs as the primary
foundation for the data model, you are modeling
the use of the data rather than the data
– This will cause more rework in the future as the
use of the data is more likely to change than the
data itself.
Where is she?
Where is the true Agility?
• Iterations not Increments
• Brutal Visibility/Visualization
• Short Feedback loops
• Just enough requirements
• Working on enterprise priorities – not just for
an individual department
Fact
• True iterative development on a Data
Warehouse project is hard – perhaps harder
than a traditional Software Development
project
– ETL, Data Models, and Business Intelligence
stories can have a high impact on other
stories
– Can be difficult to create independent stories
– Stories can have many prerequisites
Fiction
• True iterative development on a Data
Warehouse project is impossible
– ETL, Data Models, and Business Intelligence
stories can be developed iteratively
– Independent stories can be developed
– Stories can have many prerequisites – but
this can be limited
Agile Mindset
• We need to implement an Agile Mindset to
Data Modelling
– What is just enough Data Modelling?
– And do no more…
Our Mission
• “Data... the Final Frontier. These are the
continuing voyages of the starship Agile.
Her on-going mission: to explore strange
new projects, to seek out new value and
new clients, to iteratively go where no
projects have gone before.”
The Prime Directive
• Is a vision or philosophy that binds the
actions of Starfleet
• Can a Data Warehouse project truly be
Agile without a Vision of either the Business
Domain or Data Domain?
– Essentially it is then just an Ad Hoc Data
Warehouse. Separate components that may fit
together.
– How do we ensure we are working on the right
priorities for the entire enterprise?
Enterprise Data Model?
Torture
• Why does the creation of Enterprise Data
Models feel like torture?
– Interrogation
– Coercion
– Agreement on Excessive detail without direct
alignment to business value
Enterprise Models
Two new models
Agile Enterprise Normalized Data Model
• Confirms the major entities and the
relationships between them
– 30-50 entities
• Confirms the Data Domain
• Starts the definition of a Normalized Data
Model that will be refined over time
– Completed in 1 – 4 weeks
Agile Enterprise Normalized Data Model
• Is just enough to understand the data
domain so that the iterations can proceed
• Is not mapping all the attributes
– Is not BDUF
• Is an Information Map for the Data Domain
• Contains placeholders for refinement
– Like a User Story Map
Agile Enterprise Dimensional Data Model
• Confirms the Business Objects and the
relationships between them
– 10-15 entities
• Confirms the Business Domains
• Starts the definition of a Dimensional Data
Model that will be refined over time
– Completed in 1 – 2 weeks
Agile Enterprise Dimensional Data Model
• Is just enough to understand the business
domain so that the iterations can proceed
– And to validate the understanding of the data
domain
• Is not mapping all the attributes
– Is not BDUF
• Is an Information Map for the Business Domain
• Contains placeholders for refinement
– Like a User Story Map
Agile Information Maps
• Agile Information Maps allow for:
– Efficient Navigation of the Data and Business
Domains
– Ability to set up ‘Neutral Zones’ for areas that
need more negotiation
– Visual communication of the topology of the
Data and Business Domains
• Easier and more accurate to validate than text
• ‘feels right’
Agile Information Maps
• Are
– Our vision
– Our Maps for the Data and Business Domains
– A guide for our solution
– Minimizes rework and refactoring
– Our Prime Directive
– Data Models
Kimball or Inmon?
Spock
• Hybrid approach
– It is only logical
– Needs of the many outweigh the needs of the
few – or the one
Spock Approach
[Diagram: Business Domain Spike → Agile Normalized Data Model and Agile Dimensional Data Model → ODS and DW → Data Marts (DM)]
Spock Approach
• Business Domain Spike
• Agile Information Maps
– Agile Enterprise Normalized Data Model
– Agile Enterprise Dimensional Data Model
• Implement
– Operational Data Store
– Dimensional Data Warehouse
• Reporting can then be done from either
Business Domain Spike
• Needs to precede work on Agile Information
Maps
• Need to understand the business and
industry before you can create Data or
Business Information Maps
• Can take 1-2 weeks for an initial
understanding
– Constant refinement
Benefits of Spock Approach
• Agile Enterprise Normalized Data Model
– Validates knowledge of Data Domain
– Ensure later increments don’t uncover data
that was previously unknown and hard to
integrate
• Minimizes rework and refactoring
– True iterations
• Confirm at high level and then refine
Benefits of Spock Approach
• Agile Enterprise Dimensional Data Model
– Validates knowledge of Business Domain
– The process of ‘cooking down’ to a
Dimensional Model validates design and
identifies areas of inconsistencies or errors
• Especially true when you need to design how
changes and history will be handled
– True iterations
• Confirm at high level and then refine
Benefits of Spock Approach
• Operational Data Store
– Model data relationally to provide enterprise
level operational reports
– Consolidate and cleanse data before it is
visible to end-users
– Is used to refine the Agile Enterprise
Normalized Data Model
– Start creating reports to validate data model
immediately!
Benefits of Spock Approach
• Dimensional Data Warehouse
– Model data dimensionally to provide
enterprise level analytical reports
– Provide full historical data and context for
reports
– Is used to refine the Agile Enterprise
Dimensional Data Model
– Clients can start creating reports to validate
data model immediately!
Do we need an ODS and DW?
• Relational Analysis provides
– Validation of the Data domain
• Dimensional Analysis provides
– Validation of the Business domain
– Additional level of confirmation of the Data
domain as the relational model is translated
into a dimensional one
• Much easier for inconsistencies and errors to
hide in 300+ tables as opposed to 30+
Most Importantly..
• Operational Data Store
– Minimal Data Latency
– Current state
– Allow for efficient Operational Reporting
• Data Warehouse
– Moderate Data Latency
– Full history
– Allows for efficient Analytical Reporting
Agile Approach
• With an Agile approach you can deliver just
enough of an Operational Data Store or Data
Warehouse based on needs
– No longer do they need to be a huge deliverable
• Neither presumes a complete implementation
is required
• The Information Models allow for iterative
delivery of value
How do we work iteratively on
a Data Warehouse?
Increments versus iterations
• Increments
– Series by series – department by department
• Iterations
– Story by story – episode by episode
• Enterprise prioritization
– Work on the highest priority for the enterprise
– Not just within each series/department
Iterative Focus
• Instead of focusing on trying to have a
complete model, we focused on creating
processes that allow us to deliver changes
within 30 minutes from model to deployment
Captain, we need more Visualization!
The View Screen
• Enabled bridge to bridge communications
• Provided visual images in and around the
ship
– From different angles
– How did that work?
• Allowed for more understanding of the
situation
Visualization
• Is required to:
– Report Project status
– Provide a visual report map
Kanban Board
• We used a standard Kanban board to track
stories as we worked on them
– These stories resulted in ETL, Data Model,
and Reporting tasks
– We had a Data Model/ETL board and a Report
board
– ETL and Data Model required a foundation
created by the Information Maps before we
could start on stories
• We also used thermometer imagery to report
how we were progressing according to the
schedule
– Milestones were on the thermometer along
with the number of reports that we had
completed every day
Report Visualization
Cardassian Union
Be careful how you spell that…
Data Modeling Union
• For too long the Data Modellers have not
been integrated with Software Developers
• Data Modellers have been like the
Cardassian Union, not integrated with the
Federation
Issues
• This has led to:
– Holy wars
– Each side expecting the other to follow their
schedule
– Lack of communication and collaboration
• Data Modellers need to join the ‘United
Federation of Projects’
How were we Agile?
Tools of the trade
Tools of the Trade
• Version Control and Refactoring
• Test Automation
• Communication and Governance
• Adaptability and Change Tolerance
• Assimilation
Version Control
• If you don’t control versions, they will control
you
• Data Models must become integrated with
the source control of the project
– In the same repository of project trunk and
branches
• You can’t just version a series of SQL files
separate from your data model
Our Version Experience
• We are using Subversion
• We are using Oracle Data Modeler as our
Modeling tool.
– It has very good integration with Subversion
– Our DBMS is SQL Server 2012
• Unlike other modeling tools, the data model
was able to be integrated in Subversion with
the rest of the project
ODM Shameless plug
• Free
• Subversion Integration
• Supports Logical and Relational data models
• Since it is free, the data models can be
shared and refined by all members of the
development team
• Currently on version 2685
How do we roll out versions?
• Create Data Model changes
• Use Red Gate SQL Compare to generate
alter script
– Generate a new DB and compare to the last
version to generate alter script
• 95% of changes deployed in less than 10
minutes
How do we roll out versions?
• We build on the Farley and Humble Blue-
Green Deployment model
– Blue – Current Version and Revision – Database
Name will be ‘ODS’
– Green – 1 Revision Old – Database Name will be
‘ODS-GREEN’
– Brown – 1 Major Version Old – Database Name
will be ‘ODS-BROWN’
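The rotation implied by this naming scheme can be sketched as follows. The function is my reading of the slide, not the team's actual deployment tooling, and the revision labels are invented:

```python
def rotate(databases, new_label, major=False):
    """Return the database-name mapping after deploying new_label.

    'ODS' (Blue) always points at the current revision; the previous
    revision becomes 'ODS-GREEN'; on a major version bump the outgoing
    major version is also kept as 'ODS-BROWN'.
    """
    out = dict(databases)
    if major:                            # keep the old major version as Brown
        out["ODS-BROWN"] = databases["ODS"]
    out["ODS-GREEN"] = databases["ODS"]  # previous revision becomes Green
    out["ODS"] = new_label               # Blue: the new current revision
    return out

names = {"ODS": "rev 7", "ODS-GREEN": "rev 6", "ODS-BROWN": "v1.final"}
after = rotate(names, "rev 8")
assert after == {"ODS": "rev 8", "ODS-GREEN": "rev 7", "ODS-BROWN": "v1.final"}
```

Keeping Green and Brown online is what makes rollback a rename rather than a restore.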
Versioning
• SQL change scripts are generated for all
changes
• A full script is generated for every major
version
– A new folder is created for every major
version
– Major version folders are named after the
Greek alphabet (alpha, beta, gamma)
SQL Script version naming standards
• [revision number]-[ODS/DW]-[I/A]-[version number]-
[subversion revision number of corresponding Data
model].sql
– Revision number – auto-incrementing
– Version Number – A999
• Alphabetic character represents major version – corresponds
with folder named after the Greek alphabet
• 999 indicates minor versions
– Subversion revision number of corresponding Data model – allows
for an exact synchronization between Data Model and SQL Scripts
• All objects are stored within one Subversion repository
– They all share the same revision numbering
SQL Script version naming standards
• For example:
– 0-ODS-I-A001-745.sql – initial db and table
creation for current ODS version (includes
reference data)
– 1-ODS-A-A001-1574.sql – revision 1 ODS alter
script that corresponds to data model subversion
revision 1574
– 2-ODS-A-A001-1590.sql - revision 2 ODS alter
script that corresponds to data model subversion
revision 1590
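A naming standard like this is easy to validate mechanically. The sketch below is a hypothetical checker; the regular expression is my transcription of the convention, exercised against the slide's own examples:

```python
import re

# Regex transcription of the script-naming convention (an assumption,
# derived from the examples, not the project's actual validator).
PATTERN = re.compile(
    r"^(?P<revision>\d+)"        # auto-incrementing revision number
    r"-(?P<target>ODS|DW)"       # which database the script targets
    r"-(?P<kind>[IA])"           # I = initial creation, A = alter script
    r"-(?P<version>[A-Z]\d{3})"  # major letter (Greek-alphabet folder) + minor
    r"-(?P<svn>\d+)\.sql$"       # Subversion revision of the data model
)

def parse(filename):
    """Split a script name into its parts, or reject it outright."""
    m = PATTERN.match(filename)
    if not m:
        raise ValueError(f"non-conforming script name: {filename}")
    return m.groupdict()

info = parse("1-ODS-A-A001-1574.sql")
assert info["svn"] == "1574" and info["kind"] == "A"
```

Embedding the data model's Subversion revision in the name is what makes the "exact synchronization" check above possible.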
SQL Script error handling
• Validation is done to prevent
– Scripts being run out of sequence
– Revision being applied without addressing
required refactoring
– Scripts being run on any environment but the
Blue environment
But what about Refactoring?
• Having Agile Information Maps has
significantly reduced refactoring
– This was an entirely new data domain for the
team
• Using the Blue-Green-Brown deployment
model has simplified required refactoring
• We have used the methods described by
Scott Ambler on the odd occasion
Good Start
Create the plan for how you
will re-factor
Refactoring Experience
• We haven’t needed to refactor much
• Since we are iteratively refining, we haven’t
had to re-define much
– Just adding more detail
– Main Information Maps have held together
Test Automation
• Enterprise was saved due to constantly
running tests on the warp engine
• Allowed for quick decision making
Automated Test Suite
• Leveraged the tSQLt Open Source
Framework
• Purchased SQL Test from Red Gate for an
enhanced interface
• Enhanced the framework to execute tests
from four custom tables we defined
Automated Test Suite
• Leveraged Data Mapping spreadsheet that
the automated tests used
– Two database tables were loaded from the
spreadsheet
– Two additional tables contained ETL test
cases
– 13 Stored Procedures executed the tests
– 3300+ columns mapped!
Table Tests
• TstTableCount: Compares record counts between source
data and target data.
• TstTableColumnDistinct: Compares counts on distinct values
of columns.
• TstTableColumnNull: Generates a report of all columns
where the contents of a field are all null.
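The real tests are tSQLt stored procedures; the plain-Python sketch below only illustrates what two of the table tests assert (the staging data is made up):

```python
def tst_table_count(source_rows, target_rows):
    """Fail if the target table did not land the same number of records."""
    assert len(source_rows) == len(target_rows), (
        f"count mismatch: source={len(source_rows)} target={len(target_rows)}")

def tst_table_column_null(target_rows, columns):
    """Report columns whose contents are null for every row."""
    return [c for c in columns
            if all(row.get(c) is None for row in target_rows)]

# Illustrative staging data: "legacy_code" was never mapped, so it
# shows up in the all-null report.
staging = [{"id": 1, "name": "a", "legacy_code": None},
           {"id": 2, "name": "b", "legacy_code": None}]
tst_table_count(staging, staging)  # passes: counts agree
assert tst_table_column_null(staging, ["name", "legacy_code"]) == ["legacy_code"]
```

An all-null column is usually a sign of a missing or broken ETL mapping, which is why it warrants its own report.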
Column Tests
• TstColumnDataMapping: Compares columns directly
assigned from a source column on a field by field basis for 5-10
rows in the target table.
• TstColumnConstantMapping: Compares columns assigned a
constant on a field by field basis for 5-10 rows in the target
table.
• TstColumnNullMapping: Compares columns assigned a Null
value on a field by field basis for 5-10 rows in the target table.
• TstColumnTransformedMapping: Compares transformed
columns on a field by field basis for 5-10 rows in the target
table.
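A hypothetical sketch of the TstColumnDataMapping idea, spot-checking a handful of rows of a directly mapped column (the column names are invented; the real tests drive this from the data-mapping spreadsheet):

```python
def tst_column_data_mapping(source, target, src_col, tgt_col, sample=10):
    """Compare a directly assigned column field by field for a sample of rows."""
    for src_row, tgt_row in list(zip(source, target))[:sample]:
        assert tgt_row[tgt_col] == src_row[src_col], (
            f"{tgt_col!r} != {src_col!r} for row {src_row}")

# Illustrative source/target rows with a direct column mapping.
source = [{"CUST_NM": "Kirk"}, {"CUST_NM": "Spock"}]
target = [{"customer_name": "Kirk"}, {"customer_name": "Spock"}]
tst_column_data_mapping(source, target, "CUST_NM", "customer_name")  # passes
```

Sampling 5-10 rows trades exhaustiveness for speed, which is what lets thousands of mapped columns be checked on every run.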
Data Quality Tests
• TstInvalidParentFKColumn: Tests that an Invalid Parent FK
value results in the records being logged and bypassed. This
record will be added to the staging table to test the process.
• TstInvalidFKColumn: Tests that an Invalid FK value results in
the value being assigned a default value or Null. This record
will be added to the staging table to test the process.
• TstInvalidColumn: Tests that an Invalid value results in the
value being assigned a default value or Null. This record will be
added to the staging table to test the process.
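The TstInvalidFKColumn behaviour can be sketched as a lookup with a fallback; the "unknown" default key below is an assumption for illustration, not the project's actual value:

```python
# Assumed "unknown" member of the dimension, used when a foreign key
# arriving from the source system cannot be resolved.
UNKNOWN_KEY = -1

def resolve_fk(value, dimension_keys, default=UNKNOWN_KEY):
    """Keep a valid FK; replace an invalid one with the default
    instead of failing the load."""
    return value if value in dimension_keys else default

dim_keys = {1, 2, 3}
assert resolve_fk(2, dim_keys) == 2
assert resolve_fk(99, dim_keys) == UNKNOWN_KEY  # invalid FK gets the default
```

The test then seeds the staging table with a deliberately bad key and asserts the load lands the default rather than the bad value.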
Process Integrity Tests
• TstRestartTask: Tests that a Task can be started from the
start and subsequent steps will run in sequence.
• TstRecoverTask: Tests that a Task can be re-started in the
middle and that records are processed correctly and subsequent
steps will run in sequence.
Interested?
• Leave me a business card and I’ll send you
the design document and stored procedures
Communication
Team Communication
• Frequent Data Model walkthroughs with
application teams
• Full access to the Data model through the
Data Modeling development tool
• Data Models posted in every room for
developers to mark up with suggestions
• Database deployment to play with for every
release
Client Communication
• Frequent Conceptual Data Model
walkthroughs with clients
– Includes presentation of scenarios with data
to confirm and validate understanding
• Collaboration on the iterative plan to ensure
they agree on the process and support it
Monthly Governance Meeting
– Visual Kanban boards reviewed
– Reports developed in the prior iterations were
demonstrated
– Business Areas were asked to submit a ranked
list of their top 10-20 data requirements/reports for
the next iteration.
Adaptability
Be Nimble
• Already discussed how we can roll out new
versions quickly
Change Tolerant Data Model
• Only add tables and columns when they are
absolutely required
• Leverage Data Domains so that attributes
are created consistently and can be changed
in unison
– Use limited number of standard domains
Change Tolerant Data Model
• Data Model needs to be loosely coupled and
have high cohesion
– Need to model the data and business and not
the applications or reports!
Change Tolerant Data Model
• Don’t model the data according to the
application’s Object Model
• Don’t model the data according to source
systems
• These items will change more frequently
than the actual data structure
• Your Data Model and Object Model should
be different!
Assimilate
• Assimilate Version Control, Communication,
Adaptability, Refinement, and Re-Factoring
into core project activities
– Stand ups
– Continuous Integration
– Check outs and Check Ins
• Make them part of the standard activities –
not something on the side
Our experience
Our Mission
• These practices and methods are being
used to redevelop an entire Business
Intelligence platform for a major ‘Blue’ Health
Benefits company
– Operational and Analytical Reports
• 100+ integration projects
• SAP Claims solution
Our Mission
• Integration projects are being run Agile
• 100+ team members across all projects
• SAP project is being run in a more
traditional manner
– ‘big-bang’ SAP implementation
• I’m now also fulfilling the role of an Agile PMO
Our Challenge
• How can we deploy to production early and
often when the system is a ‘big-bang’
implementation?
– We were ready to deploy ahead of clients and
other projects
– We were dependent on other conversion
projects
Our Challenge
• We are now exploring alternate ways to
deploy to production before the ‘big-bang’
implementation
– To allow the clients to use the reports and
iteratively refine them and the solution
– Also allows our team to validate data integrity
and quality iteratively
– We are now executing iterations to make this
possible
Our BI Solution
• SQL Server 2012
– Integration Services
– Reporting Services
• SharePoint 2010 Foundation
– SharePoint Integrated Reporting Solution
Our team
• Integrated team of
– 2 enterprise DBAs from the ‘Blue’
– 5 Data Analysts/DBAs/SSIS/SSRS developers
• Governance team comprised of
– Business Areas
– Systems Areas
– Stakeholders
Current Stardate
• We have completed the initial ODS and DW
development – including ETL
• We have completed a significant revision of
ODS, DW, and ETL – without major issues
• We are now finishing Report development –
reports have required database changes and
ETL changes – but no major changes!
– 300+ reports developed
Summary
• Use Agile Enterprise Data Models to provide
the initial vision and allow for refinements
• Strive for Iterations over Increments
• Align governance and prioritization with
iterations
• Plan and Integrate processes for Versioning,
Test Automation, Adaptability, Refinement
What doesn’t change?
Leadership
• “If you want to build a ship, don't drum up
people together to collect wood and don't
assign them tasks and work, but rather teach
them to long for the endless immensity of the
sea.” ~ Antoine de Saint-Exupéry
Leadership
• “[A goalie's] job is to stop pucks, ... Well, yeah, that's
part of it. But you know what else it is? ... You're
trying to deliver a message to your team that things
are OK back here. This end of the ice is pretty well
cared for. You take it now and go. Go! Feel the
freedom you need in order to be that dynamic,
creative, offensive player and go out and score. ...
That was my job. And it was to try to deliver a
feeling.” ~ Ken Dryden
Three awesome books