Agile Data Warehouse The Final Frontier
@tbunio
bornagainagilist.wordpress.com
www.protegra.com
Terry Bunio
Who Am I?
• Database Administrator
– Oracle, SQL Server, ADABAS
• Data Architect
– Investors Group, LPL Financial, Manitoba
Blue Cross, Assante Financial
• Agilist
– Innovation Gamer, Team Member, Project
Manager, PMO on SAP Implementation
Learning Objectives
• Learn about how a Data Warehouse Project
can be Agile
• Introduce Agile practices that can help to be
DWAgile
• Introduce DW practices that can help to be
DWAgile
What is Agile?
• Deliver as frequently as possible
• Minimize Inventory
– All work that doesn’t directly contribute to
delivering value to the client
– Typically value is realized by code
Enterprise Models
Spock Method
Visualization
Spectre of the Agility
Database/Data Warehouse Architecture
DWAgile Practices
Data Warehouse
• Definition
– “a database used for reporting and data analysis.
It is a central repository of data which is created
by integrating data from multiple disparate
sources. Data warehouses store current as well
as historical data and are commonly used for
creating trending reports for senior management
reporting such as annual and quarterly
comparisons.” – Wikipedia.org
Data Warehouse
• Can refer to:
– Reporting Databases
– Operational Data Stores
– Data Marts
– Enterprise Data Warehouse
– Cubes
– Excel?
– Others
Two sides of Database Design
Two design methods
• Relational – “Database normalization is the process of organizing
the fields and tables of a relational database to
minimize redundancy and dependency. Normalization
usually involves dividing large tables into smaller (and less
redundant) tables and defining relationships between them.
The objective is to isolate data so that additions, deletions,
and modifications of a field can be made in just one table
and then propagated through the rest of the database via
the defined relationships.”
Two design methods
• Dimensional – “Dimensional modeling always uses the concepts of facts
(measures), and dimensions (context). Facts are typically
(but not always) numeric values that can be aggregated,
and dimensions are groups of hierarchies and descriptors
that define the facts.”
Relational
• Relational Analysis
– Database design is usually in Third Normal
Form
– Database is optimized for transaction
processing. (OLTP)
– Normalized tables are optimized for
modification rather than retrieval
Normal forms
• 1st - Under first normal form, all occurrences of a
record type must contain the same number of fields.
• 2nd - Second normal form is violated when a non-
key field is a fact about a subset of a key. It is only
relevant when the key is composite.
• 3rd - Third normal form is violated when a non-key
field is a fact about another non-key field
Source: William Kent - 1982
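Kent's normal forms are easiest to see in data. A small sketch (the part/warehouse tables and their values are illustrative, not from the talk) of a third-normal-form violation and the decomposition that fixes it:

```python
# 3NF violation: "warehouse_address" is a fact about "warehouse"
# (a non-key field), not about the key "part_number", so the
# address is stored redundantly on every part row.
unnormalized = [
    {"part_number": "P1", "warehouse": "W1", "warehouse_address": "12 Main St"},
    {"part_number": "P2", "warehouse": "W1", "warehouse_address": "12 Main St"},
]

# 3NF decomposition: the address now lives in exactly one row, so a
# modification is made in one table and propagates via the relationship.
parts = [{"part_number": "P1", "warehouse": "W1"},
         {"part_number": "P2", "warehouse": "W1"}]
warehouses = {"W1": {"warehouse_address": "12 Main St"}}

warehouses["W1"]["warehouse_address"] = "99 Elm St"  # one update...
address_of = lambda part: warehouses[part["warehouse"]]["warehouse_address"]
assert all(address_of(p) == "99 Elm St" for p in parts)  # ...seen everywhere
```

In the unnormalized shape, the same change would have to touch every part row, which is exactly the redundancy normalization removes.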
Dimensional
• Dimensional Analysis
– Star Schema/Snowflake
– Database is optimized for analytical
processing. (OLAP)
– Facts and Dimensions optimized for retrieval
• Facts – Business events – Transactions
• Dimensions – context for Transactions
– Accounts
– Products
– Date
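The fact/dimension split can be sketched in a few lines. The product and date dimensions below are hypothetical, but the rollup is the typical OLAP-style aggregation a star schema is optimized to serve:

```python
from collections import defaultdict

# Dimension tables: context for the transactions (illustrative values).
dim_product = {1: "Chequing", 2: "Savings"}
dim_date = {20240101: "2024-Q1", 20240401: "2024-Q2"}

# Fact table: business events with numeric, aggregatable measures.
fact_transactions = [
    {"product_key": 1, "date_key": 20240101, "amount": 100.0},
    {"product_key": 1, "date_key": 20240401, "amount": 250.0},
    {"product_key": 2, "date_key": 20240101, "amount": 75.0},
]

# A typical analytical query: sum the measure, grouped by dimensions.
totals = defaultdict(float)
for row in fact_transactions:
    key = (dim_product[row["product_key"]], dim_date[row["date_key"]])
    totals[key] += row["amount"]

assert totals[("Chequing", "2024-Q1")] == 100.0
```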
Relational
Dimensional
Kimball-lytes
• Bottom-up - incremental
– Operational systems feed the Data
Warehouse
– Data Warehouse is a corporate dimensional
model that Data Marts are sourced from
– Data Warehouse is the consolidation of Data
Marts
– Sometimes the Data Warehouse is generated
from Subject area Data Marts
Inmon-ians
• Top-down
– Corporate Information Factory
– Operational systems feed the Data
Warehouse
– Enterprise Data Warehouse is a corporate
relational model that Data Marts are sourced
from
– Enterprise Data Warehouse is the source of
Data Marts
The gist…
• Kimball’s approach is easier to implement as
you are dealing with separate subject areas,
but can be a nightmare to integrate
• Inmon’s approach has more upfront effort to
avoid these consistency problems, but takes
longer to implement.
Spectre of the Agility
Incremental - Kimball
•In Segments •Detailed Analysis
•Development •Deploy •Long Feedback loop
•Considerable changes •Rework •Defects
Waterfall - Inmon •Detailed Analysis •Large Development
•Large Deploy •Long Feedback loop •Extensive changes
•Many Defects
Data Warehouse Project
Popular Agile Data Warehouse Pattern
• Son’a method
– Analyze data requirements department by
department
– Create Reports and Facts and Dimensions for
each
– Integrate when you do subsequent
departments
The two problems
• Conforming Dimensions
– A Dimension conforms when it is in
equivalent structure and content
– Is a client defined by Marketing the same as
Finance?
• Probably not
– If the Dimensions do not conform, this
severely hampers the Data Warehouse
The two problems
• Modeling the use of the data versus the data
– By using reporting needs as the primary
foundation for the data model, you are modeling
the use of the data rather than the data
– This will cause more rework in the future as the
use of the data is more likely to change than the
data itself.
Where is she?
Where is the true Agility?
• Iterations not Increments
• Brutal Visibility/Visualization
• Short Feedback loops
• Just enough requirements
• Working on enterprise priorities – not just for
an individual department
Fact
• True iterative development on a Data
Warehouse project is hard – perhaps harder
than a traditional Software Development
project
– ETL, Data Models, and Business Intelligence
stories can have a high impact on other
stories
– Can be difficult to create independent stories
– Stories can have many prerequisites
Fiction
• True iterative development on a Data
Warehouse project is impossible
– ETL, Data Models, and Business Intelligence
stories can be developed iteratively
– Independent stories can be developed
– Stories can have many prerequisites – but
this can be limited
Agile Mindset
• We need to implement an Agile Mindset to
Data Modelling
– What is just enough Data Modelling?
– And do no more…
Our Mission
• “Data... the Final Frontier. These are the
continuing voyages of the starship Agile.
Her on-going mission: to explore strange
new projects, to seek out new value and
new clients, to iteratively go where no
projects have gone before.”
The Prime Directive
• Is a vision or philosophy that binds the
actions of Starfleet
• Can a Data Warehouse project truly be
Agile without a Vision of either the Business
Domain or Data Domain?
– Essentially it is then just an Ad Hoc Data
Warehouse. Separate components that may fit
together.
– How do we ensure we are working on the right
priorities for the entire enterprise?
Enterprise Data Model?
Torture
• Why does the creation of Enterprise Data
Models feel like torture?
– Interrogation
– Coercion
– Agreement on Excessive detail without direct
alignment to business value
Enterprise Models
Two new models
Agile Enterprise Normalized Data Model
• Confirms the major entities and the
relationships between them
– 30-50 entities
• Confirms the Data Domain
• Starts the definition of a Normalized Data
Model that will be refined over time
– Completed in 1 – 4 weeks
Agile Enterprise Normalized Data Model
• Is just enough to understand the data
domain so that the iterations can proceed
• Is not mapping all the attributes
– Is not BDUF
• Is an Information Map for the Data Domain
• Contains placeholders for refinement
– Like a User Story Map
Agile Enterprise Dimensional Data Model
• Confirms the Business Objects and the
relationships between them
– 10-15 entities
• Confirms the Business Domains
• Starts the definition of a Dimensional Data
Model that will be refined over time
– Completed in 1 – 2 weeks
Agile Enterprise Dimensional Data Model
• Is just enough to understand the business
domain so that the iterations can proceed
– And to validate the understanding of the data
domain
• Is not mapping all the attributes
– Is not BDUF
• Is an Information Map for the Business Domain
• Contains placeholders for refinement
– Like a User Story Map
Agile Information Maps
• Agile Information Maps allow for:
– Efficient Navigation of the Data and Business
Domains
– Ability to set up ‘Neutral Zones’ for areas that
need more negotiation
– Visual communication of the topology of the
Data and Business Domains
• Easier and more accurate to validate than text
• ‘feels right’
Agile Information Maps
• Are
– Our vision
– Our Maps for the Data and Business Domains
– A guide for our solution
– Minimizes rework and refactoring
– Our Prime Directive
– Data Models
Kimball or Inmon?
Spock
• Hybrid approach
– It is only logical
– Needs of the many outweigh the needs of the
few – or the one
Spock Approach
[Diagram: Business Domain Spike → Agile Normalized Data Model and Agile Dimensional Data Model → ODS and DW → Data Marts (DM)]
Spock Approach
• Business Domain Spike
• Agile Information Maps
– Agile Enterprise Normalized Data Model
– Agile Enterprise Dimensional Data Model
• Implement
– Operational Data Store
– Dimensional Data Warehouse
• Reporting can then be done from either
Business Domain Spike
• Needs to precede work on Agile Information
Maps
• Need to understand the business and
industry before you can create Data or
Business Information Maps
• Can take 1-2 weeks for an initial
understanding
– Constant refinement
Benefits of Spock Approach
• Agile Enterprise Normalized Data Model
– Validates knowledge of Data Domain
– Ensure later increments don’t uncover data
that was previously unknown and hard to
integrate
• Minimizes rework and refactoring
– True iterations
• Confirm at high level and then refine
Benefits of Spock Approach
• Agile Enterprise Dimensional Data Model
– Validates knowledge of Business Domain
– The process of ‘cooking down’ to a
Dimensional Model validates design and
identifies areas of inconsistencies or errors
• Especially true when you need to design how
changes and history will be handled
– True iterations
• Confirm at high level and then refine
Benefits of Spock Approach
• Operational Data Store
– Model data relationally to provide enterprise
level operational reports
– Consolidate and cleanse data before it is
visible to end-users
– Is used to refine the Agile Enterprise
Normalized Data Model
– Start creating reports to validate data model
immediately!
Benefits of Spock Approach
• Dimensional Data Warehouse
– Model data dimensionally to provide
enterprise level analytical reports
– Provide full historical data and context for
reports
– Is used to refine the Agile Enterprise
Dimensional Data Model
– Clients can start creating reports to validate
data model immediately!
Do we need an ODS and DW?
• Relational Analysis provides
– Validation of the Data domain
• Dimensional Analysis provides
– Validation of the Business domain
– Additional level of confirmation of the Data
domain as the relational model is translated
into a dimensional one
• Much easier for inconsistencies and errors to
hide in 300+ tables as opposed to 30+
Most Importantly..
• Operational Data Store
– Minimal Data Latency
– Current state
– Allow for efficient Operational Reporting
• Data Warehouse
– Moderate Data Latency
– Full history
– Allows for efficient Analytical Reporting
Agile Approach
• With an Agile approach you can deliver just
enough of an Operational Data Store or Data
Warehouse based on needs
– No longer do they need to be a huge deliverable
• Neither presumes a complete implementation
is required
• The Information Models allow for iterative
delivery of value
How do we work iteratively on
a Data Warehouse?
Increments versus iterations
• Increments
– Series by series – department by department
• Iterations
– Story by story – episode by episode
• Enterprise prioritization
– Work on the highest priority for the enterprise
– Not just within each series/department
Iterative Focus
• Instead of focusing on trying to have a
complete model, we focused on creating
processes that allow us to deliver changes
within 30 minutes from model to deployment
Captain, we need more Visualization!
The View Screen
• Enabled bridge to bridge communications
• Provided visual images in and around the
ship
– From different angles
– How did that work?
• Allowed for more understanding of the
situation
Visualization
• Is required to:
– Report Project status
– Provide a visual report map
Kanban Board
• We used a standard Kanban board to track
stories as we worked on them
– These stories resulted in ETL, Data Model,
and Reporting tasks
– We had a Data Model/ETL board and a Report
board
– ETL and Data Model required a foundation
created by the Information Maps before we
could start on stories
• We also used thermometer imagery to report
how we were progressing according to the
schedule
– Milestones were on the thermometer along
with the number of reports that we had
completed every day
Report Visualization
Cardassian Union
Be careful how you spell that…
Data Modeling Union
• For too long the Data Modellers have not
been integrated with Software Developers
• Data Modellers have been like the
Cardassian Union, not integrated with the
Federation
Issues
• This has led to:
– Holy wars
– Each side expecting the other to follow their
schedule
– Lack of communication and collaboration
• Data Modellers need to join the ‘United
Federation of Projects’
How were we Agile?
Tools of the trade
Tools of the Trade
• Version Control and Refactoring
• Test Automation
• Communication and Governance
• Adaptability and Change Tolerance
• Assimilation
Version Control
• If you don’t control versions, they will control
you
• Data Models must become integrated with
the source control of the project
– In the same repository of project trunk and
branches
• You can’t just version a series of SQL files
separate from your data model
Our Version Experience
• We are using Subversion
• We are using Oracle Data Modeler as our
Modeling tool.
– It has very good integration with Subversion
– Our DBMS is SQL Server 2012
• Unlike other modeling tools, the data model
was able to be integrated in Subversion with
the rest of the project
ODM Shameless plug
• Free
• Subversion Integration
• Supports Logical and Relational data models
• Since it is free, the data models can be
shared and refined by all members of the
development team
• Currently on version 2685
How do we roll out versions?
• Create Data Model changes
• Use Red Gate SQL Compare to generate
alter script
– Generate a new DB and compare to the last
version to generate alter script
• 95% of changes deployed in less than 10
minutes
How do we roll out versions?
• We build on the Farley and Humble Blue-
Green Deployment model
– Blue – Current Version and Revision – Database
Name will be ‘ODS’
– Green – 1 Revision Old – Database Name will be
‘ODS-GREEN’
– Brown – 1 Major Version Old – Database Name
will be ‘ODS-BROWN’
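The rotation implied by this naming scheme can be sketched as follows. The function is my reading of the slide, not the team's actual deployment tooling, and the revision labels are invented:

```python
def rotate(databases, new_label, major=False):
    """Return the database-name mapping after deploying new_label.

    'ODS' (Blue) always points at the current revision; the previous
    revision becomes 'ODS-GREEN'; on a major version bump the outgoing
    major version is also kept as 'ODS-BROWN'.
    """
    out = dict(databases)
    if major:                            # keep the old major version as Brown
        out["ODS-BROWN"] = databases["ODS"]
    out["ODS-GREEN"] = databases["ODS"]  # previous revision becomes Green
    out["ODS"] = new_label               # Blue: the new current revision
    return out

names = {"ODS": "rev 7", "ODS-GREEN": "rev 6", "ODS-BROWN": "v1.final"}
after = rotate(names, "rev 8")
assert after == {"ODS": "rev 8", "ODS-GREEN": "rev 7", "ODS-BROWN": "v1.final"}
```

Keeping Green and Brown online is what makes rollback a rename rather than a restore.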
Versioning
• SQL change scripts are generated for all
changes
• A full script is generated for every major
version
– A new folder is created for every major
version
– Major version folders are named after the
Greek alphabet (alpha, beta, gamma)
SQL Script version naming standards
• [revision number]-[ODS/DW]-[I/A]-[version number]-
[subversion revision number of corresponding Data
model].sql
– Revision number – auto-incrementing
– Version Number – A999
• Alphabetic character represents major version – corresponds
with folder named after the Greek alphabet
• 999 indicates minor versions
– Subversion revision number of corresponding Data model – allows
for an exact synchronization between Data Model and SQL Scripts
• All objects are stored within one Subversion repository
– They all share the same revision numbering
SQL Script version naming standards
• For example:
– 0-ODS-I-A001-745.sql – initial db and table
creation for current ODS version (includes
reference data)
– 1-ODS-A-A001-1574.sql – revision 1 ODS alter
script that corresponds to data model subversion
revision 1574
– 2-ODS-A-A001-1590.sql - revision 2 ODS alter
script that corresponds to data model subversion
revision 1590
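A naming standard like this is easy to validate mechanically. The sketch below is a hypothetical checker; the regular expression is my transcription of the convention, exercised against the slide's own examples:

```python
import re

# Regex transcription of the script-naming convention (an assumption,
# derived from the examples, not the project's actual validator).
PATTERN = re.compile(
    r"^(?P<revision>\d+)"        # auto-incrementing revision number
    r"-(?P<target>ODS|DW)"       # which database the script targets
    r"-(?P<kind>[IA])"           # I = initial creation, A = alter script
    r"-(?P<version>[A-Z]\d{3})"  # major letter (Greek-alphabet folder) + minor
    r"-(?P<svn>\d+)\.sql$"       # Subversion revision of the data model
)

def parse(filename):
    """Split a script name into its parts, or reject it outright."""
    m = PATTERN.match(filename)
    if not m:
        raise ValueError(f"non-conforming script name: {filename}")
    return m.groupdict()

info = parse("1-ODS-A-A001-1574.sql")
assert info["svn"] == "1574" and info["kind"] == "A"
```

Embedding the data model's Subversion revision in the name is what makes the "exact synchronization" check above possible.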
SQL Script error handling
• Validation is done to prevent
– Scripts being run out of sequence
– Revision being applied without addressing
required refactoring
– Scripts being run on any environment but the
Blue environment
But what about Refactoring?
• Having Agile Information Maps has
significantly reduced refactoring
– This was an entirely new data domain for the
team
• Using the Blue-Green-Brown deployment
model has simplified required refactoring
• We have used the methods described by
Scott Ambler on the odd occasion
Good Start
Create the plan for how you
will re-factor
Refactoring Experience
• We haven’t needed to refactor much
• Since we are iteratively refining, we haven’t
had to re-define much
– Just adding more detail
– Main Information Maps have held together
Test Automation
• Enterprise was saved due to constantly
running tests on the warp engine
• Allowed for quick decision making
Automated Test Suite
• Leveraged the tSQLt Open Source
Framework
• Purchased SQL Test from Red Gate for an
enhanced interface
• Enhanced the framework to execute tests
from four custom tables we defined
Automated Test Suite
• Leveraged Data Mapping spreadsheet that
the automated tests used
– Two database tables were loaded from the
spreadsheet
– Two additional tables contained ETL test
cases
– 13 Stored Procedures executed the tests
– 3300+ columns mapped!
Table Tests
• TstTableCount: Compares record counts between source
data and target data.
• TstTableColumnDistinct: Compares counts on distinct values
of columns.
• TstTableColumnNull: Generates a report of all columns
where the contents of a field are all null.
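The real tests are tSQLt stored procedures; the plain-Python sketch below only illustrates what two of the table tests assert (the staging data is made up):

```python
def tst_table_count(source_rows, target_rows):
    """Fail if the target table did not land the same number of records."""
    assert len(source_rows) == len(target_rows), (
        f"count mismatch: source={len(source_rows)} target={len(target_rows)}")

def tst_table_column_null(target_rows, columns):
    """Report columns whose contents are null for every row."""
    return [c for c in columns
            if all(row.get(c) is None for row in target_rows)]

# Illustrative staging data: "legacy_code" was never mapped, so it
# shows up in the all-null report.
staging = [{"id": 1, "name": "a", "legacy_code": None},
           {"id": 2, "name": "b", "legacy_code": None}]
tst_table_count(staging, staging)  # passes: counts agree
assert tst_table_column_null(staging, ["name", "legacy_code"]) == ["legacy_code"]
```

An all-null column is usually a sign of a missing or broken ETL mapping, which is why it warrants its own report.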
Column Tests
• TstColumnDataMapping: Compares columns directly
assigned from a source column on a field by field basis for 5-10
rows in the target table.
• TstColumnConstantMapping: Compares columns assigned a
constant on a field by field basis for 5-10 rows in the target
table.
• TstColumnNullMapping: Compares columns assigned a Null
value on a field by field basis for 5-10 rows in the target table.
• TstColumnTransformedMapping: Compares transformed
columns on a field by field basis for 5-10 rows in the target
table.
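A hypothetical sketch of the TstColumnDataMapping idea, spot-checking a handful of rows of a directly mapped column (the column names are invented; the real tests drive this from the data-mapping spreadsheet):

```python
def tst_column_data_mapping(source, target, src_col, tgt_col, sample=10):
    """Compare a directly assigned column field by field for a sample of rows."""
    for src_row, tgt_row in list(zip(source, target))[:sample]:
        assert tgt_row[tgt_col] == src_row[src_col], (
            f"{tgt_col!r} != {src_col!r} for row {src_row}")

# Illustrative source/target rows with a direct column mapping.
source = [{"CUST_NM": "Kirk"}, {"CUST_NM": "Spock"}]
target = [{"customer_name": "Kirk"}, {"customer_name": "Spock"}]
tst_column_data_mapping(source, target, "CUST_NM", "customer_name")  # passes
```

Sampling 5-10 rows trades exhaustiveness for speed, which is what lets thousands of mapped columns be checked on every run.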
Data Quality Tests
• TstInvalidParentFKColumn: Tests that an Invalid Parent FK
value results in the records being logged and bypassed. This
record will be added to the staging table to test the process.
• TstInvalidFKColumn: Tests that an Invalid FK value results in
the value being assigned a default value or Null. This record
will be added to the staging table to test the process.
• TstInvalidColumn: Tests that an Invalid value results in the
value being assigned a default value or Null. This record will be
added to the staging table to test the process.
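The TstInvalidFKColumn behaviour can be sketched as a lookup with a fallback; the "unknown" default key below is an assumption for illustration, not the project's actual value:

```python
# Assumed "unknown" member of the dimension, used when a foreign key
# arriving from the source system cannot be resolved.
UNKNOWN_KEY = -1

def resolve_fk(value, dimension_keys, default=UNKNOWN_KEY):
    """Keep a valid FK; replace an invalid one with the default
    instead of failing the load."""
    return value if value in dimension_keys else default

dim_keys = {1, 2, 3}
assert resolve_fk(2, dim_keys) == 2
assert resolve_fk(99, dim_keys) == UNKNOWN_KEY  # invalid FK gets the default
```

The test then seeds the staging table with a deliberately bad key and asserts the load lands the default rather than the bad value.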
Process Integrity Tests
• TstRestartTask: Tests that a Task can be started from the
start and subsequent steps will run in sequence.
• TstRecoverTask: Tests that a Task can be re-started in the
middle and that records are processed correctly and subsequent
steps will run in sequence.
Interested?
• Leave me a business card and I’ll send you
the design document and stored procedures
Communication
Team Communication
• Frequent Data Model walkthroughs with
application teams
• Full access to the Data model through the
Data Modeling development tool
• Data Models posted in every room for
developers to mark up with suggestions
• Database deployment to play with for every
release
Client Communication
• Frequent Conceptual Data Model
walkthroughs with clients
– Includes presentation of scenarios with data
to confirm and validate understanding
• Collaboration on the iterative plan to ensure
they agree on the process and support it
Monthly Governance Meeting
– Visual Kanban boards reviewed
– Reports developed in the prior iterations were
demonstrated
– Business Areas were asked to submit a ranked
list of their top 10-20 data requirements/reports for
the next iteration.
Adaptability
Be Nimble
• Already discussed how we can roll out new
versions quickly
Change Tolerant Data Model
• Only add tables and columns when they are
absolutely required
• Leverage Data Domains so that attributes
are created consistently and can be changed
in unison
– Use limited number of standard domains
Change Tolerant Data Model
• Data Model needs to be loosely coupled and
have high cohesion
– Need to model the data and business and not
the applications or reports!
Change Tolerant Data Model
• Don’t model the data according to the
application’s Object Model
• Don’t model the data according to source
systems
• These items will change more frequently
than the actual data structure
• Your Data Model and Object Model should
be different!
Assimilate
• Assimilate Version Control, Communication,
Adaptability, Refinement, and Re-Factoring
into core project activities
– Stand ups
– Continuous Integration
– Check outs and Check Ins
• Make them part of the standard activities –
not something on the side
Our experience
Our Mission
• These practices and methods are being
used to redevelop an entire Business
Intelligence platform for a major ‘Blue’ Health
Benefits company
– Operational and Analytical Reports
• 100+ integration projects
• SAP Claims solution
Our Mission
• Integration projects are being run Agile
• 100+ team members across all projects
• SAP project is being run in a more
traditional manner
– ‘big-bang’ SAP implementation
• I’m now also fulfilling the role of an Agile PMO
Our Challenge
• How can we deploy to production early and
often when the system is a ‘big-bang’
implementation?
– We were ready to deploy ahead of clients and
other projects
– We were dependent on other conversion
projects
Our Challenge
• We are now exploring alternate ways to
deploy to production before the ‘big-bang’
implementation
– To allow the clients to use the reports and
iteratively refine them and the solution
– Also allows our team to validate data integrity
and quality iteratively
– We are now executing iterations to make this
possible
Our BI Solution
• SQL Server 2012
– Integration Services
– Reporting Services
• SharePoint 2010 Foundation
– SharePoint Integrated Reporting Solution
Our team
• Integrated team of
– 2 enterprise DBAs from the ‘Blue’
– 5 Data Analysts/DBAs/SSIS/SSRS developers
• Governance team comprised of
– Business Areas
– Systems Areas
– Stakeholders
Current Stardate
• We have completed the initial ODS and DW
development – including ETL
• We have completed a significant revision of
ODS, DW, and ETL – without major issues
• We are now finishing Report development –
reports have required database changes and
ETL changes – but no major changes!
– 300+ reports developed
Summary
• Use Agile Enterprise Data Models to provide
the initial vision and allow for refinements
• Strive for Iterations over Increments
• Align governance and prioritization with
iterations
• Plan and Integrate processes for Versioning,
Test Automation, Adaptability, Refinement
What doesn’t change?
Leadership
• “If you want to build a ship, don't drum up
people together to collect wood and don't
assign them tasks and work, but rather teach
them to long for the endless immensity of the
sea.” ~ Antoine de Saint-Exupéry
Leadership
• “[A goalie's] job is to stop pucks, ... Well, yeah, that's
part of it. But you know what else it is? ... You're
trying to deliver a message to your team that things
are OK back here. This end of the ice is pretty well
cared for. You take it now and go. Go! Feel the
freedom you need in order to be that dynamic,
creative, offensive player and go out and score. ...
That was my job. And it was to try to deliver a
feeling.” ~ Ken Dryden
Three awesome books