Addressing Data Chaos: Using MySQL and Kettle to Deliver World-Class Data Warehouses
Matt Casters: Chief Architect, Data Integration and Kettle Project Founder
MySQL User Conference, Wednesday April 25, 2007
Agenda
Big News
Data Integration challenges and open source BI adoption
Pentaho company overview
Pentaho Data Integration fundamentals: schema design, Kettle basics, demonstration
Resources and links
Announcing Pentaho Data Integration 2.5.0
Again we offer big improvements over smash-hit version 2.4.0:
Advanced error handling
Tight Apache VFS integration: allows us to directly load and save files from any location: file systems, web servers, FTP sites, ZIP files, tar files, etc.
Dimension key caching, dramatically improving speed
A slew of new job entries and steps, including MySQL bulk operations (see the sketch below)
Hundreds of bug fixes
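The new MySQL bulk operations build on statements like MySQL's LOAD DATA INFILE; a minimal sketch of the underlying SQL, with a hypothetical file path and target table:

    -- MySQL's native bulk-load path; file path and table name are hypothetical
    LOAD DATA LOCAL INFILE '/tmp/order_line.txt'
    INTO TABLE order_line_fact
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n';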
Managing Data Chaos: Data Integration Challenges
Data is everywhere: customer order information in one system, customer service information in another
Data is inconsistent: the record of the customer is different in each system
Performance is an issue: running queries to summarize 3 years of data in the operational system takes forever, AND it brings the operational system to its knees
The data is never ALL in the data warehouse: acquisitions, Excel spreadsheets, new applications
[Diagram: source systems (customer service history, customer order history, marketing campaigns, an acquired XML system) feeding the data warehouse]
How Pentaho Extends MySQL with ETL
MySQL Provides
Data storage
SQL query execution
Heavy-duty sorting, correlation, aggregation
Integration point for all BI tools
Kettle Provides
Data extraction, transformation, and loading
Dimensional modeling
SQL generation
Aggregate creation
Data enrichment / calculations
Data migration
Sample Companies that Use MySQL and Kettle from Pentaho
“With professional support and world-class ETL from Pentaho, we've been able to simplify our IT environment and lower our costs. We were also surprised at how much faster Pentaho Data Integration was than our prior solution.”
“We selected Pentaho for its ease-of-use. Pentaho addressed many of our requirements -- from reporting and analysis to dashboards, OLAP and ETL, and offered our business users the Excel-based access that they wanted.”
“We chose Pentaho because it has a full range of functionality, exceptional flexibility, and a low total cost of ownership because of its open source business model. We can start delivering value to our business users quickly with embedded, web-based reporting, while integrating our disparate data sources for more strategic benefits down the road.”
Other Kettle Users
And thousands more…
Pentaho Introduction
World’s most popular enterprise open source BI suite
2 million lifetime downloads, averaging 100K / month
Founded in 2004: pioneer in professional open source BI
Key projects: JFreeReport (reporting), Kettle (data integration), Mondrian (OLAP), Pentaho BI Platform, Weka (data mining)
Management and board: proven BI veterans from Business Objects, Cognos, Hyperion, SAS, and Oracle; open source leaders Larry Augustin, New Enterprise Associates, Index Ventures
MySQL Gold Partner
Overview: Data Warehouse Data Flow
From source systems …
to the data warehouse …
to reports …
to analyses …
to dashboard reports …
to better information
Pentaho Introduction
[Diagram: operational-to-strategic pyramid, with sales, marketing, inventory, financial, and production data feeding departmental reports, aggregates, analysis, and scorecards]
The star schema: a new data model is needed
Because data from various sources is “mixed” we need to design a new data model: a star schema.
A star schema is designed based on the requirements and populated by the ETL engine.
During modeling we split the requirements into Facts and Dimensions:
Category   | 2001       | 2002       | 2003       | Total
Laptop     | 2.800.726  | 5.272.243  | 2.295.147  | 10.368.116
Monitor    | 138.681    | 297.037    | 145.263    | 580.981
PC         | 2.260.053  | 3.893.171  | 1.784.220  | 7.937.444
Peripheral | 3.028.527  | 5.966.100  | 2.857.026  | 11.851.653
Printer    | 2.795.736  | 5.566.608  | 2.285.188  | 10.647.532
Server     | 2.210.015  | 3.591.230  | 2.044.897  | 7.846.142
Total      | 13.233.738 | 24.586.389 | 11.411.741 | 49.231.868
Dimensions: Category and Year. Facts: the turnover figures. A query like the sketch below produces this layout.
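A report with this layout falls out of a simple star schema query: join the fact table to its dimensions and group by the dimension attributes. A minimal sketch, assuming the order line fact table described below and hypothetical dim_product and dim_date tables carrying category and year:

    -- Turnover per category and year; dimension table and attribute names are assumptions
    SELECT p.category,
           d.year,
           SUM(f.turnover) AS turnover
    FROM   order_line_fact f
    JOIN   dim_product     p ON p.product_tk = f.product_tk
    JOIN   dim_date        d ON d.date_tk    = f.date_tk
    GROUP BY p.category, d.year;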
The star schema: a new data model is needed
After grouping the dimension attributes by subject we get our data model. For example:
[Star schema diagram: an Order Line fact table linked to Customer, Product, Order, and Date dimensions]
Overview: A new data model is needed
The fact table contains ONLY facts and dimension technical keys
Column             | Type          | Data type
date_tk            | Technical key | Bigint
customer_tk        | Technical key | Bigint
order_tk           | Technical key | Bigint
product_tk         | Technical key | Bigint
number_of_products | Fact          | Smallint
turnover           | Fact          | Float
pct_discount       | Fact          | Tinyint
discount           | Fact          | Float
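In MySQL this translates directly into DDL. A minimal sketch; the table name order_line_fact is an assumption, the columns and types come from the list above:

    CREATE TABLE order_line_fact (
      date_tk            BIGINT,     -- technical key to the date dimension
      customer_tk        BIGINT,     -- technical key to the customer dimension
      order_tk           BIGINT,     -- technical key to the order dimension
      product_tk         BIGINT,     -- technical key to the product dimension
      number_of_products SMALLINT,   -- fact
      turnover           FLOAT,      -- fact
      pct_discount       TINYINT,    -- fact
      discount           FLOAT       -- fact
    );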
Overview: A new data model is needed
• The dimensions contain technical fields, typically like in this customer dimension entry for customer_id = 100:

TK | Version | date_from | date_to | cust_id | name    | NAL*      | Birth_date
10 | 1       |           |         | 100     | Matt C. | Address 1 | 1900-01-01

* NAL = Name, Address & Location
Overview: A new data model is needed

• If the address changes (at time T1) we get a new entry in the dimension. This is called a Ralph Kimball type II dimension update:

TK | Version | date_from | date_to | cust_id | name    | NAL*      | Birth_date
10 | 1       |           | T1      | 100     | Matt C. | Address 1 | 1900-01-01
54 | 2       | T1        |         | 100     | Matt C. | Address 2 | 1900-01-01

* NAL = Name, Address & Location
• If the birth_date changes we update all entries in the dimension. This is called a Ralph Kimball type I dimension update:

TK | Version | date_from | date_to | cust_id | name    | NAL*      | Birth_date
10 | 1       |           | T1      | 100     | Matt C. | Address 1 | 1969-02-14
54 | 2       | T1        |         | 100     | Matt C. | Address 2 | 1969-02-14
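In SQL terms, the two update styles differ in what happens to existing rows. A minimal sketch against a hypothetical customer_dim table matching the entries above; the technical key is assumed to be auto-generated, and T1 is shown as a literal date:

    -- Type II (address change): close the current version, then add a new one
    UPDATE customer_dim
    SET    date_to = '2007-04-25'              -- T1, a hypothetical change date
    WHERE  cust_id = 100 AND date_to IS NULL;

    INSERT INTO customer_dim (version, date_from, cust_id, name, address, birth_date)
    VALUES (2, '2007-04-25', 100, 'Matt C.', 'Address 2', '1900-01-01');

    -- Type I (birth_date change): overwrite the attribute in every version
    UPDATE customer_dim
    SET    birth_date = '1969-02-14'
    WHERE  cust_id = 100;

In practice, Kettle's dimension lookup/update step performs this bookkeeping for you.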
Implications
We are making it easier to create reports by using star schemas
We are shifting work from the reporting side to the ETL
We need a good toolset to do ETL because of the complexities
We need to turn everything upside down
… and this is where Pentaho Data Integration comes in.
Data Transformation and Integration Examples
Data filtering: is not null, greater than, less than, includes
Field manipulation: trimming, padding, upper- and lowercase conversion
Data calculations: + - × /, average, absolute value, arctangent, natural logarithm
Date manipulation: first day of month, last day of month, add months, week of year, day of year
Data type conversion: string to number, number to string, date to number
Merging fields & splitting fields
Data lookups: look up in a database, in a text file, an Excel sheet, …
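Kettle performs these operations in transformation steps, but when it is convenient to push the work into MySQL instead, many have direct SQL equivalents; an illustrative sketch with hypothetical table and column names:

    SELECT TRIM(name),                              -- field manipulation
           UPPER(name),
           ABS(discount), LN(turnover),             -- calculations
           LAST_DAY(order_date),                    -- date manipulation
           DATE_ADD(order_date, INTERVAL 1 MONTH),
           WEEK(order_date), DAYOFYEAR(order_date),
           CAST(turnover AS CHAR)                   -- type conversion
    FROM   customer_orders;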
Pentaho Data Integration (Kettle) Components
Spoon: connect to data sources; define transformation rules and design target schema(s); graphical job execution workflow engine for defining multi-stage and conditional transformation jobs
Pan: command-line execution of single, pre-defined transformation jobs
Kitchen: scheduler for multi-stage jobs
Pentaho BI Platform: integrated scheduling of transformations or jobs; ability to call real-time transformations and use output in reports and dashboards
Demonstration:
- create a MySQL db + repository
- create dimensions
- create facts
- auditing & incremental loading
- jobs
Case Study: Pentaho Data Integration
Organization: Flemish Government Traffic Centre
Use case: Monitoring the state of the road network
Application requirement: integrate minute-by-minute data from 570 highway locations for analysis
Technical challenges: large volume of data, more than 2.5 billion rows
Business usage: users can now compare traffic speeds based on weather conditions, time of day, date, season
Best practices:
Clearly understand business user requirements first
There are often multiple ways to solve data integration problems, so consider the long-term need when choosing the right way
Case Study: Replacement of Proprietary Data Integration
Organization: large, public, North American genetics and pharmaceutical research firm
Application requirement: data warehouse for analysis of patient trials and research spending
Incumbent BI vendor: Oracle (Oracle Warehouse Builder)
Decision criteria: ease of use, openness, cost of ownership
“It was so much quicker and easier to do the things we wanted to do, and so much easier to maintain when our users’ business requirements change.”
Best practices:
Evaluate replacement costs holistically
Treat migrations as an opportunity to improve a deployment, not just move it
Good deployments are iterative and evolve regularly; if users like what you give them, they will probably ask for more
Summary and Resources
Pentaho and MySQL can help you manage your data infrastructure: extraction, transformation, and loading for data warehousing and data migration
kettle.pentaho.org: Kettle project homepage
kettle.javaforge.com: Kettle community website: forum, source, documentation, tech tips, samples, …
www.pentaho.org/download/: all Pentaho modules, pre-configured with sample data; developer forums and documentation; Ventana Research Open Source BI Survey
www.mysql.com:
White paper: http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
Kettle webinar: http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php
Roland Bouman's blog on Pentaho Data Integration and MySQL: http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html