DESCRIPTION
Every day, consumers, businesses and not-for-profit organisations generate increasing volumes of data. Initiatives such as Smart Meters in the utilities sector, along with user generated 'Web 2.0' data sources and High Energy Physics, are causing exponential growth in available data. Many businesses seek to take advantage of this data to analyse business performance or understand trends in customer or prospect behaviour. This analysis often requires looking at very high volume, complex data sources. Bringing these together in a format that is easy for analysts to understand and query is often very challenging, particularly when business requirements for the data change and a rapid response can mean the difference between profit and loss.
This is just one of many areas where Data Integration (DI) tools and technologies are being applied, providing the 'plumbing' from a source system to a target system. DI tools are designed to offer an order of magnitude increase in developer productivity compared to using languages such as SQL, Java and .NET. This productivity allows developers to deliver more quickly, respond to changes faster or deliver more with fewer resources. According to Gartner, the market for such tools is estimated to grow to $2.7 billion by 2013, and is currently dominated by a handful of enterprise class vendors. However, a new crop of Data Integration tools is emerging, with a mix of open source and commercial offerings that each seek to challenge the dominance of the established players.
This talk will discuss the history of this area of technology to help understand the conditions we see today, offer a view of the future of the market and describe how these tools can help drive value within today's business and academic communities. At the end of the talk, attendees will have an opportunity to use one of the commercial tools and make up their own minds about the value of such technology.
Phil Watt is Principal Consultant at one of the world’s largest Systems Integrators, and has been working with high volume enterprise data for more than 17 years, building and designing data warehouses for customers in telco, media, utilities and financial services sectors. During the last 10 years, Phil has worked with a number of Data Integration technologies and advised many businesses about choosing a DI tool and applying best practices in their deployment.
1
Unlocking value from data with Data Integration Tools
Phil Watt, Principal Integration Architect, HP Business Intelligence Solutions, EMEA
29/04/2010
2
Outline
• Introduction
• Business drivers – why use a DI tool?
  • the challenge
  • private sector
  • public sector
• Background and history
  • DI tools timeline
• Emerging features – and value
• Governance and Best Practice
• Selecting a tool for your situation
• Demonstration
• Summary – followed by hands on session
3
About me
• 19 years big data
• 10 years Data Integration tools
• High volume
• Complex business rules
• Governance and metadata management
• Clients include: BSkyB, BT, Barclays/Barclaycard, Centrica, Experian, John Lewis Partnership, Microsoft, a major UK political party
• Strong focus on pragmatic delivery
• Best practices
• Design patterns
• Tool evaluation, selection and implementation
4
Scope
In scope
• Data plumbing
  • moving data around, and making it more useful to certain stakeholders
• Tools that help to
  • get data out of databases
  • get data into databases
  • transform data following some business rules
Out of scope
• Database technologies
  • OLTP vs OLAP
  • Column versus row based storage
  • NoSQL movement (Hadoop, Cassandra, etc.)
• Information security
5
Glossary
Data Integration
Data Governance
Master Data Management (MDM)
Data Dictionary
Data Lineage
Data Discovery/Data Profiling
6
The challenge
Data growth
• 60% annual global data growth through to 2012 (IDC research)
• New sources of machine generated data will see this increase rapidly, e.g. telemetry: new energy smart meters mean a x4000 growth in readings
Business drivers
• Increased complexity of business requirements; diverse sources and complex data
• Consistent application of business terms across the enterprise
• Time To Market (TTM) is a critical success factor
• Reduce costs/improve productivity
• Reduce power consumption
Collaboration
• Onshore versus offshore delivery teams
Variable data quality
• Data is often captured for one specific reason, then used or repurposed for different reasons
Cannot learn anything from data alone*
• The model must inform the analysis
• If the data does not support the model, then adjust the model
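To make the 60% annual growth figure on this slide concrete, the short sketch below compounds it year on year. The 2010 baseline of one unit is an assumption chosen for illustration (the slide gives no absolute volume), and `projected_volume` is a hypothetical helper name.

```python
def projected_volume(start_volume: float, annual_growth: float, years: int) -> float:
    """Volume after `years` of compound growth; annual_growth=0.60 means 60% a year."""
    return start_volume * (1.0 + annual_growth) ** years


# Assumed baseline: 1 unit of data in 2010 (not a figure from the slides).
baseline = 1.0
for year in range(2010, 2013):
    volume = projected_volume(baseline, 0.60, year - 2010)
    print(f"{year}: {volume:.2f}x the 2010 volume")
```

At that rate, data volume roughly two and a half times over in two years, which is why load windows and storage plans sized for today's data tend to be outgrown quickly.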
7
Data warehouse example sizes
[Bar chart: data warehouse sizes in petabytes (axis 0 to 12) for Yahoo*, eBay, Facebook, Wal-mart, LHC and National ID Cards*; bar values are not recoverable from the extracted text]
8
Public and academic examples
Birmingham City Council
• http://www.experian.co.uk/www/pages/about_us/our_clients/
• http://www.qas.co.uk/company/press/new-experian-software-helps-public-sector-to-enhance-single-citizen-view-projects-503.htm
University of Toulouse – academic medical research
• http://www.talend.com/open-source-provider/casestudy/CaseStudy_Academic_Medical_Research_EN.php
9
Benefits of DI tools
Productivity improves dramatically
• Vendors often claim an order of magnitude improvement
  • that is, coding activities alone
• 50% improvement is realistic when considering other non-coding activities
Improve understanding of the overall business using built in metadata management tools
• build data dictionaries more easily
• support and drive data governance
Built in scalability
• Parallel processing – component, pipeline and data
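Of the three kinds of parallelism named on this slide, data parallelism is the simplest to sketch: split the records into partitions and apply the same transformation to each partition concurrently. The example below is an illustration only, not how any particular DI tool implements it. The `transform` rule and all function names are hypothetical, and threads are used purely to keep the sketch short; a real engine would typically spread partitions across processes or servers.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_partitions):
    """Split records round-robin into n_partitions lists."""
    return [records[i::n_partitions] for i in range(n_partitions)]


def transform(record):
    """Hypothetical business rule: tidy up a customer name."""
    return record.strip().title()


def transform_partition(part):
    """Apply the same rule to every record in one partition."""
    return [transform(r) for r in part]


def parallel_transform(records, n_partitions=4):
    parts = partition(records, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = list(pool.map(transform_partition, parts))
    # Flatten per-partition results; the original record order is not preserved.
    return [row for part in results for row in part]


names = ["  alice smith", "BOB JONES ", "carol white", " dave BROWN"]
print(sorted(parallel_transform(names, n_partitions=2)))
```

Component parallelism (independent steps running side by side) and pipeline parallelism (a downstream step starting before the upstream step has finished) are not shown here.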
10
Extract, Transform and Load
Extract → Transform → Load
Diagram labels: e.g. CRM or ERP system; hub and spoke; shared DW and ETL server
11
Extract, Load and Transform
Extract → Load → Transform
Diagram labels: e.g. CRM or ERP system; shared DW and ETL server
12
ETL versus ELT
ETL
• Transformations often faster
• No reliance on database performance limitations
• Typically scale better
ELT
• Avoids unloading large datasets for transformations and aggregations
• Best used with high performance analytical database systems such as:
  • Netezza, Neoview, Oracle Exadata, Teradata, Greenplum, etc.
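The difference between the two approaches is mainly where the transformation runs. The sketch below illustrates it with two in-memory SQLite databases standing in for a source system and a warehouse. The table names, sample rows and function names are all assumptions made for the example, and SQLite is only a convenient stand-in for the analytical databases listed above.

```python
import sqlite3


def make_source():
    """Hypothetical source system (stand-in for a CRM or ERP database)."""
    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    src.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("acme", 10.0), ("acme", 15.0), ("globex", 7.5)],
    )
    return src


def etl(src, target):
    """ETL: extract rows, aggregate outside the database, load only the result."""
    rows = src.execute("SELECT customer, amount FROM orders").fetchall()  # extract
    totals = {}
    for customer, amount in rows:  # transform in the integration layer
        totals[customer] = totals.get(customer, 0.0) + amount
    target.execute("CREATE TABLE totals_etl (customer TEXT, total REAL)")
    target.executemany("INSERT INTO totals_etl VALUES (?, ?)", sorted(totals.items()))  # load


def elt(src, target):
    """ELT: load raw rows into staging, then transform with SQL inside the target."""
    rows = src.execute("SELECT customer, amount FROM orders").fetchall()  # extract
    target.execute("CREATE TABLE staging_orders (customer TEXT, amount REAL)")
    target.executemany("INSERT INTO staging_orders VALUES (?, ?)", rows)  # load
    target.execute(  # transform, pushed down to the target database
        "CREATE TABLE totals_elt AS "
        "SELECT customer, SUM(amount) AS total FROM staging_orders GROUP BY customer"
    )


source = make_source()
warehouse = sqlite3.connect(":memory:")
etl(source, warehouse)
elt(source, warehouse)
etl_rows = warehouse.execute("SELECT * FROM totals_etl ORDER BY customer").fetchall()
elt_rows = warehouse.execute("SELECT * FROM totals_elt ORDER BY customer").fetchall()
print(etl_rows == elt_rows, etl_rows)
```

Both routes produce the same totals. In the ETL version the aggregation cost falls on the integration server; in the ELT version it falls on the target database, which is why ELT suits platforms built for heavy analytical queries.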
13
Multiple sources and targets
14
DI Tools Features Timeline 1995 – 2005
Parallelism
SCD
EAI/Message Queues
Connectors
Data Lineage
Config Mgmt
Business Metadata
CWM
Data Governance
MDM
1994 1996 1998 2000 2002 2004 2006
15
DI Tools Features Timeline from 2006
SOAP/WSDL
CDC
Screen Scrapers
Test management
CEP
Push Down Processing
Semantic Metadata
Rich Dashboards
Analyst Tools
Self Service DI
2006 2007 2008 2009 2010 2011
16
Market features
Industry consolidation
• Niche players acquired by established vendors
• Watch out for product bloat
Price pressures / pricing complexity
• Open Source versus pure commercial
• Credit crunch
• Established vendors often have complex pricing models
Focus on optimising workflow
• Increase productivity / Reduce time to market
• Moving to self service for ‘purple people’
UK market very different to US
• Cool tech not enough for UK: must have strong business case
17
Gartner Magic Quadrant
Taken from research document, ‘Magic Quadrant for Data Integration Tools’
Authors: Ted Friedman, Mark A. Beyer, Eric Thoo
Full report available by registering at www.talend.com
Image removed for web publication as agreed with Gartner
18
Magic Quadrant Disclaimer
The Magic Quadrant is copyrighted November 25, 2009 by Gartner, Inc. and is reused with permission. The Magic Quadrant is a graphical representation of a marketplace at and for a specific time period. It depicts Gartner's analysis of how certain vendors measure against criteria for that marketplace, as defined by Gartner. Gartner does not endorse any vendor, product or service depicted in the Magic Quadrant, and does not advise technology users to select only those vendors placed in the "Leaders" quadrant. The Magic Quadrant is intended solely as a research tool, and is not meant to be a specific guide to action. Gartner disclaims all warranties, express or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
19
Best practices
20
Worst Practices
21
Gartner advice
• Allocate minimum 20% to data source analysis
• Allocate 20 - 30% to mapping and transformation rules
• Avoid custom-coding or desktop tools
• Increase business user involvement to improve success
Source: ‘Best Practices Mitigate Data Migration Risks and Challenges’ – May 2009
22
Governance and the data integration lifecycle
23
Best practices
Do:
• Spend 50% of project time doing discovery, analysis, design
• Get business users involved early and often
• Use tools to accelerate and compress timescales
• Pay attention to governance and metadata
So you can:
• De-risk the project
• Reduce overall cost and timescales
• Achieve best possible quality
24
Selecting a tool for your situation
2 stage process
• Paper based shortlist
• On site Proof Of Concept (POC)
Understand the vendor roadmap
• Match to your requirements
• Try to anticipate your needs over the next 3-5 years
Do it yourself or outsource?
• Is there an SI ecosystem for the vendor's product?
Get help to choose and upskill
• Find a partner that fits your culture and has the right skills
25
Qualification matrix (PW)
26
Demonstration
33
Demo metrics
Performance
• Hardware – dual core 2.0 GHz Intel Centrino, 2.5 GB RAM
• Environment – WinXP, Oracle Express (DB) + DI tool (Expressor 2.0)
3 data sources
• Customers: 155 MB, 1000K records
• Today’s orders: 112 MB, 100K records
• Yesterday's orders: 0.3 MB, 3K records
• Total data volume: 267 MB, 1.1M records
• Execution time: 72 seconds
• Throughput: 3.7 MB/sec, 41k/sec
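The totals and throughput on this slide can be re-derived from the per-source figures, which is a useful habit when comparing tools in a proof of concept. The sketch below uses only the numbers stated on the slide; the variable names are illustrative.

```python
# Figures as stated on the slide.
sources_mb = {"customers": 155.0, "todays_orders": 112.0, "yesterdays_orders": 0.3}
sources_records = {"customers": 1_000_000, "todays_orders": 100_000, "yesterdays_orders": 3_000}
execution_seconds = 72

total_mb = sum(sources_mb.values())
total_records = sum(sources_records.values())
mb_per_sec = total_mb / execution_seconds
records_per_sec = total_records / execution_seconds

print(f"Total volume: {total_mb:.1f} MB, {total_records:,} records")
print(f"Throughput: {mb_per_sec:.1f} MB/sec, {records_per_sec:,.0f} records/sec")
```

The volume-based figure agrees with the 3.7 MB/sec quoted. The record rate computed this way comes out at roughly 15k records/sec, lower than the 41k/sec on the slide, so that figure may have been measured on a different basis (for example counting records processed across several steps); the slides do not say.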
34
Demo features
Developer Productivity
• Graphical development
• Semantic Rationalisation and Re-usable Business Rules
Demo represents a generic business scenario
• XML, message queues (MSMQ), database inputs/outputs, joins, aggregations and referential integrity management
• Similar features to the ATG/Integrated Basket challenges?
35
Summary
• Business drivers – why use a DI tool?
  • the challenge
  • private sector
  • public sector
• Background and history
  • DI tools timeline
• Emerging features – and value
• Governance and Best Practice
• Selecting a tool for your situation
• Demonstration
36
Questions
37
References
• Curt Monash: http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
• Wired: http://www.wired.com/wired/archive/12.04/grid.html
• ZDNet: http://blogs.zdnet.com/storage/?p=213
• Professor Chris Bishop: http://conferences.theiet.org/lectures/turing/
• Gartner: http://www.gartner.com
• LHC data (2007): http://www-conf.slac.stanford.edu/xldb07/xldb_lhc.pdf