Upload
hoangliem
View
221
Download
2
Embed Size (px)
Citation preview
Data Modeling’s Role in the Evolving World of Data Management
Donna BurbankCA Technologies
PAGE 2
Who am I?
• More than more than 15 years of experience in the areas of data management, metadata management, and enterprise architecture. – Currently is the Senior Director of Product Marketing for CA’s
data modeling solutions. – Brand Strategy and Product Management roles at Computer
Associates and Embarcadero Technologies – Senior consultant for PLATINUM technology’s information
management consulting division in both the U.S. and Europe. – Worked with dozens of Fortune 500 companies worldwide in the
U.S., Europe, Asia, and Africa and speaks regularly at industry conferences.
– Co-author of several books including:• Data Modeling for the Business• Data Modeling Made Simple with CA ERwin Data Modeler r8
PAGE 3
Agenda
• Culture shift
• New Technologies and Methodologies
– Big Data
– Cloud Computing
– Data Vaults
– Tagging
– Agile
• How Do Traditional Data Management Practices Fit?
PAGE 4
Technological and Culture Shift in Society
1960 1970 1980 1990 2000 2010
PAGE 5
Technological and Culture Shift in Data Management
1960 1970 1980 1990 2000 2010
• Mainframes• Flat Files• In-House Development• Waterfall Methodology
• Individual PC’s• Relational Databases• Democratization of Computing
• Client/Server Computing• Relational Databases• Data Warehousing• Packaged Applications• RAD Development
• Dot COM Revolution• Changing Business Models
• Dot COM Bubble Bursts • Data Integration• “Do More with Less”
• Cloud Computing• “Big Data”, NoSQL• Data Vaults• Agile Development
PAGE 6
Paradigm Shift
Traditional
• Top-Down, Hierarchical• “Design, then Implement”• “Passive”, Push technology• “Manageable” volumes of information• “Stable” rate of change
Modern
• Distributed, Democratic• Real-time Analysis• Collaborative, Interactive• Massive volumes of information• Rapid and Exponential rate of change
PAGE 7
Emergence
In philosophy, systems theory, science, and art, emergenceis the way complex systems and patterns arise out of a
multiplicity of relatively simple interactions. - Wikipedia
I love my new Levis jeans.
Is Levi coming to the party?
Sale #LEVIS 20% at Macys.
LOL. TTYL. Leving soon.
Relational Databases vs.
“Big Data” and NoSQL
PAGE 9
“Big Data” Survey
• How familiar are you with "Big Data" technologies?
A. Very familiar. We are using Big Data technologies now.
B. Somewhat. I know what it is, but am not using it.
C. Not at all. This is new to me.
PAGE 10
Relational Databases vs. “Big Data” and NoSQL
• With the rise of massive volume, real-time data streams, traditional relational database technologies cannot scale.– Google, Yahoo, Facebook, Twitter, MySpace
– Corporate Data: email logs, click stream data, medical diagnostics research, etc.
– Huge volumes: With Google, for example, you’re basically “ingesting” the entire internet.
• Issues that arise with Big Data– Can’t store in a single, relational database Too Big
– Can’t easily index Too Big
– Can’t write traditional SQL Unstructured
– Can’t design data structures “top-down” Data is constantly in flux and in creation.
PAGE 11
Relational Databases vs. “Big Data” and NoSQL
• Big Data Solutions address these issues by:– Distributed data storage across servers
– Stores data in files
– Post-creation analysis, real-time
– Data is structured as it is accessed
– Natural-language, application-level processing
– Distributed hash tables – pull record by name, similar to application programming
PAGE 12
Do Relational Databases Go Away?
• What Relational Databases are good for:– Structured data, that can be designed “up front” with predictable queries
– The same reasons you always have, actually:
• Ease of query and retrieval
• Ease of update - Reducing redundancy
• Increase data quality
• What “Big Data” solutions are good for:– Analyzing raw, real-time data BEFORE it goes into a Data Warehouse or Transactional Database
– Analyzing large volumes of unstructured data that is updated in real-time
I love my new Levis jeans.
Is Levi coming to the party?
Sale #LEVIS 20% at Macys.
LOL. TTYL. Leving soon.
Customer Sentiment Database
PAGE 13
Some Technologies + Terms for Big Data
• Big Data Software Frameworks:– Hadoop: An open source Apache project
– MapReduce: Google’s solution
• Memcache– Memory caching system for large-scale data-driven websites
– Big data’s equivalent of MySQL behind a web site
– With large data sets, can’t run a query each time:
• Memory layer between database and web server
• Fetches data on-demand
• MapReduce– MapReduce is the key algorithm that Hadoop uses to analyze data across servers
– Programmatic, not SQL-based
– A map transform is provided to transform an input data row of key and value to an output key/value: map(key1,value) -> list<key2,value2>
– Key aspect is that every transform is independent and the operation can be run in parallel on lists of data distributed across servers
PAGE 14
Some Technologies + Terms for Big Data
• Hive– Data warehouse infrastructure which allows SQL-like ad-hoc querying of data (in any format)
stored in Hadoop
– Not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs).
• Hbase (Apache), BigTable (Google)– “Database” for Big Data
– Not relational, distributed multi-dimensional sorted map
On Premisevs.
Cloud
PAGE 16
Cloud Survey
• How familiar are you with Cloud Database technologies?
A. Very familiar. We are storing data in the cloud now.
B. Somewhat. I know what they are, but am not using them.
C. Not at all. This is new to me.
PAGE 17
On-Premise vs. Cloud
• Historically, corporate data has been stored on-premise
– Originally centralized
– Then distributed
– But still managed and owned by the company – directly purchasing hardware and software
• With the emergence of Cloud Computing, organizations can “lease” database stored as they need it
– Fewer in-house resources needed
– Ability to expand and contract usage as needed
• Driving Innovation
– With Cloud services, small, start-up companies have lower cost of entry
PAGE 18
In the Cloud Realm IT is ‘rented’ on a subscription basis
• SaaS: Software-as-a-Service
– On-Demand Software Delivery
– Examples: Web mail such as Yahoo or Google, ERP and CRP systems such as Salesforce.com
• PaaS: Platform-as-a-Service
– Development platform and solution stack as a service.
– Rent hardware, operating systems, storage and network capacity over the Internet.
– Examples: Force.com, Google
• IaaS: Infrastructure-as-a-Service
– Delivery of the computing infrastructure (hardware) as a fully outsourced service.
– Examples: Google, IBM, Amazon.com etc.
• DaaS: Database-as-a-Service
– Provides traditional database features, typically data definition, storage and retrieval, on a subscription basis over the web.
– Examples: Microsoft SQL Azure, database.com, Amazon SimpleDB
PAGE 19
The Precedent Exists
• How many households or companies make their own electricity today?
• Instead, purchase “Electricity As-a-Service” (EaaS)
PAGE 20
How do Traditional Data Models Fit?
• A Data Model is your “roadmap”for:
– What data to move to the cloud, and what to keep on-premise
– Defining data structures (physical model) and business requirements (logical model) for cloud databases
• Off-Premise doesn’t need to mean Out of your Control
MySQL
DB2
Teradata
SQL Server Sybase
Oracle
MSSQL Azure
Data Warehousesvs.
Data Vaults
PAGE 22
Data Vault Survey
• How familiar are you with Data Vault technology?
A. Very familiar. We are using Data Vaults now.
B. Somewhat. I know what it is, but am not using it.
C. Not at all. This is new to me.
PAGE 23
Data Warehouses vs. Data Vaults
• The Traditional Data Warehouse
– Based on up-front analysis & design
– Creation of central data model and database
– Data is cleansed and transformed through ETL to central store
• Data Vault's philosophy is that all data is relevant data, even if it is "wrong“
– Based on raw data sets
– Use of Links: Everything is a many-to-many relationship (no RI)
– Complete Auditability and Traceability (as a result of Sarbanes-Oxley, Basel II, etc.)
– Enable re-creation of a source system based on a specific point in time
– Claim is that traditional modeling is inflexible:
• When business rules change, the cardinality & model need to change –> also cascades to child entities
PAGE 24
Data Vaults
• Based on Parallel Processing
– Split the work so that database engines can optimize
• Index coverage
• Data redundancy
• Parallel query
• Resource utilization
– Each table has its own I/O channel associated with it
• Hypercubes• Links and Hubs: Hubs provide the keys. Links describe the keys
• Using a weighted graph helps you determine which nodes are most important
• Problem with existing ETL is that everything needs to be done at load-time not always realistic
• “Divide and Conquer—do transformations, etc. when load individual marts
PAGE 25
Is Data Modeling Still Relevant? – Yes
Source: www.danlinstedt.com
PAGE 26
Is Data Modeling Still Relevant? – Yes
Source: www.danlinstedt.com
PAGE 27
Is the Traditional Modeling Process Still Relevant? - No
• No “up-front” design
• Instead, analyzing large volumes of data in “real time”
– No up-front design, analyzing raw data “as is” to discover patterns
– Distributed processing
• Sound similar to “Big Data”? It is
Hierarchies vs.
Hash Tags
PAGE 29
Traditional Naming Standards
• In traditional data modeling and management, naming standards are prescribed, enforced, and structured
PAGE 30
A New Paradigm - Tagging
PAGE 31
Error-Free?
PAGE 32
What works?
• Traditional Naming Standards, Data Modeling Best Practices are good for up-front design and data quality– Clear naming rules, e.g. Womans Hiking Boot (modifier, prime word, class word)
– Clear definitions, e.g. “A boot is an item of footwear with close toes, which covers the calf, often with a raised heel.”
– Supertypes and subtypes
– Data cleansing (boat or boot?)
• But Remember Emergence– With the volumes of real-time data, it’s impossible to model it up-front first
– Overtime, data quality improves through self-correction and “emergence”
– After all, generally, Google searches are pretty good.
• As with the other examples, the organizational reality will be a mix of technologies.
Waterfall vs.
Agile
PAGE 34
Agile Survey
• How familiar are you with Agile methodologies?
A. Very familiar. We are using Agile now.
B. Somewhat. I know what it is, but am not using it.
C. Not at all. This is new to me.
PAGE 35
What is Agile?
Agile Software Development is a group of software
development methodologies based on iterative and
incremental development, where requirements and solutions evolve through
collaboration between self-organizing, cross-functional
teams. - Wikipedia
PAGE 36
Waterfall vs. Agile
Waterfall
• Up-Front Design• Often Long Development
Cycles•Rigorous analysis of business
requirements• Documentation• “Top Down” management,
e.g. data standards team
Agile
• Little up-front design• Short Development Cycles
(sprints)•“Learn as you go”
requirements• Little documentation “The
documentation is the code”• Team-focused, all contribute•Traditional application-
development focused (i.e. not data)
PAGE 37
Recommendation: Be more “Agile”, but not necessarily using 100% of the Agile Methodology
• Don’t Boil the Ocean
– Focus on small, manageable project that will show incremental value
– The days of the one year development cycle to build an enterprise model are gone
• Involve all Stakeholders
– All contribute: Engage in frequent conversations with business users, developers, DBAs, etc.
– The days of “the data standards team will tell you the rules” are gone (if they ever existed!)
• Focus on the Business Requirements—key to any methodology
– The goal is to support the organization’s needs, not implement a particular technology or methodology
– Do document requirements. A conceptual data model, for example, can apply to any technology
PAGE 38
A Data Model is a Communication Tool:Data Model as Mediator
• A data model is a valuable tool to communicate core requirements across technologies to a wide range of stakeholders.
SAP
DB2
DB2Oracle
Teradata
IDMS
SQL
Server
Data Model
Developers Business
Sponsors
Data Architects
ETC!ETC!
DBAsMSSQL
Azure
How has the Social and IT Culture Evolved?
PAGE 40
Paradigm Shift
Traditional
• Top-Down, Hierarchical• “Design, then Implement”• “Passive”, Push technology• “Manageable” volumes of information• “Stable” rate of change
Modern
• Distributed, Democratic• Real-time Analysis• Collaborative, Interactive• Massive volumes of information• Rapid and Exponential rate of change
PAGE 41
Summary
• We are in a period of “disruptive” technology– Rapid rate of change
– Massive volumes of data
– Social changes: more participatory, engaged
– Distributed, web-based implementations
– Real-time analysis required
• But, as with any Age of Change, the basics still apply– Relational databases are still relevant
– Data Models are valuable to document business requirements and technical implementation
– The “hard stuff” still needs to be done: analysis, data quality, etc.
• Need to Adapt and Integrate– Utilize new technologies as appropriate
– Engage extended teams in a participatory manner
– Be more “agile” and adaptive in your projects and methodologies
• Have fun! This is an exciting time to be in Data Management
Questions?
Email: [email protected]: @donnaburbankwww.erwin.com