Big Data Analytics
DAMA NY DAMA Day
October 17, 2013
IBM
590 Madison Avenue, 12th floor
New York, NY
Tom Haughey
InfoModel, LLC
868 Woodfield Road
Franklin Lakes, NJ 07417
201 755 3350
© InfoModel, LLC. 2013 2
Agenda
• Definition
• Types of data
  – Structured
  – Semi-structured
  – Unstructured
• Why Big Data is important
• Sources of Big Data
• Levels of Big Data
• Use cases for Big Data
• Big Data analytics
  – Data mining
  – Predictive analysis
• NOSQL and Big Data Landscape
• The new business intelligence architecture
• How to prepare for Big Data
• Pitfalls
• Conclusions
Big Data Definition
• Big data consists of high-volume, high-velocity, high-variety and high-value data and processes that demand cost-effective, innovative forms of information processing for enhanced insight and decision making
Source: modified from Gartner Glossary
Big Data in the Past
• A decade ago, Big Data was:
  – A scalability problem
  – A performance problem
• Added to that was the difficulty of making sense of it
• That is where today’s Big Data and Big Data Analytics come into play
Why Now?
• Why can we achieve this now?
• The four-minute mile syndrome
  – Nobody could do it till Roger Bannister did it
  – Now lots of us can do it (!)
• Before, we didn’t have:
  – The hardware technology
  – The software systems
  – The data management systems
  – The thought processes
Sources of Big Data
• Streaming data (e.g., stock market)
• Video archives
• Large-scale e-commerce
• Social and professional networks
• Internet text and documents
• Internet search indexing
• Call detail records
• Web logs
• RFID
• Medical records
• Sensor networks
• Social networks
• Military surveillance
• Astronomy
• Video and music archives
• Atmospheric science
• Genomics, biogeochemical
• Biological & other complex data
• Interdisciplinary scientific research
Big Data Support Technologies
• The emergence of commodity servers
• NOSQL file management systems and Hadoop
• Inverted column databases
• In-memory databases and analytics
• Convergence of machine learning and data mining
• Management of structured and unstructured content
• Support of Hadoop and MapReduce by major RDBMS vendors
Types of Data
• Structured – having a fixed and external structure (external to the data structure itself)
• Semi-structured – having a structure embedded in the data. Instances contain data values and metadata. The structure may vary but still needs to be planned and modeled
• Unstructured – having no known structure. Often transformed to structured or semi-structured data for processing. The structure may vary but still needs to be planned and modeled
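The three types can be illustrated with a small Python sketch. The sample values, the `SALES_COLUMNS` schema and the regular expressions are hypothetical, chosen only to show where the structure lives in each case.

```python
import csv
import io
import json
import re

# Structured: the schema (column names and order) lives outside the data values.
SALES_COLUMNS = ["order_id", "household_id", "amount"]  # hypothetical schema
structured_rows = list(csv.reader(io.StringIO("1001,H-17,59.90\n1002,H-42,12.50\n")))
structured = [dict(zip(SALES_COLUMNS, row)) for row in structured_rows]

# Semi-structured: each instance carries its own metadata (the keys),
# and the structure may vary from one instance to the next.
semi_structured = [
    json.loads('{"order_id": 1003, "amount": 20.0, "coupon": "SPRING"}'),
    json.loads('{"order_id": 1004, "amount": 35.5, "gift": {"wrap": true}}'),
]

# Unstructured: free text with no known structure; it is usually transformed
# into structured or semi-structured form before processing.
unstructured = "Customer H-17 called about order 1001 and asked for a refund."
extracted = {
    "household_id": re.search(r"H-\d+", unstructured).group(),
    "order_id": int(re.search(r"order (\d+)", unstructured).group(1)),
}

print(structured[0])
print(sorted(semi_structured[1].keys()))
print(extracted)
```

Even in the unstructured case, the extraction step has to decide which fields to pull out, which is why the structure still needs to be planned and modeled.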
Big Data and Business Analytics
• These levels affect data structure, data access and scale
  – Level 1: Known, structured. Processed at small to large scale.
  – Level 2: Unknown, unstructured. Transformed to structured. Processed at small to large scale.
  – Level 3: Unknown, unstructured or semi-structured. Processed directly or at small to large scale.
  – Level 4: Roll Your Own
Adapted from McKinsey
Sample Use Cases by Data Level
• Level 1:
  – Pricing: targeted price setting
  – Campaign lead generation
  – Customer experience
  – Pricing based on customer value
• Level 2:
  – Impact of marketing on sales
  – Market basket to determine risk
  – Next product to buy (NPTB)
  – Cross channel integration
• Level 3:
  – Fraud prevention
  – Discount targeting based on location, likelihood-to-leave, web analytics
• Level 4:
  – Targeted advertising, right landing page
  – Pricing and targeted advertising, right price and landing page
  – Credit line management
Adapted from McKinsey
Business Analytics
• Solutions used to build analytical, historical models and simulations to create scenarios, understand current status and predict future states
• Business analytics includes data mining, predictive analytics, applied analytics and statistics, and is delivered as an application suitable for a business user
• Big Data Analytics is the convergence of Big Data and Business Analytics
[Diagram: Big Data Analytics shown as the overlap of Big Data and Business Analytics]
• Without Big Data Analytics, big data is “just a lot of data”
Should You Kill Your Data Warehouse?
See Forbes, 8/24/2011 [Maybe don’t see!]
Try This Query
• Try this query on semi-structured or unstructured data in a NOSQL or other multi-structured data environment
  – “Give me a breakdown of sales revenue and volume by household by month, order it by the org unit that sold the product and the org unit that owned the product, summarize it from product type, to product subgroup and product group, for the last 5 years”
• In a DW containing this data, this query can be run efficiently and coded fairly easily in SQL
• How do you do this on enormous quantities of semi-structured or unstructured data using existing technologies?
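To make the DW side of the comparison concrete, here is a minimal sketch of how the query might look in SQL, run through Python's built-in `sqlite3` module as a stand-in for a warehouse. The `sales` and `product` tables, their columns, the sample rows and the `as_of` date are all assumptions for illustration. The sketch stops at the product-type grain and carries the subgroup and group columns along; the higher-level summaries would be further GROUP BY passes over those same columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_type TEXT PRIMARY KEY,
    product_subgroup TEXT,
    product_group TEXT
);
CREATE TABLE sales (
    sale_date TEXT,          -- ISO date, e.g. '2013-06-15'
    household_id TEXT,
    selling_org TEXT,
    owning_org TEXT,
    product_type TEXT REFERENCES product(product_type),
    revenue REAL,
    quantity INTEGER
);
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)", [
    ("Checking", "Deposits", "Retail Banking"),
    ("Savings", "Deposits", "Retail Banking"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("2013-06-03", "H-17", "Branch A", "Retail", "Checking", 100.0, 1),
    ("2013-06-20", "H-17", "Branch A", "Retail", "Checking", 50.0, 2),
    ("2013-06-21", "H-42", "Branch B", "Retail", "Savings", 75.0, 1),
    ("2005-01-10", "H-17", "Branch A", "Retail", "Checking", 999.0, 9),  # outside the 5-year window
])

# Revenue and volume by household by month, ordered by selling and owning
# org unit, with the product hierarchy carried on every row.
QUERY = """
SELECT s.selling_org, s.owning_org,
       p.product_group, p.product_subgroup, p.product_type,
       s.household_id,
       strftime('%Y-%m', s.sale_date) AS sale_month,
       SUM(s.revenue)  AS revenue,
       SUM(s.quantity) AS volume
FROM sales AS s
JOIN product AS p ON p.product_type = s.product_type
WHERE s.sale_date >= date(:as_of, '-5 years')
GROUP BY s.selling_org, s.owning_org,
         p.product_group, p.product_subgroup, p.product_type,
         s.household_id, sale_month
ORDER BY s.selling_org, s.owning_org,
         p.product_group, p.product_subgroup, p.product_type,
         s.household_id, sale_month
"""
rows = conn.execute(QUERY, {"as_of": "2013-10-17"}).fetchall()
for row in rows:
    print(row)
```

The query is short because the household, org unit and product hierarchy are already modeled as columns; on raw semi-structured or unstructured data, those fields would first have to be extracted and conformed.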
Forms of Analytics
• Traditional BI and OLAP
  – Will stay
  – Consumers already use these
  – Consumers will add to them
• Big Data Analytics
  – Discovery oriented
  – Shows value in Big Data
  – Can leverage new platforms: e.g., Analytics DB
  – Undergoing strong acceptance by consumer
• Traditional BI and OLAP
  – Well known and required.
  – Works well with most EDWs.
  – Many levels and styles of BI
• Advanced SQL
  – Well-known SQL-based tools/techniques.
  – Can result in long, complex SQL statements to gather, aggregate and model data
• Predictive Analytics
  – Data mining/statistics to understand the past and predict future events.
  – Requires special tools and rock stars.
• New Analytic Methods/Tools
  – Visualization, artificial intelligence, natural language processing.
  – Analytical DB functions: inverted column DBs, DW appliances, MapReduce, etc.
Data Mining
• The use of mathematical algorithms to find hidden relationships in the data
• It can be used to:
  – Find rules or approaches that worked well in the past
  – Identify dependencies or relationships between things
  – Segment or classify customers based on how well they match something you care about
  – Group and cluster things that are similar to each other
  – Spot and identify anomalies buried in the data
Text Source: James Taylor
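As one small illustration of spotting anomalies buried in the data, the sketch below flags outliers using the modified z-score, which is based on the median absolute deviation. This is just one common technique, not one the deck prescribes; the `daily_amounts` values are hypothetical, and 3.5 is a conventional cut-off.

```python
from statistics import median

def find_anomalies(values, threshold=3.5):
    """Flag values whose modified z-score (based on the median absolute
    deviation) exceeds `threshold`."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]

# Hypothetical daily transaction amounts with one value that stands out.
daily_amounts = [102, 98, 101, 97, 100, 99, 480, 103]
print(find_anomalies(daily_amounts))
```

Using the median rather than the mean keeps the one extreme value from distorting the baseline it is compared against.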
Techniques Used to Mine the Data
• Just as the popularity of new tools is exploding, so are the capabilities in data mining
• Data-mining techniques fall into four major categories:
  – Classification – such as targeted marketing
  – Association – such as market basket analysis
  – Sequencing – those who bought this bought that
  – Clustering – developing conclusions using space and distance
• NOTE: In Hadoop, querying and mining can be done through Hive, Mahout and Pig
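The association category can be sketched in a few lines of Python. This is a toy market basket analysis over hypothetical `baskets`, computing the usual support and confidence measures for a rule such as "butter implies bread"; it is not how Hive, Mahout or Pig would be used at scale.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets; each set holds the items bought in one transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

# Count how often each item, and each unordered pair of items, appears.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def rule_stats(antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    both = pair_counts[pair]
    support = both / len(baskets)                # share of baskets with both items
    confidence = both / item_counts[antecedent]  # share of antecedent baskets that also hold the consequent
    return support, confidence

print(rule_stats("butter", "bread"))
print(rule_stats("bread", "milk"))
```

Support says how common the combination is overall; confidence says how often the second item follows once the first is in the basket.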
Predictive Analytics
• Applies mathematical techniques to historical data to build a future analytic model.
• It predicts:
  – How likely something is to be true
  – Its likely value
  – The likely sequence
• For instance, instead of:
  – Finding dependencies true in historical data, find dependencies likely to be true in the future
  – Grouping customers based on historical similarities, group them on likelihood that they will behave similarly in the future
• Some regard data mining as the first step in predictive analytics
• Some use the terms synonymously
Source: James Taylor
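Two of the predictions listed above, how likely something is to be true and its likely value, can be sketched with plain Python. The `history` records, the segment names and the revenue figures are hypothetical, and the two estimators (historical frequency and a least-squares trend line) are deliberately the simplest possible stand-ins for a real predictive model.

```python
from collections import defaultdict

# Hypothetical history: (customer segment, did the customer leave?)
history = [
    ("free_checking", True), ("free_checking", False), ("free_checking", True),
    ("premium", False), ("premium", False), ("premium", True), ("premium", False),
]

def likelihood_by_segment(records):
    """Estimate P(leave | segment) as the historical frequency per segment."""
    totals, leavers = defaultdict(int), defaultdict(int)
    for segment, left in records:
        totals[segment] += 1
        leavers[segment] += int(left)
    return {segment: leavers[segment] / totals[segment] for segment in totals}

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical monthly revenue for months 1..4; predict month 5.
months, revenue = [1, 2, 3, 4], [10.0, 12.0, 14.0, 16.0]
slope, intercept = fit_line(months, revenue)
forecast = slope * 5 + intercept

print(likelihood_by_segment(history))
print(forecast)
```

Both functions are fitted on historical data and then applied to a case that has not happened yet, which is the step that separates prediction from description.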
Uses of Predictive Analysis
• Its major uses are to:
  – Improve efficiency
  – Reduce risk
  – Increase profitability
• Examples:
  – First case: Professional sports
    • “Moneyball”
    • Who should guard LeBron?
    • What are individual players really worth to the team?
  – Second case: banking
    • Customers are using a new free business checking system for personal checks as well, increasing the cost of those accounts
    • Will it be more profitable to pay them to leave?
  – Third case:
    • 7% of customers account for 43% of revenue
    • What should we offer them?
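A concentration figure like the one in the third case comes from ranking customers by revenue and summing the top slice. The sketch below shows that arithmetic on hypothetical per-customer revenues (not the data behind the 7% and 43% on the slide); `revenue_share_of_top` is an illustrative helper name.

```python
def revenue_share_of_top(revenues, top_fraction):
    """Share of total revenue contributed by the top `top_fraction` of customers."""
    ranked = sorted(revenues, reverse=True)
    top_n = max(1, round(len(ranked) * top_fraction))  # always include at least one customer
    return sum(ranked[:top_n]) / sum(ranked)

# Hypothetical revenue per customer (not the figures from the slide).
revenues = [500, 300, 50, 40, 30, 25, 20, 15, 10, 10]
share = revenue_share_of_top(revenues, 0.2)
print(f"Top 20% of customers account for {share:.0%} of revenue")
```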
Who Uses Big Data
• Data Scientists / Data Teams
• Knowledge workers
• BI consumers
• Decision makers at all levels of the business
The Big Data Analytics Environment
[Architecture diagram; labels recovered from the figure:]
• Data sources: RDBMSs, Files, Cloud, RYO data, Web Logs, Docs, Sensors, Events, XML/JSON, Streams
• Data Integration
• Traditional Data Warehouse Environment: Enterprise Data Warehouse Environment, DW, Mart, OLAP, Tables, Traditional Reporting and Analysis
• Big Data: Appliance, HADOOP, NOSQL, Real-time Analytics
• Big Data Analytics: Complex analysis of structured data; Analysis of irregularly structured data in Hadoop; Social sentiment and social network analysis
• Consumers
Means to Achieving Value in Big Data
• Create integrated, analytic sandboxes
• Use Hadoop as a complement to previous systems, not a replacement
• Derive data from Big Data as it is needed
  – Less emphasis on pre-aggregation and pre-summarization
  – As has been said since the opening days of client-server, send the function to the data
  – Not the data to the function (as in some vertical DBMSs today)
  – Learn to use parallel, distributed, commodity servers
• Use Big Data for staging as well as a live archive
• Virtualize Big Data
  – For reuse across multiple analytical applications
  – For easy access to the data when it is needed
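"Send the function to the data" is the idea behind MapReduce-style processing: each node runs the same function over its own partition and ships back only a small summary. The sketch below imitates that on one machine, with threads standing in for worker nodes; the `partitions` of web-log lines are hypothetical, and this is not Hadoop's actual API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical partitions, standing in for data blocks held on separate nodes.
partitions = [
    ["GET /home", "GET /cart", "GET /home"],
    ["GET /cart", "GET /checkout"],
    ["GET /home"],
]

def count_pages(partition):
    """Map step: runs where the partition lives and returns a small summary."""
    return Counter(line.split()[1] for line in partition)

def merge(left, right):
    """Reduce step: combine partial counts from two partitions."""
    return left + right

# Threads stand in for worker nodes in this sketch.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_pages, partitions))

page_hits = reduce(merge, partials, Counter())
print(page_hits.most_common())
```

Only the per-partition counts move between workers, and the totals are derived when they are needed rather than pre-aggregated in advance.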
Preparing for Big Data (BD)
• Define the business objectives
  – Big Data (BD) will yield business advantages
  – But not without business involvement
• Understand and prepare the data for BD as in any environment
  – It is NOT just about slamming the data into some humongous staging area
  – Data modeling is here to stay, but new methods are needed
  – Costs and technology frustrations will increase
  – But business advantages will go up as well
• Get the right staff
  – Both BD and Analytics are new skills
  – Organizations will need to hire, train, and learn accordingly
• Source the right suppliers and technology
  – BD Analytics will be mainstream; not just for giant web firms
  – Tools and platforms will improve so there will be less coding
  – Plus improvement in scalability, performance, real-time availability
  – Expect Hadoop and other Big Data infrastructure to become common
    • Hadoop will not replace anything
    • Data Warehousing and BI will continue
Pitfalls
• Potential pitfalls that can trip up organizations on big data analytics initiatives include:
  – Absence of clear business purpose
  – Jettisoning data management principles and practices
  – Absence of internal analytics skills (you need rock stars)
  – The high cost of hiring experienced analytics professionals
  – High costs of the new infrastructure (hardware and software)
  – Challenges in integrating Hadoop systems and data warehouses
  – Selecting the right vendors who offer software connectors across and to Big Data technologies
Conclusions
• Big Data must deliver Business Value
  – That is the “sine qua non” of Big Data Analytics
  – Reporting, analysis and OLAP will stay
  – You also need “discovery” analytics, predictive analysis and data mining
• Plan your entry into big data and implement it in sensible increments
  – Be clear up front on the business goals
  – Select key sources (data from Web, other systems, social networks)
• You will have to make some upgrades:
  – Add new BI/DW technologies
  – Train your staff
  – Change is inevitable
• Give the business what it needs
  – Discovery analytics to understand change, find opportunities
  – Broader, more complete views of all relevant entities (e.g., customer)
  – Analytics targeting your industry and your organization’s specific needs and unique collection of big data