A Big Data Primer
Stacia Misner
E-mail: [email protected]
Twitter: @StaciaMisner
Blog: blog.datainspirations.com


Copyright © 2013 by Data Inspirations Inc. All rights reserved.

Session Overview

• What’s the Fuss?
• What’s in the Big Data Stack?
• Where Do I Start?


What’s the Fuss?

• Some Background…
• Classic Data Analysis versus Big Data
• Why Now?
• Why Bother?


Some Background…

Google Trends: “Big Data”


Has Big Data Jumped the Shark?

• Volume
• Velocity
• Variety
• Variability


Is Big Data the Next Frontier?


Classic Data Analysis… Uses Just a Subset

Data Warehouse & BI Solutions
ETL


Classic Data Analysis… Requires Structure

Data Warehouse & BI Solutions
ETL


Variety Includes Unstructured Data


Big Data versus Traditional BI

http://blogs.forrester.com/brian_hopkins/11-08-29-big_data_brewer_and_a_couple_of_webinars


Why Now? The Times… They Are A’Changin’

Cost of Storage Decreasing
• 1970: 1 TB = $1,000,000
• 2013: 1 TB < $100

Direct attached storage, not Enterprise SAN!


The Times… They Are A’Changin’

Data Volumes Increasing
• All Books: 15 TB
• Daily Tweets: 15 TB


The Times… They Are A’Changin’

Processing Power Increasing
3 Billion Base Pairs to Analyze
• Then… 10 Years (Completed in 2003)
• Now… 1 Week, At 1/10th the Cost


Why Now?

Powerful, Scalable, Cheap, Elastic


Why Bother?

• Make more data available faster
• Deliver access to more detailed, accurate information to adjust just-in-time
• Segment customers at a more granular level for personalization of products and services
• Perform more sophisticated analytics
• Improve products

Case Study
Customer, Product, Promotion Data -> Personalized Promotions
• Before Big Data: 8 weeks
• After Big Data: 1 week and dropping

http://wiki.apache.org/hadoop/PoweredBy


What’s In the Big Data Stack?

• Key Differences
• Hadoop Ecosystem
• Hadoop and Analysis Services


Key Differences

• Scale Out As Needed With Commodity Hardware
• Impose Schema On Read
• BASE: Basically Available, Soft-state, Eventually consistent
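
A minimal Python sketch of what "schema on read" can look like, for illustration only. The sample records, field names, and helper functions are hypothetical and not from the session. The raw lines are stored untouched, and each reader imposes only the structure it needs at query time.

```python
import json

# Hypothetical raw events, stored as-is: no schema was enforced on write.
RAW_EVENTS = [
    '{"user": "ann", "amount": "19.99", "channel": "web"}',
    '{"user": "bob", "amount": "5"}',
    '{"user": "cy", "amount": "n/a", "coupon": "SPRING"}',
]


def read_purchases(lines):
    """Apply a (user, amount) schema at read time; skip rows that do not fit."""
    for line in lines:
        record = json.loads(line)
        try:
            yield record["user"], float(record["amount"])
        except (KeyError, ValueError):
            # Row lacks a field or has a non-numeric amount under this schema.
            continue


def read_channels(lines):
    """A second reader imposes a different schema on the same raw data."""
    for line in lines:
        yield json.loads(line).get("channel", "unknown")


purchases = list(read_purchases(RAW_EVENTS))
channels = list(read_channels(RAW_EVENTS))
```

A data warehouse would typically reject or transform the third record at load time; here it simply stays in storage and is ignored by readers that cannot use it.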


Hadoop Ecosystem

• HDFS
• MapReduce

Note: This is only a subset of the ecosystem!
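
To make the MapReduce idea concrete, here is a small single-process Python sketch of the map, shuffle, and reduce steps using a word count. It does not use Hadoop or any Hadoop API; the function names and sample documents are assumptions chosen for illustration. A real cluster would run the same steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data primer", "big data stack"])
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over commodity machines.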


Problem to Solve

• Elasticity
  o Ability to analyze structured and unstructured data
  o DW imposes structure for questions we know we want answered
  o Need ability to incorporate other types of data on demand
• Scale
  o Low cost commodity hardware
  o Distributed workload


Hadoop & Analysis Services – High Latency


Hadoop & Analysis Services – Medium Latency

• Linked Server
• HiveODBC driver


Hadoop & Analysis Services – Medium Latency

Analysis Management Objects (AMO) to push data into SSAS


Hadoop & Analysis Services – Low Latency

Options:
• Impala (Cloudera)
• Spark and Shark (UC Berkeley)
• Stinger (Hortonworks)


Where Do I Start?

• Big Data Lifecycle
• Approaches


Big Data Lifecycle

• Discovery
  o Look at internal/external processes – What is a challenge? Where could overwhelming advantage be useful?
  o Formulate hypothesis
• Data Preparation
• Model Planning
• Model Building
• Result Communication
• Production


Big Data Business Models


Big Data Lifecycle

• Data Preparation
  o Explore the data in a sandbox
  o Condition the data


Big Data Lifecycle

• Model Planning
  o Decide on methods and models
  o Examine data for key variables


Big Data Lifecycle

• Model Building
  o Create data sets for testing, training, and production
  o Set up hardware environment


Big Data Lifecycle

• Result Communication
  o Validate (or not) hypothesis
  o Share findings


Big Data Lifecycle

• Production
  o Pilot project
  o Operationalize


Approaches – Store and Analyze

• Integrate and consolidate
  o Better data quality
  o Access to history
  o Higher storage requirements and latency impact
• Choose hardware
  o Massively Parallel Processing (PDW)
  o Tabular – data compression
  o RDBMS – column-store
  o NoSQL – multiple variable data sources
• Analyze data at rest


Approaches – Analyze and Store

• Filter and aggregate data before adding to DW
  o Reduce action time (receipt of raw data to decision point) to attain greater business agility
  o Lower storage and administrative overhead
• Analyze data in motion (complex event processing)
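
The "filter and aggregate before adding to DW" step could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a complex event processing engine: the event shape `(product, amount)`, the `min_amount` threshold, and the function name are all hypothetical.

```python
from collections import defaultdict


def filter_and_aggregate(events, min_amount=1.0):
    """Consume events one at a time and keep only per-product totals.

    Events below min_amount are dropped, so neither they nor the raw
    detail rows ever reach the warehouse.
    """
    totals = defaultdict(float)
    for product, amount in events:
        if amount < min_amount:
            continue  # filter: ignore events too small to matter
        totals[product] += amount  # aggregate: running total per product
    return dict(totals)


# Hypothetical stream; an iterator stands in for data in motion.
stream = iter([
    ("widget", 10.0),
    ("gadget", 0.5),
    ("widget", 2.5),
    ("gadget", 4.0),
])
summary = filter_and_aggregate(stream)
```

Only the small `summary` dictionary would be loaded into the warehouse, which is where the lower storage and administrative overhead comes from.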


Overwhelmed? Prototype First!

• Define a small project – focus on one product, for example
• Capture data for the subset of focus for limited duration (one month)
• Take action on analytics and measure resulting change

http://www.microsoft.com/bigdata


Session Review

• What’s the Fuss?
• What’s in the Big Data Stack?
• Where Do I Start?


Resources

• Big data has jumped the shark (9/11/2011)
  o www.dbms2.com/2011/09/11/big-data-has-jumped-the-shark/
• Big data: The next frontier for innovation, competition, and productivity (aka The McKinsey report)
  o http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
• What a Big Data Model Looks Like
  o http://blogs.hbr.org/cs/2012/12/what_a_big-data_business_model.html

 


Resources

• Architectures for Running SSAS on Data in Hadoop Hive
  o http://thinknook.com/architectures-for-running-sql-server-analysis-service-ssas-on-data-in-hadoop-hive-2013-02-25/