
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14


DESCRIPTION

An introduction to the Hadoop ecosystem, and Cloudera. Presented to the Nashville Cloudera User Group on October 23, 2014


Page 1: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

An Introduction to Hadoop and Cloudera
Nashville Cloudera User Group, 10/23/14
Ian Wrigley, Director, Educational Curriculum
[email protected] @iwrigley

201405  

Page 2: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Presentation Topics

An Introduction to Hadoop and Cloudera

§ The Motivation for Hadoop
§ ‘Core Hadoop’: HDFS and MapReduce
§ CDH and the Hadoop Ecosystem
§ Data Storage: HBase
§ Data Integration: Flume and Sqoop
§ Data Processing: Spark
§ Data Analysis: Hive, Pig, and Impala
§ Data Exploration: Cloudera Search
§ Managing Everything: Cloudera Manager
§ Conclusion

Page 3: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Traditional Large-Scale Computation

§ Traditionally, computation has been processor-bound
  – Relatively small amounts of data
  – Lots of complex processing
§ The early solution: bigger computers
  – Faster processor, more memory
  – But even this couldn’t keep up

Page 4: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Distributed Systems

§ The better solution: more computers
  – Distributed systems: use multiple machines for a single job

“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.”
  – Grace Hopper

(Diagram labels: Database, Hadoop Cluster)

Page 5: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Distributed Systems: Challenges

§ Challenges with distributed systems
  – Programming complexity
  – Keeping data and processes in sync
  – Finite bandwidth
  – Partial failures

Page 6: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Distributed Systems: The Data Bottleneck (1)

§ Traditionally, data is stored in a central location
§ Data is copied to processors at runtime
§ Fine for limited amounts of data

Page 7: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Distributed Systems: The Data Bottleneck (2)

§ Modern systems have much more data
  – Terabytes+ a day
  – Petabytes+ total
§ We need a new approach…

Page 8: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Hadoop

§ A radical new approach to distributed computing
  – Distribute data when the data is stored
  – Run computation where the data is stored

Page 9: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Hadoop: Very High-Level Overview

§ Data is split into “blocks” when loaded
§ Each task typically works on a single block
  – Many run in parallel
§ A master program manages tasks

(Diagram: placeholder text split into blocks; labels: Slave Nodes, Master)

Page 10: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Core Hadoop Concepts

§ Applications are written in high-level code
§ Nodes talk to each other as little as possible
§ Data is distributed in advance
  – Bring the computation to the data
§ Data is replicated for increased availability and reliability
§ Hadoop is scalable and fault-tolerant

Page 11: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Scalability

§ Adding nodes adds capacity proportionally
§ Increasing load results in a graceful decline in performance
  – Not failure of the system

(Chart: Capacity against Number of Nodes)

Page 12: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Fault Tolerance

§ Node failure is inevitable
§ What happens?
  – System continues to function
  – Master re-assigns tasks to a different node
  – Data replication = no loss of data
  – Nodes which recover rejoin the cluster automatically

“Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure.”
  – Ken Arnold (CORBA designer)
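The re-assignment step can be pictured with a small sketch. It is not Hadoop's scheduler; the task names, the `replicas` map, and the "pick the first surviving replica holder" rule are all assumptions made for illustration.

```python
def reassign_tasks(assignments, replicas, failed_node):
    """Move every task running on `failed_node` to another node that
    holds a replica of the same block (hypothetical, simplified policy)."""
    updated = dict(assignments)
    for task, node in assignments.items():
        if node != failed_node:
            continue  # task is unaffected by the failure
        candidates = [n for n in replicas[task] if n != failed_node]
        if not candidates:
            raise RuntimeError(f"no surviving replica for {task}")
        updated[task] = candidates[0]
    return updated

# Which nodes hold the block each task needs (illustrative values).
replicas = {
    "task-1": ["A", "B", "D"],
    "task-2": ["B", "D", "E"],
    "task-3": ["A", "B", "C"],
}
assignments = {"task-1": "A", "task-2": "B", "task-3": "C"}
after_failure = reassign_tasks(assignments, replicas, failed_node="B")
```

Because every block exists on several nodes, losing node B only moves `task-2`; the other tasks keep running where they are.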

Page 13: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Presentation Topics

§ ‘Core Hadoop’: HDFS and MapReduce

Page 14: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

HDFS Basic Concepts

§ The Hadoop Distributed File System (HDFS) is a filesystem written in Java
§ Sits on top of a native filesystem
§ Provides storage for massive amounts of data
  – Scalable
  – Fault tolerant
  – Supports efficient processing with MapReduce, Spark, and other tools

(Diagram labels: Hadoop Cluster, HDFS)

Page 15: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

How Files are Stored (1)

§ Data files are split into blocks and distributed to data nodes

(Diagram: a Very Large Data File split into Block 1, Block 2, and Block 3)

Page 16: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

How Files are Stored (2)

§ Data files are split into blocks and distributed to data nodes

(Diagram: a Very Large Data File split into Block 1, Block 2, and Block 3; Block 1 appears on three data nodes)

Page 17: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

How Files are Stored (3)

§ Data files are split into blocks and distributed to data nodes
§ Each block is replicated on multiple nodes (default 3x)

(Diagram: a Very Large Data File split into Block 1, Block 2, and Block 3; each block appears on three data nodes)

Page 18: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

How Files are Stored (4)

§ Data files are split into blocks and distributed to data nodes
§ Each block is replicated on multiple nodes (default 3x)
§ NameNode stores metadata
  – Metadata: information about files and blocks

(Diagram: a Very Large Data File split into Block 1, Block 2, and Block 3; each block appears on three data nodes; a NameNode holds the metadata)
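As a rough illustration of splitting and replication, here is a minimal sketch. The round-robin placement, the node names, and the tiny block size are assumptions chosen for readability; real HDFS placement takes rack topology into account and uses far larger blocks.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a byte string into fixed-size blocks; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    This is a simplification for illustration, not the HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = [
            nodes[(start + k) % len(nodes)] for k in range(replication)
        ]
    return placement

# A 250-byte "file" with 100-byte blocks yields three blocks.
blocks = split_into_blocks(b"x" * 250, block_size=100)
placement = place_replicas(len(blocks), ["A", "B", "C", "D", "E"])
```

The `placement` dictionary plays the role of the block-to-node half of the NameNode metadata: it records where each block lives, not the block contents.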

Page 19: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Example: Storing and Retrieving Files (1)

NameNode Metadata

/logs/031512.log: B1,B2,B3
/logs/041213.log: B4,B5

B1: A,B,D
B2: B,D,E
B3: A,B,C
B4: A,B,E
B5: C,E,D

Data nodes (blocks held)

Node A: 1, 3, 4
Node B: 1, 2, 3, 4
Node C: 3, 5
Node D: 1, 2, 5
Node E: 2, 4, 5

Client: /logs/041213.log?
NameNode: B4,B5
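The lookup in this example can be sketched with two dictionaries that mirror the NameNode metadata shown in the slide. The `locate` function and its `failed_nodes` parameter are hypothetical names introduced here; the real client protocol is more involved.

```python
# Metadata values copied from the NameNode example in the slide.
FILE_TO_BLOCKS = {
    "/logs/031512.log": ["B1", "B2", "B3"],
    "/logs/041213.log": ["B4", "B5"],
}
BLOCK_TO_NODES = {
    "B1": ["A", "B", "D"],
    "B2": ["B", "D", "E"],
    "B3": ["A", "B", "C"],
    "B4": ["A", "B", "E"],
    "B5": ["C", "E", "D"],
}

def locate(path, failed_nodes=frozenset()):
    """Return [(block, live nodes holding it)] for a file, in block order."""
    plan = []
    for block in FILE_TO_BLOCKS[path]:
        live = [n for n in BLOCK_TO_NODES[block] if n not in failed_nodes]
        if not live:
            raise RuntimeError(f"all replicas of {block} are unavailable")
        plan.append((block, live))
    return plan

plan = locate("/logs/041213.log")
plan_without_e = locate("/logs/041213.log", failed_nodes={"E"})
```

With node E down, both blocks of `/logs/041213.log` are still readable from the remaining replicas, which is the point of 3x replication.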

Page 20: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Example: Storing and Retrieving Files (2)

(Diagram: the same NameNode metadata, data nodes A to E, and client request for /logs/041213.log as in part 1)

Page 21: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Important Notes About HDFS

§ HDFS performs best with a modest number of large files
  – Millions, rather than billions, of files
  – Each file typically 100MB or more
§ Files in HDFS are “write once”
  – Files can be replaced but not changed

Page 22: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

MapReduce

§ The Mapper
  – Each Map task (typically) operates on a single HDFS block
  – Map tasks (usually) run on the node where the block is stored
§ Shuffle and Sort
  – Sorts and consolidates intermediate data from all mappers
  – Happens after all Map tasks are complete and before Reduce tasks start
§ The Reducer
  – Operates on shuffled/sorted intermediate data (Map task output)
  – Produces final output

(Diagram: Map, Shuffle and Sort, Reduce)
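The three phases can be imitated in a single process with a word count, the customary first MapReduce example. This is only a sketch of the data flow: the function names are made up here, and nothing runs in parallel or touches HDFS.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Reduce: collapse all values for one key into a final result."""
    return key, sum(values)

def run_job(lines):
    # All map output is collected before any reducer runs, matching the
    # ordering described for the shuffle and sort phase.
    intermediate = [pair for line in lines for pair in mapper(line)]
    return [reducer(k, vs) for k, vs in shuffle_and_sort(intermediate)]

result = run_job(["the cat sat", "the dog sat", "the end"])
```

On a cluster, each input line group would come from a different HDFS block and the mapper calls would run on the nodes holding those blocks.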

Page 23: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Presentation Topics

§ CDH and the Hadoop Ecosystem

Page 24: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

The Hadoop Ecosystem (1)

(Diagram: CDH)
Hadoop Ecosystem: Hive, Pig, Impala, Sqoop, Oozie, Flume, HBase, …
Hadoop Core Components: Hadoop Distributed File System, MapReduce

Page 25: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

The Hadoop Ecosystem (2)

§ CDH includes many Hadoop Ecosystem components
§ Following are more details on some of the key components

(Diagram: Hadoop Ecosystem: Hive, Pig, Impala, Sqoop, Oozie, Flume, HBase, …)

Page 26: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

CDH

§ CDH (Cloudera’s Distribution, including Apache Hadoop)
  – 100% open source, enterprise-ready distribution of Hadoop and related projects
  – The most complete, tested, and widely-deployed distribution of Hadoop
  – Integrates all key Hadoop ecosystem projects

Page 27: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Presentation Topics

§ Data Storage: HBase

Page 28: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

HBase: The Hadoop Database

§ HBase: database layered on top of HDFS
  – Provides interactive access to data
§ Stores massive amounts of data
  – Petabytes+
§ High throughput
  – Thousands of writes per second (per node)
§ Handles sparse data well
  – No wasted space for a row with empty columns
§ Limited access model
  – Optimized for lookup of a row by key rather than full queries
  – No transactions: single row operations only

(Diagram label: HDFS)

Page 29: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

HBase vs RDBMS

                                            RDBMS       HBase
Transactions                                Yes         Single row only
Query language                              SQL         get/put/scan (or use Hive or Impala)
Indexes                                     Yes         Row-key only
Max data size                               TBs         PBs
Read/write throughput (queries per second)  Thousands   Millions
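The get/put/scan access model and the sparse storage idea can be mimicked with a toy in-memory table. This is not the HBase client API; `SparseTable` and the sample row keys are invented for illustration.

```python
from bisect import bisect_left, insort

class SparseTable:
    """Toy key-value table: rows sorted by key, empty columns not stored."""

    def __init__(self):
        self._keys = []   # row keys, kept sorted so range scans are cheap
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Look up one row by key; a missing row is an empty dict."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Yield (key, row) for start <= key < stop, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            key = self._keys[i]
            yield key, dict(self._rows[key])
            i += 1

table = SparseTable()
table.put("user#002", "email", "b@example.com")
table.put("user#001", "name", "Ada")
table.put("user#001", "email", "a@example.com")
table.put("video#001", "title", "Intro")
users = list(table.scan("user#", "user$"))  # "$" sorts just after "#"
```

Row `user#002` stores only the one column it has, and the only index is the sorted row key, which is why row-key design matters so much for this style of store.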

Page 30: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

When To Use HBase

§ Use plain HDFS if…
  – You only append to your dataset (no random write)
  – You usually read the whole dataset (no random read)
§ Use HBase if…
  – You need random write and/or read
  – You do thousands of operations per second on TB+ of data
§ Use an RDBMS if…
  – Your data fits on one big node
  – You need full transaction support
  – You need real-time query capabilities

Page 31: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Presentation Topics

§ Data Integration: Flume and Sqoop

Page 32: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Flume: Real-time Data Import

§ What is Flume?
  – A service to move large amounts of data in real time
  – Example: storing log files in HDFS
§ Flume is
  – Distributed
  – Reliable and available
  – Horizontally scalable
  – Extensible

Page 33: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Flume: High-Level Overview

(Diagram labels: Agent, Agent(s), compress, encrypt, HDFS)

• Collect data as it is produced
  • Files, syslogs, stdout or custom source
• Pre-process data before storing
  • e.g., transform, scrub, enrich
• Process in place
  • e.g., encrypt, compress
• Write in parallel
  • Scalable throughput
• Store in any format
  • Text, compressed, binary, or custom sink

Page 34: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Sqoop: Exchanging Data With RDBMSs

§ Sqoop: SQL to Hadoop
  – Transfers data between RDBMS and HDFS
  – Uses a command-line tool or application connector
  – Allows incremental imports
  – Supports virtually all RDBMSs which speak JDBC
  – Custom connectors available for some RDBMSs for increased speed

(Diagram: HDFS, Sqoop, RDBMS)
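The idea behind an incremental import is to remember the highest value of a check column already imported and fetch only rows beyond it. The sketch below imitates that idea with SQLite standing in for the RDBMS; the `incremental_import` helper, the `orders` table, and its columns are assumptions for illustration and are not Sqoop code.

```python
import sqlite3

def incremental_import(conn, table, check_column, last_value):
    """Fetch only rows whose check column exceeds the last imported value.

    Returns (new_rows, new_last_value). Table and column names are
    interpolated into the SQL text, so they must come from trusted code.
    """
    cursor = conn.execute(
        f"SELECT * FROM {table} WHERE {check_column} > ? "
        f"ORDER BY {check_column}",
        (last_value,),
    )
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    if rows:
        last_value = rows[-1][columns.index(check_column)]
    return rows, last_value

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

first_batch, last_id = incremental_import(conn, "orders", "id", 0)
conn.execute("INSERT INTO orders VALUES (3, 7.25)")  # new row arrives later
second_batch, last_id = incremental_import(conn, "orders", "id", last_id)
```

The second call transfers only the row added after the first import, which is what keeps repeated imports of a growing table cheap.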

Page 35: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Data Center Integration

(Diagram: File Server, Relational Database (OLTP), Data Warehouse (OLAP), and Web/App Servers connected to the Hadoop Cluster via hadoop fs, Sqoop, and Flume)

Page 36: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Presentation Topics

§ Data Processing: Spark

Page 37: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Apache Spark

§ Apache Spark is a fast, general engine for large-scale data processing on a cluster
§ Originally developed at AMPLab at UC Berkeley
§ Open source Apache project
§ Provides several benefits over MapReduce
  – Faster
  – Better suited for iterative algorithms
  – Can hold intermediate data in RAM, resulting in much better performance
  – Easier API
  – Supports Python, Scala, Java
  – Supports real-time streaming data processing

Page 38: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Spark vs Hadoop MapReduce

§ MapReduce
  – Widely used, huge investment already made
  – Supports and supported by many complementary tools
  – Mature, well-tested
§ Spark
  – Flexible
  – Elegant
  – Fast
  – Supports real-time streaming data processing
§ Over time Spark will supplant MapReduce as the general processing framework used by most organizations

Page 39: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Presentation Topics

§ Data Analysis: Hive, Pig, and Impala

Page 40: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Hive and Pig: High Level Data Languages

§ The motivation: MapReduce is powerful but hard to master
§ Even Spark requires a developer who can code in Scala or Python
§ A solution: Hive and Pig
  – Built on top of MapReduce
  – Currently being ported to run on top of Spark for better performance
  – Leverage existing skillsets
    – Data analysts who use SQL
    – Programmers who use scripting languages
  – Open source Apache projects
    – Hive initially developed at Facebook
    – Pig initially developed at Yahoo!

Page 41: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Hive

§ What is Hive?
  – HiveQL: An SQL-like interface to Hadoop

SELECT * FROM purchases
WHERE price > 10000
ORDER BY storeid
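To see what this query returns, the same statement can be run against a small in-memory SQLite table. SQLite is only a stand-in here, not Hive; the column list is assumed from the Pig example later in the deck, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and sample rows, for illustration only.
conn.execute(
    "CREATE TABLE purchases "
    "(itemid INTEGER, price REAL, storeid INTEGER, purchaserid INTEGER)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        (1, 12000.0, 7, 100),
        (2, 500.0, 3, 101),
        (3, 25000.0, 2, 102),
    ],
)

# The query text is the one shown on the slide.
big_purchases = conn.execute(
    "SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid"
).fetchall()
```

In Hive the same statement would be compiled into jobs that run across the cluster, so the analyst writes familiar SQL while the data stays in HDFS.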

Page 42: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Pig

§ What is Pig?
  – Pig Latin: A dataflow language for transforming large data sets

purchases = LOAD '/user/dave/purchases'
    AS (itemID, price, storeID, purchaserID);

bigticket = FILTER purchases BY price > 10000;
...
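The dataflow style, where each statement names a new relation derived from an earlier one, can be imitated with Python generators. The `load` and `filter_by` helpers and the sample data are invented for this sketch; the tab delimiter matches the default used by Pig's standard loader.

```python
import csv
import io

def load(text, fields):
    """LOAD ... AS (...): parse tab-separated records into dicts."""
    for record in csv.reader(io.StringIO(text), delimiter="\t"):
        yield dict(zip(fields, record))

def filter_by(relation, predicate):
    """FILTER ... BY ...: keep only records that satisfy the predicate."""
    return (row for row in relation if predicate(row))

# Three made-up purchase records: itemID, price, storeID, purchaserID.
SAMPLE = "1\t12000\t7\t100\n2\t500\t3\t101\n3\t25000\t2\t102\n"

purchases = load(SAMPLE, ["itemID", "price", "storeID", "purchaserID"])
bigticket = list(
    filter_by(purchases, lambda row: float(row["price"]) > 10000)
)
```

As in Pig, nothing is read until a result is actually requested (here, by `list`), and each step can be followed by further transformations.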

Page 43: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Impala: High Performance Queries

§ High-performance SQL engine for vast amounts of data
  – Similar query language to HiveQL
  – 10 to 50+ times faster than Hive, Pig, or MapReduce
  – Effectively, provides ‘real time’ results
§ Impala runs on Hadoop clusters
  – Data stored in HDFS
  – Does not use MapReduce
§ Developed by Cloudera
  – 100% open source, released under the Apache software license

Page 44: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Which to Choose? (1)

§ Choose the best solution for the given task
  – Mix and match as needed
§ MapReduce
  – Low-level approach offers flexibility, control, and performance
  – More time-consuming and error-prone to write
  – Choose when control and performance are most important
§ Pig, Hive, and Impala
  – Faster to write, test, and deploy than MapReduce
  – Better choice for most analysis and processing tasks

Page 45: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Which to Choose? (2)

§ Use Impala when…
  – You have analysts familiar with SQL
  – You need near real-time responses to ad hoc queries
  – You have structured data with a defined schema
§ Use Hive or Pig when…
  – You need support for custom file types, or complex data types
§ Use Pig when…
  – You have developers experienced with writing scripts
  – Your data is unstructured/multi-structured
§ Use Hive when…
  – Your data is structured and you are performing long-running, batch jobs

Page 46: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Comparing Pig, Hive, and Impala

Description of Feature            Pig        Hive       Impala
SQL-based query language          No         Yes        Yes
Schema                            Optional   Required   Required
Supports user-defined functions   Yes        Yes        Yes
Extensible file format support    Yes        Yes        No
Query speed                       Slow       Slow       Fast
Accessible via ODBC/JDBC          No         Yes        Yes

Page 47: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Do These Replace an RDBMS?

§ Probably not if the RDBMS is used for its intended purpose
§ Relational databases are optimized for:
  – Relatively small amounts of data
  – Immediate results
  – In-place modification of data
§ Pig, Hive, and Impala are optimized for:
  – Large amounts of read-only data
  – Extensive scalability at low cost
§ Pig and Hive are better suited for batch processing
  – Impala and RDBMSs are better for interactive use

Page 48: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Analysis Workflow Example

(Diagram: activities around a Hadoop Cluster with Impala)
Import Transaction Data from RDBMS
Sessionize Web Log Data with Pig
Sentiment Analysis on Social Media with Hive
Generate Nightly Reports using Pig, Hive, or Impala
Analyst using Impala shell for ad hoc queries
Analyst using Impala via BI tool

Page 49: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Presentation Topics

§ Data Exploration: Cloudera Search

Page 50: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Cloudera Search

§ Real-time, scalable indexing
§ Load any type of data
§ Text and faceted searching

Page 51: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Cloudera Search Example: Twitter Feed Search

(Screenshot labels: Iterative search using facets; Full text search)

Page 52: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Presentation Topics

§ Managing Everything: Cloudera Manager

Page 53: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Reducing Complexity With Cloudera Manager

§ Putting Hadoop into production requires stringent uptimes
§ Clusters are made up of a large number of hosts
  – Each host runs multiple Hadoop services
  – Difficult to know the status of everything
§ Inevitable issues will arise with hardware and software
§ Keeping track of the cluster becomes an issue
  – Are all hosts healthy and working?
  – Am I using all of the best practices for the service?
  – Is there a performance issue for a host or service?
  – Is the cluster secure?

Page 54: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

What Is Cloudera Manager?

§ Cloudera Manager is a purpose-built application designed to make the administration of Hadoop simple and straightforward
  – Automates the installation of a Hadoop cluster
  – Quickly adds and configures new services on a cluster
  – Provides real-time monitoring of cluster activity
  – Produces reports of cluster usage
  – Manages users and groups who have access to the cluster
  – Integrates with your existing enterprise monitoring tools
§ Cloudera Manager Express Edition
  – Free
§ Cloudera Enterprise
  – Cloudera Manager plus support
  – Contact us for pricing

Page 55: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Cloudera Manager Dashboard

Page 56: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Health Status and Charting

Page 57: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Presentation Topics

§ Conclusion

Page 58: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Conclusion

§ There are several more projects in CDH
  – CDH supports all the key projects you need
§ We haven’t even talked about security!
  – CDH includes Kerberos integration for authentication
  – Cloudera Enterprise provides all the security you need, whatever your industry
  – Recently achieved PCI certification
§ Download the QuickStart VM to get started in a single VM
§ Try Cloudera on a real cluster for free
§ All available at cloudera.com/live
§ Questions?

Page 59: An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
