Common and Unique Use Cases for Apache Hadoop August 30, 2011


Page 1: Common and unique use cases for Apache Hadoop

Common and Unique Use Cases for Apache Hadoop
August 30, 2011

Page 2: Common and unique use cases for Apache Hadoop

Agenda  

•  What is Apache Hadoop?
•  Log Processing
•  Catching `Osama'
•  Extract Transform Load (ETL)
•  Analytics in HBase
•  Machine Learning
•  Final Thoughts

Copyright 2011 Cloudera Inc. All rights reserved

Page 3: Common and unique use cases for Apache Hadoop

Exploding Data Volumes

•  Online
   •  Web-ready devices
   •  Social media
   •  Digital content
   •  Smart grids

•  Enterprise
   •  Transactions
   •  R&D data
   •  Operational (control) data

Relational

Complex, Unstructured


2,500 exabytes of new information in 2012, with the Internet as the primary driver

 

The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 "zettabytes" this year. Source: An IDC White Paper, sponsored by EMC: "As the Economy Contracts, the Digital Universe Expands," May 2009

 

Page 4: Common and unique use cases for Apache Hadoop

Timeline, 2002-2010:

Open Source, Web Crawler project created by Doug Cutting

Publishes MapReduce, GFS paper

Open Source, MapReduce & HDFS project created by Doug Cutting

Runs 4,000-node Hadoop cluster

Hadoop wins terabyte sort benchmark

Launches SQL support for Hadoop

Releases CDH3 and Cloudera Enterprise

Origin of Hadoop
How does an elephant sneak up on you?

Page 5: Common and unique use cases for Apache Hadoop

MapReduce

Hadoop Distributed File System (HDFS)

•  Consolidates Everything
   •  Move complex and relational data into a single repository

•  Stores Inexpensively
   •  Keep raw data always available
   •  Use commodity hardware

•  Processes at the Source
   •  Eliminate ETL bottlenecks
   •  Mine data first, govern later


What is Apache Hadoop?
Open Source Storage and Processing Engine

Page 6: Common and unique use cases for Apache Hadoop

What is Apache Hadoop?
The Standard Way Big Data Gets Done

•  Hadoop is Flexible:
   •  Structured, unstructured
   •  Schema, no schema
   •  High volume, merely terabytes
   •  All kinds of analytic applications

•  Hadoop is Open: 100% Apache-licensed open source

•  Hadoop is Scalable: Proven at petabyte scale

•  Benefits:
   •  Controls costs by storing data more affordably per terabyte than any other platform
   •  Drives revenue by extracting value from data that was previously out of reach


Page 7: Common and unique use cases for Apache Hadoop

No Lock-In: Investments in skills, services & hardware are preserved regardless of vendor choice

Community Development: Hadoop & related projects are expanding at a rapid pace


Rich Ecosystem: Dozens of complementary software, hardware and services firms

What is Apache Hadoop?
The Importance of Being Open


Page 9: Common and unique use cases for Apache Hadoop

•  Common uses of logs

•  Find or count events (grep)

grep "ERROR" file
grep -c "ERROR" file

•  Calculate metrics (performance or user behavior analysis)

awk '{sums[$1]+=$2; counts[$1]+=1} END {for (k in counts) print k, sums[k]/counts[k]}'

•  Investigate user sessions

grep "USER" files ... | sort | less
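The awk average above can be written out long-hand. A minimal Python equivalent (the sample log lines are invented for illustration) makes the grouping explicit, in the shape a Hadoop Streaming reducer would use:

```python
# Average field 2 grouped by field 1, like the awk one-liner above.
from collections import defaultdict

def average_by_key(lines):
    """Per-key averages over whitespace-separated 'key value' records."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        sums[parts[0]] += float(parts[1])
        counts[parts[0]] += 1
    return {k: sums[k] / counts[k] for k in counts}

# Example: per-URL average response time from hypothetical log lines.
sample = ["/home 120", "/home 80", "/login 40"]
print(average_by_key(sample))  # {'/home': 100.0, '/login': 40.0}
```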

Log Processing
A Perfect Fit

Page 10: Common and unique use cases for Apache Hadoop

•  Shoot... too much data

•  Homegrown parallel processing is often done on a per-file basis, because it's easy

•  No parallelism on a single large file

Log Processing
A Perfect Fit

[Diagram: three separate access_log files, each handled whole by its own task (Task 0, Task 1, Task 2)]

Page 11: Common and unique use cases for Apache Hadoop

•  MapReduce to the rescue!

•  Processing is done per unit of data

Log Processing
A Perfect Fit

[Diagram: a single access_log divided into 64 MB units (0-64 MB, 64-128 MB, 128-192 MB, 192-256 MB), assigned to Task 0 through Task 3; each task is responsible for one unit of data]

Page 12: Common and unique use cases for Apache Hadoop

•  Network or disk is the bottleneck

•  Reading 100GB of data takes:

•  14 minutes over a 1GbE network connection

•  22 minutes from a standard disk drive
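Those figures check out with back-of-envelope arithmetic, assuming 1 GbE moves about 125 MB/s and a standard drive streams about 75 MB/s (the drive figure appears on a later slide):

```python
# Sanity-check the "14 minutes" and "22 minutes" figures.
DATA_MB = 100 * 1000       # 100 GB expressed in MB (decimal units)
NET_MB_PER_S = 1000 / 8    # 1 Gb/s = 125 MB/s
DISK_MB_PER_S = 75         # one commodity drive

net_minutes = DATA_MB / NET_MB_PER_S / 60    # ~13.3 -> "14 minutes"
disk_minutes = DATA_MB / DISK_MB_PER_S / 60  # ~22.2 -> "22 minutes"
print(round(net_minutes, 1), round(disk_minutes, 1))  # 13.3 22.2
```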

Log Processing
A Perfect Fit

[Diagram: a single grep reading access_log; bandwidth is limited]

Page 13: Common and unique use cases for Apache Hadoop

•  Hadoop to the rescue!

•  Eliminates the network bottleneck; data is on local disk

•  Data is read from many, many disks in parallel

 

Log Processing
A Perfect Fit

[Diagram: Task 0 through Task 3 each read their 64 MB unit (0-64 MB, 64-128 MB, 128-192 MB, 192-256 MB) from local disk on separate physical machines (NodeA, NodeX, NodeY, NodeZ)]

Page 14: Common and unique use cases for Apache Hadoop

•  Hadoop currently scales to 4,000 nodes

•  Goal for the next release is 10,000 nodes

•  Nodes typically have 12 hard drives

•  A single hard drive has throughput of about 75MB/second

•  12 hard drives * 75 MB/second * 4,000 nodes = 3.4 TB/second

•  That's bytes, not bits

•  That's enough bandwidth to read 1PB (1,000 TB) in 5 minutes
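The aggregate-bandwidth arithmetic works out (the slide's 3.4 TB/second is the binary-unit reading of 3,600,000 MB/s):

```python
# Aggregate read bandwidth of a full-size cluster, per the bullets above.
DRIVES_PER_NODE = 12
MB_PER_S_PER_DRIVE = 75
NODES = 4000

total_mb_per_s = DRIVES_PER_NODE * MB_PER_S_PER_DRIVE * NODES  # 3,600,000 MB/s
tib_per_s = total_mb_per_s / 1024 ** 2   # ~3.4, matching the slide

pb_mb = 1000 ** 3                        # 1 PB = 1,000,000,000 MB (decimal)
minutes_per_pb = pb_mb / total_mb_per_s / 60  # ~4.6 -> "in 5 minutes"
print(total_mb_per_s, round(tib_per_s, 1), round(minutes_per_pb, 1))
```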

Log Processing
A Perfect Fit


Page 16: Common and unique use cases for Apache Hadoop

•  You have a few billion images of faces with geo-tags

•  Tremendous storage problem

•  Tremendous processing problem

•  Bandwidth

•  Coordination

Catching `Osama'
Embarrassingly Parallel

Page 17: Common and unique use cases for Apache Hadoop

•  Store the images in Hadoop

•  When processing, Hadoop will read the images from local disk: thousands of local disks spread throughout the cluster

•  Use a map-only job to compare input images against the `needle' image

Catching  `Osama’  Embarrassingly  Parallel  

Page 18: Common and unique use cases for Apache Hadoop

Catching `Osama'
Embarrassingly Parallel

[Diagram: images stored in SequenceFiles feed Map Task 0 and Map Task 1; each task holds a copy of the `needle' image and outputs faces matching it]
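The map-only pattern is simple enough to sketch. Real matching would compare face embeddings from a recognition model; here a byte-level hash stands in, and all ids and data are invented:

```python
import hashlib

def fingerprint(image_bytes):
    # Stand-in for a real face-matching model: a byte-level hash.
    return hashlib.sha1(image_bytes).hexdigest()

def map_only_match(needle_bytes, records):
    """Map-only job: each map task compares its share of (id, image)
    records against the needle and emits matching ids. No reduce phase
    is needed because no cross-record aggregation happens."""
    needle = fingerprint(needle_bytes)
    for image_id, image_bytes in records:
        if fingerprint(image_bytes) == needle:
            yield image_id

# One task's share of records, as if read from a SequenceFile.
records = [("img-001", b"\x01\x02"), ("img-002", b"\x09"), ("img-003", b"\x01\x02")]
print(list(map_only_match(b"\x01\x02", records)))  # ['img-001', 'img-003']
```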


Page 20: Common and unique use cases for Apache Hadoop

•  One of the most common use cases I see is replacing ETL processes

•  Hadoop is a huge sink of cheap storage and processing

•  Aggregates are built in Hadoop and exported

•  Apache Hive provides SQL-like querying on raw data
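The shape of such an aggregate is a grouped rollup, the kind a Hive query like `SELECT url, COUNT(*) FROM raw_logs GROUP BY url` would build before export. A toy in-memory version (event fields invented for illustration):

```python
from collections import Counter

def build_aggregate(raw_events):
    """Count events per URL: the rollup built inside Hadoop and then
    exported to the analytical database in the ETL-replacement pattern."""
    return Counter(event["url"] for event in raw_events)

raw = [{"url": "/a"}, {"url": "/b"}, {"url": "/a"}]
print(build_aggregate(raw))  # Counter({'/a': 2, '/b': 1})
```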

Extract Transform Load (ETL)
Everyone is doing it

Page 21: Common and unique use cases for Apache Hadoop

Extract Transform Load (ETL)
Everyone is doing it

[Diagram: a `real'-time system (website) backed by an Online DB feeds an ETL process into an Analytical DB / Data Warehouse, which serves Business Intelligence applications. Much blood is shed at the ETL step.]

Page 22: Common and unique use cases for Apache Hadoop

Extract Transform Load (ETL)
Everyone is doing it

[Diagram: the same pipeline with Hadoop in place of the ETL process; data is imported from the Online DB into Hadoop and exported to the Analytical DB / Data Warehouse]

Page 23: Common and unique use cases for Apache Hadoop

Extract Transform Load (ETL)
Everyone is doing it

[Diagram: the same pipeline, with Apache Sqoop handling both the import from the Online DB into Hadoop and the export to the Analytical DB / Data Warehouse]


Page 25: Common and unique use cases for Apache Hadoop

•  Analytics is often simply counting things

•  Facebook chose HBase to store its massive counter infrastructure (more later)

•  How might one implement a counter infrastructure in HBase?
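One way to sketch it: HBase's atomic increment operation, keyed by URL, is what makes concurrent counting safe. The Python happybase client exposes this as `Table.counter_inc`; the table and column names below are illustrative, not from the talk, and an in-memory stand-in replaces the live cluster:

```python
class InMemoryCounterTable:
    """Test double with happybase.Table-style counter semantics."""
    def __init__(self):
        self._counts = {}

    def counter_inc(self, row, column, value=1):
        # Mimics HBase's atomic increment: returns the new count.
        key = (row, column)
        self._counts[key] = self._counts.get(key, 0) + value
        return self._counts[key]

def record_page_view(table, url):
    """Bump the per-URL view counter; returns the new count. Against a
    real cluster, `table` could be happybase.Connection(host).table('counters')."""
    return table.counter_inc(url.encode("utf-8"), b"stats:views")

t = InMemoryCounterTable()
record_page_view(t, "com.cloudera/blog/post")
print(record_page_view(t, "com.cloudera/blog/post"))  # 2
```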

Analytics in HBase
Scaling writes

Page 26: Common and unique use cases for Apache Hadoop

Analytics in HBase
Scaling writes

Individual Page Counters

URL                        Counter
com.cloudera/blog/…        154
com.cloudera/downloads/…   923621
com.cloudera/resources/…   2138

User & Content Type Counters

User              Content    Counter
[email protected]    NEWS       5431
[email protected]    TECH       79310
[email protected]    SHOPPING   59
[email protected]    SPORTS     94214

A `Like' button IMG request sends an HTTP request to Facebook servers, which increments several counters

Page 27: Common and unique use cases for Apache Hadoop

Analytics in HBase
Scaling writes

Individual Page Counters

URL                        Counter
com.cloudera/blog/…        154
com.cloudera/downloads/…   923621
com.cloudera/resources/…   2138

The host is reversed in the URL as part of the key:

•  Data is physically stored in sorted order

•  Scanning all `com.cloudera' counters results in sequential I/O
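The key trick is easy to show concretely. A small helper (a hypothetical illustration, not code from the talk) that reverses the host portion of a URL so all keys for one domain sort adjacently:

```python
def row_key(url):
    """Build an HBase-style row key with the host reversed
    ('blog.cloudera.com/post' -> 'com.cloudera.blog/post') so every key
    for one domain sorts adjacently, turning a per-domain scan into
    sequential I/O."""
    host, _, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return f"{reversed_host}/{path}" if path else reversed_host

print(row_key("blog.cloudera.com/hadoop-post"))  # com.cloudera.blog/hadoop-post
```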

Page 28: Common and unique use cases for Apache Hadoop

•  Real-time counters of URLs shared, links "liked", impressions generated

•  20 billion events/day (200K events/sec)

•  ~30 second latency from click to count

•  Heavy use of the incrementColumnValue API for consistent counters

•  Tried MySQL and Cassandra, settled on HBase

http://tiny.cloudera.com/hbase-…-analytics

Facebook Analytics
Scaling writes


Page 30: Common and unique use cases for Apache Hadoop

Machine Learning
Apache Mahout

Text Clustering on Google News

Page 31: Common and unique use cases for Apache Hadoop

Machine Learning
Apache Mahout

Collaborative Filtering on Amazon

Page 32: Common and unique use cases for Apache Hadoop

Machine Learning
Apache Mahout

Classification in GMail

Page 33: Common and unique use cases for Apache Hadoop

Machine Learning
Apache Mahout

•  Apache Mahout implements:

•  Collaborative Filtering

•  Classification

•  Clustering

•  Frequent itemset mining

•  More coming with the integration of MapReduce.Next
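To give collaborative filtering some concreteness: the core idea is scoring items by how often they co-occur with a user's items in other users' histories. A toy in-memory version (Mahout's CF jobs do this at scale over MapReduce; the baskets here are invented):

```python
from collections import Counter

def recommend(target_items, other_baskets, top_n=3):
    """Toy item-based collaborative filter: score candidate items by
    co-occurrence with the target user's items in other users' baskets."""
    target = set(target_items)
    scores = Counter()
    for basket in other_baskets:
        basket = set(basket)
        if target & basket:                 # basket shares something with us
            for item in basket - target:    # score the items we don't have yet
                scores[item] += 1
    return [item for item, _ in scores.most_common(top_n)]

baskets = [["hadoop", "hbase"], ["hadoop", "hbase", "hive"], ["pig"]]
print(recommend(["hadoop"], baskets))  # ['hbase', 'hive']
```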


Page 35: Common and unique use cases for Apache Hadoop

•  Other use cases

•  OpenTSDB, an open, distributed, scalable Time Series Database (TSDB)

•  Building search indexes (the canonical use case)

•  Facebook Messaging

•  Cheap and deep storage, e.g. archiving emails for SOX compliance

•  Audit logging

•  Non-Use Cases

•  Data processing is handled by one beefy server

•  Data requires transactions

Final Thoughts
Use the right tool

Page 36: Common and unique use cases for Apache Hadoop

•  Brock Noland

•  [email protected]

•  http://twitter.com/brocknoland

•  TC-HUG: http://tch.ug

About the Presenter