48
1 Headline Goes Here Speaker Name or Subhead Goes Here DO NOT USE PUBLICLY PRIOR TO 10/23/12 ApplicaAon Architectures with Apache Hadoop Mark Grover | @mark_grover NYC HUG slideshare.com/markgrover October 14 th , 2014 ©2014 Cloudera, Inc. All Rights Reserved.

NYC HUG - Application Architectures with Apache Hadoop

Embed Size (px)

DESCRIPTION

Presentation at NYC HUG on Application Architectures with Apache Hadoop

Citation preview

Page 1: NYC HUG - Application Architectures with Apache Hadoop

1

Headline  Goes  Here  Speaker  Name  or  Subhead  Goes  Here  

DO  NOT  USE  PUBLICLY  PRIOR  TO  10/23/12  

ApplicaAon  Architectures  with  Apache  Hadoop  Mark  Grover  |  @mark_grover  NYC  HUG  slideshare.com/markgrover  October  14th,  2014  

©2014 Cloudera, Inc. All Rights Reserved.

Page 2: NYC HUG - Application Architectures with Apache Hadoop

About  Me  •  CommiPer  on  Apache  Bigtop,  commiPer  and  PPMC  member  on  Apache  Sentry  (incubaAng).  

•  Contributor  to  Hadoop,  Hive,  Spark,  Sqoop,  Flume.  •  SoWware  developer  at  Cloudera  • @mark_grover  

2 ©2014 Cloudera, Inc. All Rights Reserved.

Page 3: NYC HUG - Application Architectures with Apache Hadoop

Co-­‐authoring  O’Reilly  book  

• @hadooparchbook  •  hadooparchitecturebook.com  •  Strata  Hadoop  World  Tutorial  

•  at  9  AM  tomorrow  

©2014 Cloudera, Inc. All Rights Reserved. 3

Page 4: NYC HUG - Application Architectures with Apache Hadoop

4

Click  Stream  Analysis  

Case  Study  

©2014 Cloudera, Inc. All Rights Reserved.

Page 5: NYC HUG - Application Architectures with Apache Hadoop

AnalyAcs  

©2014 Cloudera, Inc. All Rights Reserved. 5  

Page 6: NYC HUG - Application Architectures with Apache Hadoop

AnalyAcs  

©2014 Cloudera, Inc. All Rights Reserved. 6  

Page 7: NYC HUG - Application Architectures with Apache Hadoop

AnalyAcs  

©2014 Cloudera, Inc. All Rights Reserved. 7  

Page 8: NYC HUG - Application Architectures with Apache Hadoop

AnalyAcs  

©2014 Cloudera, Inc. All Rights Reserved. 8  

Page 9: NYC HUG - Application Architectures with Apache Hadoop

AnalyAcs  

©2014 Cloudera, Inc. All Rights Reserved. 9  

Page 10: NYC HUG - Application Architectures with Apache Hadoop

AnalyAcs  

©2014 Cloudera, Inc. All Rights Reserved. 10  

Page 11: NYC HUG - Application Architectures with Apache Hadoop

AnalyAcs  

©2014 Cloudera, Inc. All Rights Reserved. 11  

Page 12: NYC HUG - Application Architectures with Apache Hadoop

Web  Logs  –  Combined  Log  Format  

©2014 Cloudera, Inc. All Rights Reserved.

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”

12  

Page 13: NYC HUG - Application Architectures with Apache Hadoop

Clickstream  AnalyAcs  

©2014 Cloudera, Inc. All Rights Reserved.

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”

13  

Page 14: NYC HUG - Application Architectures with Apache Hadoop

Similar  use-­‐cases  

•  Sensors  –  heart,  agriculture,  etc.  •  Casinos  –  session  of  a  person  at  a  table  

©2014 Cloudera, Inc. All Rights Reserved. 14

Page 15: NYC HUG - Application Architectures with Apache Hadoop

Challenges  of  Hadoop  ImplementaAon  

©2014 Cloudera, Inc. All Rights Reserved. 15  

Page 16: NYC HUG - Application Architectures with Apache Hadoop

Challenges  of  Hadoop  ImplementaAon  

©2014 Cloudera, Inc. All Rights Reserved. 16  

Page 17: NYC HUG - Application Architectures with Apache Hadoop

Other  challenges  -­‐  Architectural  ConsideraAons    

•  Storage  managers?  •  HDFS?  HBase?  

•  Data  storage  and  modeling:  •  File  formats?  Compression?  Schema  design?  

•  Data  movement  •  How  do  we  actually  get  the  data  into  Hadoop?  How  do  we  get  it  out?  

•  Metadata  •  How  do  we  manage  data  about  the  data?  

•  Processing  •  How  can  we  transform  it?  How  do  we  query  it?  

•  OrchestraAon  •  How  do  we  manage  the  workflow  for  all  of  this?  

©2014 Cloudera, Inc. All Rights Reserved. 17

Page 18: NYC HUG - Application Architectures with Apache Hadoop

18

Since  that’s  all  what  the  Ame  allows  todayJ  

2.  Processing  

©2014 Cloudera, Inc. All Rights Reserved.

Page 19: NYC HUG - Application Architectures with Apache Hadoop

Processing  

• De-­‐duplicaAon  •  Filtering  •  SessionizaAon  

19 ©2014 Cloudera, Inc. All Rights Reserved.

Page 20: NYC HUG - Application Architectures with Apache Hadoop

DeduplicaAon  –  remove  duplicate  records  

©2014 Cloudera, Inc. All Rights Reserved. 20  

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”

Page 21: NYC HUG - Application Architectures with Apache Hadoop

Filtering  –  filter  out  invalid  records  

©2014 Cloudera, Inc. All Rights Reserved. 21  

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…

Page 22: NYC HUG - Application Architectures with Apache Hadoop

SessionizaAon  

©2014 Cloudera, Inc. All Rights Reserved. 22  

Website  visit  

Visitor  1  Session  1  

Visitor  1  Session  2  

Visitor  2  Session  1  

>  30  minutes  

Page 23: NYC HUG - Application Architectures with Apache Hadoop

Why  sessionize?  

Helps  answers  quesAons  like:  • What  is  my  website’s  bounce  rate?  

•  i.e.  how  many  %  of  visitors  don’t  go  past  the  landing  page?  • Which  markeAng  channels  (e.g.  organic  search,  display  ad,  etc.)  are  leading  to  most  sessions?  

• Which  ones  of  those  lead  to  most  conversions  (e.g.  people  buying  things,  signing  up,  etc.)  

• Do  aPribuAon  analysis  –  which  channels  are  responsible  for  most  conversions?  

23 ©2014 Cloudera, Inc. All Rights Reserved.

Page 24: NYC HUG - Application Architectures with Apache Hadoop

SessionizaAon  

©2014 Cloudera, Inc. All Rights Reserved. 24  

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 165 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” 166

Page 25: NYC HUG - Application Architectures with Apache Hadoop

How  to  Sessionize?  

1.  Given  a  list  of  clicks,  determine  which  clicks  came  from  the  same  user  

2.  Given  a  parAcular  user's  clicks,  determine  if  a  given  click  is  a  part  of  a  new  session  or  a  conAnuaAon  of  the  previous  session  

25 ©2014 Cloudera, Inc. All Rights Reserved.

Page 26: NYC HUG - Application Architectures with Apache Hadoop

#1  –  Which  clicks  are  from  same  user?  

• We  can  use:  •  IP  address  (244.157.45.12)  •  Cookies  (A9A3BECE0563982D)  •  IP  address  (244.157.45.12)and  user  agent  string  ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")  

26 ©2014 Cloudera, Inc. All Rights Reserved.

Page 27: NYC HUG - Application Architectures with Apache Hadoop

#1  –  Which  clicks  are  from  same  user?  

• We  can  use:  •  IP  address  (244.157.45.12)  •  Cookies  (A9A3BECE0563982D)  •  IP  address  (244.157.45.12)and  user  agent  string  ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")  

27 ©2014 Cloudera, Inc. All Rights Reserved.

Page 28: NYC HUG - Application Architectures with Apache Hadoop

#1  –  Which  clicks  are  from  same  user?  

©2014 Cloudera, Inc. All Rights Reserved. 28  

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”

Page 29: NYC HUG - Application Architectures with Apache Hadoop

#2  –  Which  clicks    part  of  the  same  session?  

©2014 Cloudera, Inc. All Rights Reserved. 29  

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”

>  30  mins  apart  =  different  sessions  

Page 30: NYC HUG - Application Architectures with Apache Hadoop

30

github.com/hadooparchitecturebook/clickstream-­‐tutorial  

SessionizaAon  in  MapReduce  

©2014 Cloudera, Inc. All Rights Reserved.

Page 31: NYC HUG - Application Architectures with Apache Hadoop

SessionizaAon  in  MapReduce  

31 ©2014 Cloudera, Inc. All Rights Reserved.

Map  

Reduce  

Reduce  

Log  line   IP1,  log  lines  

IP1,  log  line

s  

Log  line,  session  ID  

Map  

Map  

Log  line  

Log  line   IP2,  log  lines  

IP2,  log  lines   Log  line,  session  ID  

Page 32: NYC HUG - Application Architectures with Apache Hadoop

Mapper  for  SessionizaAon  

32 ©2014 Cloudera, Inc. All Rights Reserved.

 public  static  class  SessionizeMapper                          extends  Mapper<Object,  Text,  IpTimestampKey,  Text>  {                  private  Matcher  logRecordMatcher;                    public  void  map(Object  key,  Text  value,  Context  context                  )  throws  IOException,  InterruptedException  {                          logRecordMatcher  =  logRecordPattern.matcher(value.toString());                            //  We  only  emit  something  out  if  the  record  matches  with  our  regex.  Otherwise,  we  assume  the  record  is  busted  and  simply  ignore  it                          if  (logRecordMatcher.matches())  {                                  String  ip  =  logRecordMatcher.group(1);                                  DateTime  timestamp  =  DateTime.parse(logRecordMatcher.group(2),  TIMESTAMP_FORMATTER);                                  Long  unixTimestamp  =  timestamp.getMillis();                                  IpTimestampKey  outputKey  =  new  IpTimestampKey(ip,  unixTimestamp);                                  context.write(outputKey,  value);                          }                  }          }  

Page 33: NYC HUG - Application Architectures with Apache Hadoop

Reducer  for  SessionizaAon  

33 ©2014 Cloudera, Inc. All Rights Reserved.

public  static  class  SessionizeReducer                                extends  Reducer<IpTimestampKey,  Text,  IpTimestampKey,  Text>  {    

       private  Text  result  =  new  Text();            

       public  void  reduce(IpTimestampKey  key,  Iterable<Text>  values,                                                            Context  context    

       )  throws  IOException,  InterruptedException  {                    //  The  sessionId  generated  here  is  per  day,  per  IP.  So,  any  queries                    //  that  will  be  done  as  if  this  session  ID  were  global,  would  require                    //  a  combination  of  the  day  in  question  and  IP  as  well.                    String  sessionId  =  null;                    Long  lastTimeStamp  =  null;                    for  (Text  value  :  values)  {                    String  logRecord  =  value.toString();    

Page 34: NYC HUG - Application Architectures with Apache Hadoop

Reducer  for  SessionizaAon  

34 ©2014 Cloudera, Inc. All Rights Reserved.

               //  If  this  is  the  first  record  for  this  user  or  it's  been  more  than  the  timeout  since                    //  the  last  click  from  this  user,  let's  increment  the  session  ID.                    if  (lastTimeStamp  ==  null  ||  (key.getUnixTimestamp()  -­‐  lastTimeStamp  >  SESSION_TIMEOUT_IN_MS))  {                            sessionId  =  key.getIp()  +  "+"  +  key.getUnixTimestamp();                    }                    lastTimeStamp  =  key.getUnixTimestamp();                    result.set(logRecord  +  "  "  +  sessionId);                    //  Since  we  only  care  about  printing  out  the  entire  record  in  the  result,  with  session  ID  appended                    //  at  the  end,  we  just  emit  out  "null"  for  the  key                    context.write(null,  result);                    }            }    }      

Page 35: NYC HUG - Application Architectures with Apache Hadoop

Secondary  sorAng  –  by  Amestamp  

• Need  records  to  reducer  to  be  grouped  by  IP  address  and  sorted  by  Amestamp  –  a  concept  called  secondary  sor/ng  

•  Instead  of  using  just  IP  address  as  map  output  key  and  reduce  input  key  

• We  use  a  composite  key  (IP,  Amestamp)  as  map  output  key  and  reduce  input  key  

35 ©2014 Cloudera, Inc. All Rights Reserved.

Page 36: NYC HUG - Application Architectures with Apache Hadoop

Secondary  sorAng  –  vocabulary  

•  Composite  key  –  IP  address,  Amestamp  • Natural  key  –  IP  address  •  Secondary  sort  key  -­‐  Amestamp  

36 ©2014 Cloudera, Inc. All Rights Reserved.

Page 37: NYC HUG - Application Architectures with Apache Hadoop

Secondary  sorAng  

•  Custom  Grouping  Comparator  –  on  Natural  Key  (IP)  •  Custom  Sort  Comparator  –  on  Composite  Key  (IP,  address)  •  Custom  ParAAoner  –  on  Natural  Key  (IP)    job.setGroupingComparatorClass(NaturalKeyComparator.class);        job.setSortComparatorClass(CompositeKeyComparator.class);  job.setPartitionerClass(NaturalKeyPartitioner.class);    

37 ©2014 Cloudera, Inc. All Rights Reserved.

Page 38: NYC HUG - Application Architectures with Apache Hadoop

38

Final  Architecture  

©2014 Cloudera, Inc. All Rights Reserved.

Page 39: NYC HUG - Application Architectures with Apache Hadoop

©2014 Cloudera, Inc. All Rights Reserved. 39  

Hadoop  Cluster  

BI/VisualizaAon  tool  (e.g.  

microstrategy)  

BI  Analysts  

Spark   For  machine  learning  and  graph  processing  

R/Python   StaAsAcal  Analysis  

Custom  Apps  

3.  Accessing  

2.  Processing  

4.  OrchestraAon  via  Oozie  1.  IngesAon  

OperaAonal  Data  Store  

CRM  System  Via  Sqoop  

Web  servers  

Website  users  

Page 40: NYC HUG - Application Architectures with Apache Hadoop

Final  Architecture  –  High  Level  Overview  

40  

Data  Sources   IngesAon  

Data  Storage/Processing  

Data  ReporAng/Analysis  

©2014 Cloudera, Inc. All Rights Reserved.

Page 41: NYC HUG - Application Architectures with Apache Hadoop

Final  Architecture  –  High  Level  Overview  

41  

Data  Sources   IngesAon  

Data  Storage/Processing  

Data  ReporAng/Analysis  

©2014 Cloudera, Inc. All Rights Reserved.

Page 42: NYC HUG - Application Architectures with Apache Hadoop

Final  Architecture  –  IngesAon  

42  

Web  App   Avro  Agent  Web  App   Avro  Agent  

Web  App   Avro  Agent  Web  App   Avro  Agent  

Web  App   Avro  Agent  Web  App   Avro  Agent  

Web  App   Avro  Agent  Web  App   Avro  Agent  

Flume  Agent  

Flume  Agent  

Flume  Agent  

Flume  Agent  

Fan-­‐in    PaPern  

MulA  Agents  for    Failover  and  rolling  restarts  

HDFS    

©2014 Cloudera, Inc. All Rights Reserved.

Page 43: NYC HUG - Application Architectures with Apache Hadoop

Final  Architecture  –  High  Level  Overview  

43  

Data  Sources   IngesAon  

Data  Storage/Processing  

Data  ReporAng/Analysis  

©2014 Cloudera, Inc. All Rights Reserved.

Page 44: NYC HUG - Application Architectures with Apache Hadoop

Final  Architecture  –  Storage  and  Processing  

44  

/etl/weblogs/20140331/  /etl/weblogs/20140401/  …  

Data  Processing  /data/markeAng/clickstream/bouncerate/  /data/markeAng/clickstream/aPribuAon/  …  

©2014 Cloudera, Inc. All Rights Reserved.

Page 45: NYC HUG - Application Architectures with Apache Hadoop

Final  Architecture  –  High  Level  Overview  

45  

Data  Sources   IngesAon  

Data  Storage/Processing  

Data  ReporAng/Analysis  

©2014 Cloudera, Inc. All Rights Reserved.

Page 46: NYC HUG - Application Architectures with Apache Hadoop

Final  Architecture  –  Data  Access  

46  

Hive/Impala  

BI/AnalyAcs  Tools  

DWH  Sqoop  

Local  Disk  

R,  etc.  

DB  import  tool  

JDBC/ODBC  

©2014 Cloudera, Inc. All Rights Reserved.

Page 47: NYC HUG - Application Architectures with Apache Hadoop

Contact  info  • Mark  Grover  

• @mark_grover  •  www.linkedin.com/in/grovermark  

•  Slides  at  slideshare.net/markgrover  

47 ©2014 Cloudera, Inc. All Rights Reserved.

Page 48: NYC HUG - Application Architectures with Apache Hadoop

48 ©2014 Cloudera, Inc. All Rights Reserved.