Application Architectures with Hadoop
Hadoop Users Group UK – November, 2014
slideshare.com/hadooparchbook
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska

Application Architectures with Hadoop - UK Hadoop User Group


DESCRIPTION

Application Architectures with Hadoop presentation at UK Hadoop User's Group on Nov 17, 2014


About the book
•  @hadooparchbook
•  hadooparchitecturebook.com
•  github.com/hadooparchitecturebook
•  slideshare.com/hadooparchbook

©2014 Cloudera, Inc. All Rights Reserved.


About Us
•  Mark
–  Software Engineer
–  Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
–  Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
•  Ted
–  Principal Solutions Architect
–  Previously Lead Architect at FINRA
–  Contributor to Apache Hadoop, HBase, Spark, Flume, Avro and Pig


Case Study Clickstream Analysis


Analytics


Web Logs – Combined Log Format

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
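
A minimal sketch of how one Combined Log Format line could be split into fields, assuming Python and a regular expression written for this example. The field names (`ip`, `ts`, `referrer`, `agent`, and so on) and the function name `parse_line` are illustrative and do not come from the slides. The sample timestamps carry no timezone offset, so only the date-time part is parsed.

```python
import re
from datetime import datetime

# Combined Log Format:
# host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match (hypothetical helper)."""
    m = LOG_PATTERN.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # Keep only the date-time part; any timezone offset after the space is ignored here.
    rec["ts"] = datetime.strptime(rec["ts"].split()[0], "%d/%b/%Y:%H:%M:%S")
    rec["status"] = int(rec["status"])
    return rec

sample = ('244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 '
          '"http://bestcyclingreviews.com/top_online_shops" '
          '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"')
record = parse_line(sample)
```

Returning `None` for non-matching lines gives a later filtering step a simple way to drop incomplete records.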


Clickstream Analytics

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"


Challenges of Hadoop Implementation


Hadoop Architectural Considerations
•  Storage managers?
–  HDFS? HBase?
•  Data storage and modeling:
–  File formats? Compression? Schema design?
•  Data movement
–  How do we actually get the data into Hadoop? How do we get it out?
•  Metadata
–  How do we manage data about the data?
•  Data access and processing
–  How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
•  Orchestration
–  How do we manage the workflow for all of this?


Architectural Considerations Data Storage and Modeling


Data Modeling Considerations
•  We need to consider the following in our architecture:
–  Storage layer – HDFS? HBase? Etc.
–  File system schemas – how will we lay out the data?
–  File formats – what storage formats to use for our data, both raw and processed data?
–  Data compression formats?


Architectural Considerations Data Modeling – Storage Layer


Data Storage Layer Choices
•  Two likely choices for raw data: HDFS and HBase
•  HDFS
–  Stores data directly as files
–  Fast scans
–  Poor random reads/writes
•  HBase
–  Stores data as HFiles on HDFS
–  Slow scans
–  Fast random reads/writes


Data Storage – Storage Manager Considerations
•  Incoming raw data:
–  Processing requirements call for batch transformations across multiple records – for example sessionization.
•  Processed data:
–  Access to processed data will be via things like analytical queries – again requiring access to multiple records.
•  We choose HDFS
–  Processing needs in this case are better served by fast scans.


Architectural Considerations Data Modeling – Data Storage Format


Our Format Choices…
•  Raw data
–  Avro with Snappy
•  Processed data
–  Parquet


Architectural Considerations Data Modeling – HDFS Schema Design


Recommended HDFS Schema Design
•  How to lay out data on HDFS?
–  /user/<username> – User-specific data, jars, conf files
–  /etl – Data in various stages of ETL workflow
–  /tmp – Temp data from tools or shared between users
–  /data – Shared data for the entire organization
–  /app – Everything but data: UDF jars, HQL files, Oozie workflows


Architectural Considerations Data Modeling – Advanced HDFS Schema Design


Partitioning

Un-partitioned HDFS directory structure:
dataset
    file1.txt
    file2.txt
    …
    filen.txt

Partitioned HDFS directory structure:
dataset
    col=val1/file.txt
    col=val2/file.txt
    …
    col=valn/file.txt


Partitioning considerations
•  What column to partition by?
–  Don’t have too many partitions (<10,000)
–  Don’t have too many small files in the partitions
–  Good to have partition sizes at least ~1 GB
•  We’ll partition by timestamp. This applies to both our raw and processed data.
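
As a rough illustration of partitioning by timestamp, the sketch below builds a `col=val` style directory for a click. The day-level granularity, the `year=/month=/day=` column names, and the base path `/data/clickstream` are assumptions made for this example; the slides say only that the data is partitioned by timestamp.

```python
from datetime import datetime

def partition_path(base, ts):
    """Build a col=val style partition directory from a click timestamp (hypothetical layout)."""
    return "{}/year={:04d}/month={:02d}/day={:02d}".format(
        base, ts.year, ts.month, ts.day)

# Base path and granularity are illustrative, not taken from the slides.
path = partition_path("/data/clickstream", datetime(2014, 10, 17, 21, 8, 30))
```

Coarser or finer granularity changes the partition count and file sizes, so it should be checked against the guidelines above (fewer than 10,000 partitions, roughly 1 GB or more per partition).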


Architectural Considerations Data Ingestion


File Transfers
•  "hadoop fs -put <file>"
•  Reliable, but not resilient to failure.
•  Other options are mountable HDFS, for example NFSv3.


Streaming Ingestion
•  Flume
–  Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs.
•  Kafka
–  Reliable and distributed publish-subscribe messaging system.


Flume vs. Kafka
•  Flume
–  Purpose built for Hadoop data ingest.
–  Pre-built sinks for HDFS, HBase, etc.
–  Supports transformation of data in-flight.
•  Kafka
–  General pub-sub messaging framework.
–  Just a message transport.
–  Have to use a third-party tool to ingest.


Flume and Kafka
•  Kafka Source
•  Kafka Channel


Short Intro to Flume
Flume Agent: Sources → Interceptors → Selectors → Channels → Sinks
•  Sources: Twitter, logs, JMS, webserver
•  Interceptors: Mask, re-format, validate…
•  Selectors: DR, critical
•  Channels: Memory, file, Kafka
•  Sinks: HDFS, HBase, Solr


A Brief Discussion of Flume Patterns – Fan-in

•  Flume agent runs on each of our servers.

•  These agents send data to multiple agents to provide reliability.

•  Flume provides support for load balancing.


Ingestion Decisions
•  Historical Data
–  File transfer
•  Incoming Data
–  Flume with the spooling directory source.
•  Relational Data Sources – ODS, CRM, etc.
–  Sqoop


Architectural Considerations Data Processing – Engines


Processing Engines
•  MapReduce
•  Abstractions – Pig, Hive, Cascading, Crunch
•  Spark
•  Impala


MapReduce
•  Oldie but goody
•  Restrictive Framework / Innovated Work Around
•  Extreme Batch


MapReduce Basic High Level
[Diagram: a Mapper reads a block of data from HDFS (replicated) and writes temp spill data and partitioned, sorted data to the native file system; each Reducer pulls a local copy of its partition and writes an output file.]


Abstractions
•  SQL
–  Hive
•  Script/Code
–  Pig: Pig Latin
–  Crunch: Java/Scala
–  Cascading: Java/Scala


Spark
•  The New Kid that isn’t that New Anymore
•  Easily 10x less code
•  Extremely Easy and Powerful API
•  Very good for machine learning
•  Scala, Java, and Python
•  RDDs
•  DAG Engine


Impala
•  Real-time open source MPP style engine for Hadoop
•  Doesn’t build on MapReduce
•  Written in C++, uses LLVM for run-time code generation
•  Can create tables over HDFS or HBase data
•  Accesses Hive metastore for metadata
•  Access available via JDBC/ODBC


Architectural Considerations Data Processing – What processing needs to happen?


What processing needs to happen?
•  Sessionization
•  Filtering
•  Deduplication
•  BI / Discovery


Sessionization
[Diagram: website visits on a timeline, grouped into Visitor 1 Session 1, Visitor 1 Session 2, and Visitor 2 Session 1; a gap of > 30 minutes between visits starts a new session.]


Why sessionize?
Helps answer questions like:
•  What is my website’s bounce rate?
–  i.e. what percentage of visitors don’t go past the landing page?
•  Which marketing channels (e.g. organic search, display ad, etc.) are leading to most sessions?
–  Which ones of those lead to most conversions (e.g. people buying things, signing up, etc.)?
•  Do attribution analysis – which channels are responsible for most conversions?
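
To make the bounce-rate question concrete, here is a small sketch that assumes sessions have already been built and that each session is represented as a list of visited pages. That representation and the function name `bounce_rate` are assumptions for illustration. A bounce is counted as a session with exactly one page view.

```python
def bounce_rate(sessions):
    """Fraction of sessions that never go past the landing page (one page view)."""
    if not sessions:
        return 0.0
    bounces = sum(1 for pages in sessions if len(pages) == 1)
    return bounces / len(sessions)

# Hypothetical sessions: each inner list is the ordered pages of one session.
rate = bounce_rate([
    ["/home"],
    ["/home", "/seatposts"],
    ["/Store/cart.jsp"],
    ["/home", "/seatposts", "/Store/cart.jsp"],
])
```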


How to Sessionize?


1.  Given a list of clicks, determine which clicks came from the same user

2.  Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session
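
The two steps above can be sketched in plain Python as follows. This is not the MR or Spark code the authors reference; it is a single-machine illustration that assumes each click has already been reduced to a `(user_key, timestamp)` pair, where `user_key` is whatever identifies a user (IP address alone in the example). The 30-minute gap is the threshold shown on the slides; the `user#n` session-id format is made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # threshold from the slides

def sessionize(clicks):
    """clicks: iterable of (user_key, timestamp).
    Returns a list of (user_key, timestamp, session_id) tuples."""
    # Step 1: group clicks by user.
    by_user = defaultdict(list)
    for user, ts in clicks:
        by_user[user].append(ts)

    out = []
    for user, stamps in by_user.items():
        stamps.sort()
        session = 0
        prev = None
        # Step 2: a click starts a new session if it is the user's first,
        # or if it comes more than SESSION_GAP after the previous click.
        for ts in stamps:
            if prev is None or ts - prev > SESSION_GAP:
                session += 1
            out.append((user, ts, "{}#{}".format(user, session)))
            prev = ts
    return out

clicks = [
    ("244.157.45.12", datetime(2014, 10, 17, 21, 8, 30)),
    ("244.157.45.12", datetime(2014, 10, 17, 21, 59, 59)),
    ("244.157.45.12", datetime(2014, 10, 17, 21, 10, 0)),
]
result = sessionize(clicks)
session_ids = [sid for _, _, sid in result]
```

In a distributed engine the grouping step corresponds to a shuffle on the user key, and the per-user loop runs inside the reducer (or an equivalent grouped operation).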


#1 – Which clicks are from same user?
•  We can use:
–  IP address (244.157.45.12)
–  Cookies (A9A3BECE0563982D)
–  IP address (244.157.45.12) and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")


#1 – Which clicks are from same user?

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"


#2 – Which clicks are part of the same session?

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"

> 30 mins apart = different sessions


Sessionization engine recommendation
•  We have sessionization code in MR and Spark on GitHub. The complexity of the code varies; the right choice depends on the expertise in the organization.
•  We choose MR, since it’s fairly simple and maintainable code.


Filtering – filter out incomplete records

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…


Filtering – filter out records from bots/spiders

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"

Google spider IP address: 209.85.238.11


Filtering recommendation
•  Bot/Spider filtering can be done easily in any of the engines
•  Incomplete records are harder to filter in schema systems like Hive, Impala, Pig, etc.
•  Pretty close choice between MR, Hive and Spark
•  Can be done in Flume interceptors as well
•  We can simply embed this in our sessionization job
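
A hedged sketch of what such a filter might look like when embedded in a processing job. It assumes each log line has already been parsed into a dict, or `None` when parsing failed. The required-field list, the `BOT_IPS` set (seeded with the spider address from the slide), and the function name `keep` are all illustrative choices, not the authors' implementation.

```python
REQUIRED_FIELDS = ("ip", "ts", "request", "agent")  # assumed minimum for a complete record
BOT_IPS = {"209.85.238.11"}  # example spider address from the slide; a real list would be larger

def keep(record):
    """Return True if the record is complete and does not come from a known bot."""
    if record is None:
        return False  # line could not be parsed at all
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return False  # incomplete record
    return record["ip"] not in BOT_IPS

records = [
    {"ip": "244.157.45.12", "ts": "17/Oct/2014:21:08:30", "request": "GET /seatposts HTTP/1.0",
     "agent": "Mozilla/5.0"},
    {"ip": "244.157.45.12", "ts": "17/Oct/2014:21:59:59", "request": "GET /Store/cart.jsp HTTP/1.0",
     "agent": ""},          # truncated user agent: incomplete
    {"ip": "209.85.238.11", "ts": "17/Oct/2014:21:59:59", "request": "GET /Store/cart.jsp HTTP/1.0",
     "agent": "Mozilla/5.0"},  # known spider
    None,                   # unparseable line
]
clean = [r for r in records if keep(r)]
```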


Deduplication – remove duplicate records

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"


Deduplication recommendation
•  Can be done in all engines.
•  We already have a Hive table with all the columns; a simple DISTINCT query will perform deduplication
•  We use Pig
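
For illustration only, the effect of a DISTINCT over whole records can be sketched in Python as below. The slides use Pig for this step; this snippet is a stand-in that shows the same idea on a small in-memory list, and unlike a distributed DISTINCT it also preserves first-seen order.

```python
def deduplicate(lines):
    """Drop exact duplicate records, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique

# Shortened, hypothetical log lines; the first two are exact duplicates.
logs = [
    '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463',
    '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463',
    '244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757',
]
deduped = deduplicate(logs)
```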


BI/Discovery engine recommendation
•  Main requirements for this are:
–  Low latency
–  SQL interface (e.g. JDBC/ODBC)
–  Users don’t know how to code
•  We chose Impala
–  It’s a SQL engine
–  Much faster than other engines
–  Provides standard JDBC/ODBC interfaces


Architectural Considerations Orchestration


Choosing the right Orchestration Tool
•  Workflow is fairly simple
•  Need to trigger workflow based on data
•  Be able to recover from errors
•  Perhaps notify on the status
•  And collect metrics for reporting

[Slide callouts: "Easier in Oozie"; "Better in Azkaban"]


Important Decision Consideration!
•  The best orchestration tool is the one you are an expert on
–  Oozie
–  Spark Streaming, etc. don’t require an orchestration tool


Putting It All Together Final Architecture


Final architecture
[Diagram: a Hadoop Cluster at the center, with four numbered stages around it]
1. Ingestion
–  Website users → Web servers → Web logs, via Flume
–  Operational Data Store and CRM System, via Sqoop
2. Processing
3. Accessing
–  BI/Visualization tool (e.g. MicroStrategy), used by BI Analysts
–  Spark, for machine learning and graph processing
–  R/Python, for statistical analysis
–  Custom Apps
4. Orchestration


Thank you