36
© 2018 MapR Technologies MapR Confidential 1 Leon Clayton - Principal Solutions Engineer Hadoop Essentials

MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 1

Leon Clayton - Principal Solutions Engineer

Hadoop Essentials

Page 2: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

{“about” : “me”}

Leon Clayton• MapR

• Solution Architect • EMC

• EMEA Isilon Lead• NetApp

• Performance Team Lead• Philips

• HPC team lead • Royal Navy

• Type 42 – Real time systems engineer

[email protected]

Page 3: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 3

Google

Page 4: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 4

History of Hadoop & Map Reduce

Page 5: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 5

Google Solution

GFS + Big Table + Map/Reduce

Page 6: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 6

Google File System (GFS)

Page 7: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 7

BigTable

TABLET

Page 8: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 8

Map/Reduce

Page 9: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 9

Hadoop

Page 10: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 10

Exploit Locality of Reference

datadata

functionfunction

datadata

functionfunction

datadata

functionfunction

datadata

functionfunction

datadata

functionfunction

datadata

functionfunction

datadata

functionfunction

datadata

functionfunction

datadata

functionfunction

datadata

functionfunction

datadata

functionfunction

datadata

functionfunction

Hadoop Architecture

SAN/NAS

datadata datadata datadata

datadata datadata datadata

datadata datadata datadata

datadata datadata datadata

functionfunction

RDBMS

Traditional Architecture

functionfunction

App

functionfunction

App

functionfunction

App

Page 11: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 11

Hadoop

• MapReduce is a programing model for distributed computing – Google popularized it - 2004– Apache Hadoop copied it - 2005– MapR Technologies improved it - 2009

• Hadoop storage and analysis:– 10.000s of disks on – 1000s of machines with near linear scaling– Commodity hardware (CPU, RAM, disk, etc...)– No specialized hardware– Handle Big Data – Petabytes and more

Page 12: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 12

GFS + Big Table + Map/Reduce

Major Components of Hadoop

Page 13: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 13

MapR

GFS + Big Table + Map/Reduce

Page 14: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 14

Phases of MapReduce

Map Job partitioned into “splits”

Combine (optional) Makes work easier for the reducer(s)

Shuffle and sort Map output sent to reducer(s) using a hash

Reduce

Page 15: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 15

Inside MapReduce

Input Map Combine

Shuffleand sort

Reduce Output

Reduce

"The time has come," the Walrus said,"To talk of many things:Of shoes—and ships—and sealing-wax

"The time has come," the Walrus said,"To talk of many things:Of shoes—and ships—and sealing-wax

the, 1time, 1has, 1come, 1…

the, 1time, 1has, 1come, 1…

come, [3,2,1]has, [1,5,2]the, [1,2,1]time, [10,1,3]…

come, [3,2,1]has, [1,5,2]the, [1,2,1]time, [10,1,3]…

come, 6has, 8the, 4time, 14…

come, 6has, 8the, 4time, 14…

Page 16: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 16

Hive and Pig

• Hive: data warehousing application in Hadoop– Query language is HQL, variant of SQL– Tables stored on HDFS as flat files– Developed by Facebook, now open source

• Pig: large-scale data processing system– Scripts are written in Pig Latin, a dataflow language– Developed by Yahoo!, now open source– Roughly 1/3 of all Yahoo! internal jobs

• Common idea:– Provide higher-level language to facilitate large-data processing– Higher-level language “compiles down” to MapReduce jobs

Page 17: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 17

Hive Background

• Started at Facebook• Data was collected by nightly cron jobs into Oracle DB• “ETL” via hand-coded python• Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now

10x that

Source: cc-licensed slide by Cloudera

Page 18: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 18

Hive Data Model

• Tables– Typed columns (int, float, string, boolean)– Also, list: map (for JSON-like data)

• Partitions– For example, range-partition tables by date

• Buckets– Hash partitions within ranges (useful for sampling, join optimization)

Source: cc-licensed slide by Cloudera

Page 19: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 19

Hive: Example• Hive looks similar to an SQL database• Relational join on two tables:

– Table of word counts from Shakespeare collection– Table of word counts from the bible

SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1 ORDER BY s.freq DESC LIMIT 10;

the 25848 62394I 23031 8854and 19671 38985to 18038 13526of 16700 34654a 14170 8057you 12702 2720my 11297 4135in 10797 12445is 8882 6884

Page 20: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 20

Hadoop Coding Example

– https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm

Lets talk through this

Page 21: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 21

The Power of the Open Source Community

APACHE HADOOP AND OSS ECOSYSTEM

Security

YARN

Spark Streaming

Storm

StreamingNoSQL & Search

Juju

Provisioning &

Coordination

Sahara

ML, Graph

Mahout

MLLib

GraphX

EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS

Workflow & Data

Governance

Pig

Cascading

Spark

Batch

MapReduce v1 & v2

Tez

HBase

Solr

Hive

Impala

Spark SQL

Drill

SQL

Sentry Oozie ZooKeeperSqoop

Flume

Data Integration& Access

HttpFS

Hue

Data PlatformMapR-FS MapR-DB

Man

agem

ent

Page 22: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 22

Example Big Data Architecture – Hands on architecture

Paste your MapR distribution for Hadoop diagram from Part A, (slide 2) here MapR-DB MapR-FS

MapR Data Platform

Distribution including Apache Hadoop

Agile, self-service data exploration using ANSI SQL

Sentiment Analysis Stanford NLP

Multi-tenancy: job/data placement

control, volumes

Access controls: file, table, column,

column family, doc, sub-doc levels

SourcesRELATIONAL, SAAS, MAINFRAME

DOCUMENTS, EMAILS

LOG FILES, CLICKSTREAMSENSORS

BLOGS, TWEETS,LINK DATA

DATA WAREHOUSES, DATA MARTS

Auditing: compliance, analyze user accesses

Snapshots:track data lineage

and history

Page 23: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 23

Lambda Architecture

BATCH RECOMPUTEBATCH RECOMPUTE

REAL-TIME VIEWSREAL-TIME VIEWS

REAL-TIME INCREMENTREAL-TIME INCREMENT

BATCH VIEWSBATCH VIEWSDATA

SOURCEDATA

SOURCE MERGED VIEW

MERGED VIEW

PRECOMPUTE VIEWSPRECOMPUTE VIEWSALL DATAALL DATA

INCREMENT VIEWSINCREMENT VIEWSPROCESS STREAMPROCESS STREAM

REAL-TIME DATAREAL-TIME DATA

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

BATCH LAYER

SPEED LAYER

SERVING LAYER

MERGE

Δ

ANALYTICSANALYTICS

APPLICATIONAPPLICATION

Page 24: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 24

Lambda Architecture

Overview• Invented by Nathan Marz at Twitter• Very fault tolerant, because of recomputation• Combines the complete history with the newest data• Real-time results

Page 25: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 25

Lambda Architecture

BATCH RECOMPUTEBATCH RECOMPUTE

REAL-TIME VIEWSREAL-TIME VIEWS

REAL-TIME INCREMENTREAL-TIME INCREMENT

BATCH VIEWSBATCH VIEWSDATA

SOURCEDATA

SOURCE MERGED VIEW

MERGED VIEW

PRECOMPUTE VIEWSPRECOMPUTE VIEWSALL DATAALL DATA

INCREMENT VIEWSINCREMENT VIEWSPROCESS STREAMPROCESS STREAM

REAL-TIME DATAREAL-TIME DATA

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

BATCH LAYER

SPEED LAYER

SERVING LAYER

MERGE

Δ

ANALYTICSANALYTICS

APPLICATIONAPPLICATION

Page 26: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 26

Lambda Architecture

Data ingest• A message queue that feeds the system• Apache Kafka

Page 27: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 27

Lambda Architecture

BATCH RECOMPUTEBATCH RECOMPUTE

REAL-TIME VIEWSREAL-TIME VIEWS

REAL-TIME INCREMENTREAL-TIME INCREMENT

BATCH VIEWSBATCH VIEWSDATA

SOURCEDATA

SOURCE MERGED VIEW

MERGED VIEW

PRECOMPUTE VIEWSPRECOMPUTE VIEWSALL DATAALL DATA

INCREMENT VIEWSINCREMENT VIEWSPROCESS STREAMPROCESS STREAM

REAL-TIME DATAREAL-TIME DATA

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

BATCH LAYER

SPEED LAYER

SERVING LAYER

MERGE

Δ

ANALYTICSANALYTICS

APPLICATIONAPPLICATION

Page 28: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 28

Lambda Architecture

Master Data Set• Append only• Data is stored raw• MapR-FS / HDFS• Different users for appending and reading

Page 29: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 29

Lambda Architecture

BATCH RECOMPUTEBATCH RECOMPUTE

REAL-TIME VIEWSREAL-TIME VIEWS

REAL-TIME INCREMENTREAL-TIME INCREMENT

BATCH VIEWSBATCH VIEWSDATA

SOURCEDATA

SOURCE MERGED VIEW

MERGED VIEW

PRECOMPUTE VIEWSPRECOMPUTE VIEWSALL DATAALL DATA

INCREMENT VIEWSINCREMENT VIEWSPROCESS STREAMPROCESS STREAM

REAL-TIME DATAREAL-TIME DATA

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

BATCH LAYER

SPEED LAYER

SERVING LAYER

MERGE

Δ

ANALYTICSANALYTICS

APPLICATIONAPPLICATION

Page 30: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 30

Lambda Architecture

Batch processes• Runs periodically • If a bug wreaks havoc, just fix + recompute• Data that comes in while a new batch is processing goes

via the speed layer• Typically compute aggregates or train models

Page 31: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 31

Lambda Architecture

BATCH RECOMPUTEBATCH RECOMPUTE

REAL-TIME VIEWSREAL-TIME VIEWS

REAL-TIME INCREMENTREAL-TIME INCREMENT

BATCH VIEWSBATCH VIEWSDATA

SOURCEDATA

SOURCE MERGED VIEW

MERGED VIEW

PRECOMPUTE VIEWSPRECOMPUTE VIEWSALL DATAALL DATA

INCREMENT VIEWSINCREMENT VIEWSPROCESS STREAMPROCESS STREAM

REAL-TIME DATAREAL-TIME DATA

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

BATCH LAYER

SPEED LAYER

SERVING LAYER

MERGE

Δ

ANALYTICSANALYTICS

APPLICATIONAPPLICATION

Page 32: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 32

Lambda Architecture

Streaming• Process events immediately as they arrive• Produces incremental updates or predictions• E.g. “+1 view for URL x” or “Customer y belongs to cluster

z”

Page 33: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 33

Lambda Architecture

BATCH RECOMPUTEBATCH RECOMPUTE

REAL-TIME VIEWSREAL-TIME VIEWS

REAL-TIME INCREMENTREAL-TIME INCREMENT

BATCH VIEWSBATCH VIEWSDATA

SOURCEDATA

SOURCE MERGED VIEW

MERGED VIEW

PRECOMPUTE VIEWSPRECOMPUTE VIEWSALL DATAALL DATA

INCREMENT VIEWSINCREMENT VIEWSPROCESS STREAMPROCESS STREAM

REAL-TIME DATAREAL-TIME DATA

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

PARTIAL AGGREGATE

BATCH LAYER

SPEED LAYER

SERVING LAYER

MERGE

Δ

ANALYTICSANALYTICS

APPLICATIONAPPLICATION

Page 34: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 34

Lambda Architecture

Views• This is very application specific

– Aggregates might fit in a regular SQL table– The merged view might need some custom work

• Implementations range from SQL to NoSQL– Postgres, MS-SQL– Hbase (with Phoenix)– Cassandra– Elastic– …

Page 35: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 35

Lambda Architecture

Lambda + Spark = profit• “Code duplication” biggest criticism• Spark offers batch and streaming paradigms• At least event interpretation can be shared

Page 36: MapR Converged Data Platformicg.port.ac.uk/~schewtsj/hpc2018/HadoopEssentials_LambdaArchite… · {“about” : “me”} Leon Clayton • MapR • Solution Architect • EMC •

© 2018 MapR TechnologiesMapR Confidential 36

Learn more: