Big Data
Hadoop: A Tour of the Zoo
Content
▪ Setup
▪ Introduction
▪ Who Uses it
▪ How does it work
▪ What was new in Hadoop 2
▪ Eco-System
▪ Hadoop Distributions
▪ Practical: meeting the shell
▪ Demo: Pig and Hive
Setup
1. Go to http://hortonworks.com/products/hortonworks-sandbox/#install and download version 2.2.4 of the sandbox
2. Install VirtualBox
https://www.virtualbox.org/wiki/Downloads
3. Remarks:
Chrome might corrupt the download; try Safari, Firefox or IE.
Introduction: Goals of this workshop
▪ Internal working
▪ HDFS
▪ Eco System
▪ Shell
▪ Upcoming workshops
Who Uses It
▪ Yahoo
▪ Port of Rotterdam
▪ Spotify
Who Uses It: Yahoo
Who Uses It: Pinterest
Internals: Port of Rotterdam
Internals: Spotify
▪ Reporting to record labels and rights holders
▪ Creating top lists of the most popular music right now
▪ Ad analysis
▪ Intelligent radio and discovery features
How Does It Work
▪ HDFS: Intro
▪ Assumptions and Goals
▪ Key Features
▪ How does it work
▪ MapReduce
▪ Eco System
How Does It Work: HDFS
How Does It Work: Assumptions And Goals
How Does It Work: Key Features
Rack Awareness
Minimal Data Motion
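Rack awareness can be illustrated with a simplified sketch of the default HDFS placement rule for a replication factor of 3: the first replica stays on the writer's node, the second goes to a node on a different rack, and the third goes to another node on that same remote rack. The function name `place_replicas`, the `topology` dictionary, and the node and rack names are hypothetical and chosen only for illustration; the real NameNode logic handles many more cases (full disks, writers outside the cluster, other replication factors).

```python
import random


def place_replicas(writer, topology, rng=random):
    """Pick three DataNodes for one block (simplified, replication factor 3).

    topology maps a rack name to the list of node names on that rack;
    writer is the node that writes the block and must appear in topology.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]

    # Candidate racks: any rack other than the writer's with room for two replicas.
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two DataNodes")

    remote_rack = rng.choice(remote_racks)
    # Replicas 2 and 3: two different nodes on the same remote rack.
    second, third = rng.sample(topology[remote_rack], 2)
    # Replica 1: the writer's own node, so the first copy needs no network hop.
    return [writer, second, third]


# Hypothetical two-rack cluster.
topology = {
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
}
replicas = place_replicas("node1", topology)
print(replicas)
```

Keeping one copy local and two on a single remote rack survives the loss of a whole rack while limiting cross-rack traffic to one transfer per block.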
How Does It Work: Key Features
Utilities
Rollback
How Does It Work: Key Features
Standby NameNode
Operability
How Does It Work: HDFS Architecture
How Does It Work: Client Reading Files
How Does It Work: MapReduce - server roles
How Does It Work: MapReduce
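Input: AB CA Ab Bc AC cD
Splits:
▪ split 1: ABCA
▪ split 2: AbBc
▪ split 3: ACcd
MAP output (one MAP task per split, letter case ignored):
▪ split 1: (A, 1) (A, 1) (B, 1) (C, 1)
▪ split 2: (A, 1) (B, 1) (B, 1) (C, 1)
▪ split 3: (A, 1) (C, 1) (C, 1) (D, 1)
Reducer output (two Reducers), which is also the final result:
▪ A, 4
▪ B, 3
▪ C, 4
▪ D, 1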
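The letter count on this slide can be replayed in a few lines of plain Python. This is a single-process sketch of the map, shuffle and reduce steps, not the Hadoop API; the function names `map_phase`, `shuffle` and `reduce_phase` are made up for illustration, and the three splits are taken from the slide.

```python
from collections import defaultdict


def map_phase(split):
    """Emit (letter, 1) for every letter in a split, ignoring case."""
    return [(ch.upper(), 1) for ch in split if ch.isalpha()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


splits = ["ABCA", "AbBc", "ACcd"]

# Each split would be handled by its own MAP task on the cluster.
mapped = [pair for split in splits for pair in map_phase(split)]

# Reducers receive one key at a time together with all of its values.
counts = dict(reduce_phase(key, values)
              for key, values in sorted(shuffle(mapped).items()))
print(counts)  # {'A': 4, 'B': 3, 'C': 4, 'D': 1}
```

On a real cluster the map calls run in parallel on the nodes that hold the splits, and the shuffle moves data over the network; the logic per key stays the same.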
Hadoop 1 vs 2
Hadoop 1 vs 2: Federation
Hadoop 1 vs 2: High Availability
Eco-System
HDFS
YARN
MAPREDUCE
TEZ
SPARK
HBASE
HIVE
HCATALOG
PIG
Mahout
SQOOP
Flume
Zookeeper
ORC
Crunch
Oozie
DRILL
STORM
DATA
Curator
KAFKA
Eco-System: Pig
● Map Reduce
● Directed Acyclic Graph
● Analyze
● Pig Latin
Eco-System: Hive
● HiveQL (SQL)
● Map Reduce
● Analyze
● Schema on Read
Eco-System: HCatalog
● Part of Hive
● REST services
● Table Management
● Relational View
Eco-System: ORC
● Store Hive data
● Metadata in file
● File Format
● Compression
Optimized Row Columnar
Eco-System
HDFS
MAPREDUCE
HIVE
HCATALOG
PIG
ORC
Eco-System: Mahout
● Data Mining
● MapReduce
● Machine Learning
● Distributed
Eco-System: Zookeeper
● Ordered
● Centralized Service
● Reliability
● Fast
Eco-System: Curator
● Recipes
● Simplifies Zookeeper
● Made by Netflix
Eco-System: YARN
● Resource Manager
● MapReduce 2.0
● Multiple Data Processing Options
Yet Another Resource Negotiator
Eco-System
HDFS
YARN
MAPREDUCE
HIVE
HCATALOG
PIG
Mahout
Zookeeper
ORC
Curator
Eco-System: Oozie
● Oozie Workflow
● Workflow Scheduler
● Oozie Coordinator
● Oozie Bundle
Eco-System: Tez
● Dataflow Graph
● Improves Map Reduce
● Dynamically Reconfigure
Eco-System: Sqoop
● Data Imports / Exports RDBMS
● Java POJO
Eco-System: HBase
● Fast
● Fault Tolerant
● Usable
● Use Cases
Eco-System
HDFS
YARN
MAPREDUCE
TEZ
HBASE
HIVE
HCATALOG
PIG
Mahout
SQOOP
Zookeeper
ORC
Oozie
Curator
Eco-System: Crunch
● Developer Focused
● Pipeline
● Flexible Data Model
Eco-System: Drill
● Schema-Free JSON Model
● Query Any Datastore
● SQL (SQL:2003 syntax)
Eco-System: Storm
● No Data-Loss
● Stream Processing
● Scalable
Eco-System: Flume
● Buffer Incoming Data
● Stream Data
● Guarantee Data Delivery
● Scalable
Eco-System
HDFS
YARN
MAPREDUCE
TEZ
HBASE
HIVE
HCATALOG
PIG
Mahout
SQOOP
Flume
Zookeeper
ORC
Crunch
Oozie
DRILL
STORM
DATA
Curator
Eco-System: Spark
● More than Map/Reduce
● Memory
● Java, Scala, Python and R
● SQL
Eco-System: Kafka
● Scalable
● Fast
● Durable
● Distributed By Design
Eco-System
HDFS
YARN
MAPREDUCE
TEZ
SPARK
HBASE
HIVE
HCATALOG
PIG
Mahout
SQOOP
Flume
Zookeeper
ORC
Crunch
Oozie
DRILL
STORM
DATA
Curator
KAFKA
Hadoop Distributions
Hands on - HDFS
HDFS: Many ways of input
HDFS
HDFS Client
Very POSIX (UNIX) like:
hadoop fs -put
hadoop fs -get
hadoop fs -mkdir
hadoop fs -ls
hadoop fs -cp
hadoop fs -mv
hadoop fs -rm
hadoop fs -chmod
...
Hands on - HDFS: Objectives
▪ Working with the HDFS Client
▪ Find where blocks are stored
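Both objectives can be scripted from Python as a dry run. The sketch below only builds and prints the command lines; it executes them when `dry_run=False`, which assumes the `hadoop` and `hdfs` commands are on the PATH (as they should be inside the sandbox). The helper names and the `/user/demo` paths are hypothetical. `hdfs fsck <path> -files -blocks -locations` is the standard way to list the blocks of a file and the DataNodes that hold them.

```python
import shlex
import subprocess


def hadoop_fs(subcommand, *args):
    """Build the argument list for `hadoop fs -<subcommand> <args...>`."""
    return ["hadoop", "fs", f"-{subcommand}", *args]


def fsck_block_locations(path):
    """Build the fsck call that reports a file's blocks and their DataNodes."""
    return ["hdfs", "fsck", path, "-files", "-blocks", "-locations"]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


# Hypothetical paths, used only for illustration.
commands = [
    hadoop_fs("mkdir", "/user/demo"),
    hadoop_fs("put", "data.txt", "/user/demo/"),
    hadoop_fs("ls", "/user/demo"),
    fsck_block_locations("/user/demo/data.txt"),
]
for cmd in commands:
    run(cmd)
```

Passing the arguments as a list instead of a single string avoids shell quoting problems with file names that contain spaces.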
Hands on - Pig & Hive Preview
Upcoming Workshops
▪ September: A visit to the animal farm: Babe eh … Pig
▪ October: Bee a master - Hive
▪ November: Streaming (Storm & Flume)
Questions or Suggestions?