Map Reduce & Hadoop
June 3, 2015
HS Oh, HR Lee, JY Choi, YS Lee, SH Choi
2
Outline
Part 1: Introduction to Hadoop; MapReduce Tutorial with Simple Example; Hadoop v2.0: YARN
Part 2: MapReduce; Hive; Stream Data Processing: Storm; Spark; Up-to-date Trends
3
MapReduce
Overview; Task flow; Shuffle configurables; Combiner; Partitioner; Custom Partitioner Example; Number of Maps and Reduces; How to write MapReduce functions
4
MapReduce Overview
http://www.micronautomata.com/big_data
5
MapReduce Task flow
http://grepalex.com/2012/09/10/sorting-text-files-with-mapreduce/
6
MapReduce Shuffle Configurables
http://grepalex.com/2012/11/26/hadoop-shuffle-configurables/
7
Combiner: a "mini-reducer", functionally the same as the reducer. It runs on each map task (locally), reducing communication cost. Use a combiner only when the reduce function is both commutative and associative.
http://www.kalyanhadooptraining.com/2014_07_01_archive.html
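As a concrete sketch (assuming the old mapred API and a word-count-style job whose reducer sums integer counts, making it commutative and associative; the class names are illustrative):

    // Hypothetical driver fragment: reuse the sum-reducer as the combiner.
    JobConf conf = new JobConf(WordCount.class);
    conf.setReducerClass(Reduce.class);
    conf.setCombinerClass(Reduce.class);  // runs locally on each map task's output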
8
Partitioner
Divides the map output (key, value) pairs among reducers by a rule; the default strategy is hashing.
HashPartitioner
public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {
  public void configure(JobConf job) {}

  public int getPartition(K2 key, V2 value, int numReduceTasks) {
    // Mask the sign bit so the hash is non-negative before taking the modulo
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
9
Custom Partitioner Example
Input records contain name, age, sex, and score; map outputs are divided into partitions by age range.
public static class AgePartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    String[] nameAgeScore = value.toString().split("\t");
    String age = nameAgeScore[1];
    int ageInt = Integer.parseInt(age);

    // This is done to avoid performing mod with 0
    if (numReduceTasks == 0)
      return 0;

    // If the age is <= 20, assign partition 0
    if (ageInt <= 20) {
      return 0;
    }
    // Else if the age is between 21 and 50, assign partition 1
    if (ageInt > 20 && ageInt <= 50) {
      return 1 % numReduceTasks;
    }
    // Otherwise assign partition 2
    else
      return 2 % numReduceTasks;
  }
}
http://hadooptutorial.wikispaces.com/Custom+partitioner
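To take effect, the partitioner has to be registered on the job. A minimal sketch with the new mapreduce API (the job name and reducer count are illustrative assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "age-partitioned-scores");  // hypothetical job name
    job.setPartitionerClass(AgePartitioner.class);
    job.setNumReduceTasks(3);  // one reducer per age bucket: <=20, 21-50, >50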
10
Number of Maps and Reduces
The number of maps = the number of DFS blocks; adjust the DFS block size to adjust the number of maps. The right level of parallelism for maps is about 10~100 maps per node. The mapred.map.tasks parameter is just a hint.
The number of reduces: suggested values:
Set the number of reduce tasks a little less than the total number of reduce slots.
Aim for a task time between 5 and 15 minutes.
Create the fewest files possible.
conf.setNumReduceTasks(int num)
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
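A minimal sketch of setting this from a driver (the job class and the slot count of 20 are illustrative assumptions):

    // Hypothetical: a cluster with 20 reduce slots; use slightly fewer reducers
    // so all reduce tasks run in a single wave, with headroom for failures.
    JobConf conf = new JobConf(MyJob.class);
    conf.setNumReduceTasks(18);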
11
How to write MapReduce functions [1/2]
Java Word Count Example
private final static IntWritable one = new IntWritable(1);  // reusable constant count
private Text word = new Text();

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    output.collect(word, one);  // emit (word, 1)
  }
}

public void reduce(Text key, Iterator<IntWritable> values,
                   OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += values.next().get();  // sum the counts for this word
  }
  output.collect(key, new IntWritable(sum));
}
http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
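For completeness, a driver in the style of the same Hadoop 1.2.1 tutorial (assuming the map and reduce functions above are declared in inner classes Map and Reduce of a WordCount class):

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);  // the reducer doubles as a combiner
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }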
12
How to write MapReduce functions [2/2]
Python Word Count Example
Mapper.py
#!/usr/bin/python
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print "%s\t%d" % (word, 1)
How to execute
bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
  -files /home/hduser/Mapper.py,/home/hduser/Reducer.py \
  -mapper /home/hduser/Mapper.py \
  -reducer /home/hduser/Reducer.py \
  -input /input/count_of_monte_cristo.txt \
  -output /output
Reducer.py
#!/usr/bin/python
import sys

current_word = None
current_count = 1

for line in sys.stdin:
    word, count = line.strip().split('\t')
    if current_word:
        if word == current_word:
            current_count += int(count)
        else:
            print "%s\t%d" % (current_word, current_count)
            current_count = 1
    current_word = word

# Flush the last word
if current_word:
    print "%s\t%d" % (current_word, current_count)
http://dogdogfish.com/2014/05/19/hadoop-wordcount-in-python/
13
Hive & Stream Data Processing: Storm
Hadoop Ecosystem
14
The World of Big Data Tools
[Diagram, from Bingjing Zhang: programming models (MapReduce model, DAG model, Graph model, BSP/Collective model) against use cases (for iterations/learning, for query, for streaming), covering Hadoop, MPI, HaLoop, Twister, Spark, Harp, Flink, REEF, Dryad/DryadLINQ, Pig/Pig Latin, Hive, Tez, Spark SQL (Shark), MRQL, S4, Storm, Samza, Spark Streaming, Drill, Giraph, Hama, GraphLab, GraphX]
15
Hive
Data warehousing on top of Hadoop
Designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data
HiveQL statements are automatically translated into MapReduce jobs
16
Advantages
Higher-level query language: simplifies working with large amounts of data.
Lower learning curve than Pig or MapReduce: HiveQL is much closer to SQL than Pig, with less trial and error.
17
Disadvantages
Updating data is complicated, mainly because of using HDFS: you can add records and overwrite partitions.
No real-time access to data; use other means like HBase or Impala.
High latency.
18
Hive Architecture
19
Metastore
20
Compiler: Parser → Semantic Analyzer → Logical Plan Generator → Query Plan Generator
21
Hive Architecture
22
HiveQL
While based on SQL, HiveQL does not strictly follow the full SQL-92 standard.
HiveQL offers extensions not in SQL, including multi-table inserts and create table as select, but only basic support for indexes.
HiveQL lacks support for transactions and materialized views, and has only limited subquery support.
Support for insert, update, and delete with full ACID functionality was made available with release 0.14.
23
Datatypes in Hive
Primitive datatypes: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING
24
HiveQL – Group By
HiveQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
pv_users (pageid, age):
  1      25
  2      25
  1      32
  2      25
  3      27
  1      21
  ...    ...
  18570  30
  18570  26

pageid_age_sum (pageid, age, count):
  1      25  1
  1      32  1
  1      21  1
  2      25  2
  3      27  1
  ...    ...
  18570  30  1
  18570  26  1
25
HiveQL – Group By in MapReduce
Map (each mapper emits key <pageid, age> with value 1):
  Mapper 1 input: (1, 25), (2, 25), (1, 32)  →  output: <1,25>:1, <2,25>:1, <1,32>:1
  Mapper 2 input: (2, 25), (3, 27), (1, 21)  →  output: <2,25>:1, <3,27>:1, <1,21>:1
  Mapper 3 input: (18570, 30), (18570, 26), ...  →  output: <18570,30>:1, <18570,26>:1, ...

Shuffle (pairs with the same key are routed to the same reducer):
  Reducer 1 receives: <1,25>:1, <1,32>:1, <1,21>:1
  Reducer 2 receives: <2,25>:1, <2,25>:1, <3,27>:1
  Reducer 3 receives: <18570,30>:1, <18570,26>:1, ...

Reduce (sum the values for each key):
  Reducer 1 output: (1, 25, 1), (1, 32, 1), (1, 21, 1)
  Reducer 2 output: (2, 25, 2), (3, 27, 1)
  Reducer 3 output: (18570, 30, 1), (18570, 26, 1), ...
26
Stream Data Processing
27
Distributed Stream Processing Engine
Stream data: an unbounded sequence of event tuples, e.g., sensor data, stock trading data, web traffic data, ...
Since large volumes of data flow from many sources, centralized systems can no longer process them in real time.
28
Distributed Stream Processing Engine
General stream processing model: stream processing handles data before storing it.
cf. Batch systems (like Hadoop) process data after storing it.
Processing Element (PE): a processing unit in a stream engine.
Generally, a stream processing engine creates a logical network of processing elements (PEs) connected in a directed acyclic graph (DAG).
29
Distributed Stream Processing Engine
30
DSPE Systems
Apache Storm (current release: 0.10): developed by Twitter; donated to the Apache Software Foundation in 2013; pull-based messaging. http://storm.apache.org/
Apache S4 (current release: 0.6): developed by Yahoo; donated to the Apache Software Foundation in 2011; S4 stands for Simple Scalable Streaming System; push-based messaging. http://incubator.apache.org/s4/
Apache Samza (current release: 0.9): developed by LinkedIn; donated to the Apache Software Foundation in 2013; messaging via a message broker (Kafka). http://samza.apache.org/
31
Apache Storm
System Architecture
32
Apache Storm
Topology: a PE DAG on Storm.
Spout: the starting point of a data stream; it can listen on an HTTP port or pull from a queue.
Bolt: processes incoming stream tuples. A bolt pulls messages from its upstream PE, so bolts don't take on an excessive number of messages.
Stream grouping: shuffle grouping, fields grouping, partial key grouping, all grouping, global grouping, ...
Message processing guarantee: each PE keeps an output message until the downstream PE processes the message and sends an acknowledgement.
33
Apache Storm: Spouts
Source of streams; emits a sequence of tuples into the topology.
34
Apache Storm: Bolts
Processes input streams and produces new streams.
35
Apache Storm: Topology
Network of spouts and bolts
36
Apache Storm: Task
Each spout or bolt executes as many tasks across the cluster.
37
Apache Storm: Stream grouping
Shuffle grouping: pick a random task
Fields grouping: consistent hashing on a subset of tuple fields
All grouping: send to all tasks
Global grouping: pick task with lowest id
38
Apache Storm
Supported languages: Python, Java, Clojure
Tutorial
Bolt 'exclaim1' appends the string "!!" to its input. Bolt 'exclaim2' appends the string "**" to its input.
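A sketch of how this topology might be wired, modeled on storm-starter's ExclamationTopology (the suffix constructor parameter and subscribing 'exclaim2' to both 'word' and 'exclaim1' are assumptions chosen to reproduce the outputs in the figure on the next slide; TestWordSpout stands in for a spout emitting names):

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.testing.TestWordSpout;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import java.util.Map;

    class ExclamationBolt extends BaseRichBolt {
      private final String suffix;        // "!!" or "**" (assumed parameterization)
      private OutputCollector collector;

      ExclamationBolt(String suffix) { this.suffix = suffix; }

      @Override
      public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
      }

      @Override
      public void execute(Tuple tuple) {
        // Emit anchored to the input tuple, then ack it so the upstream PE
        // can release its copy (the message-processing guarantee above).
        collector.emit(tuple, new Values(tuple.getString(0) + suffix));
        collector.ack(tuple);
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
      }
    }

    // In the topology driver:
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("word", new TestWordSpout(), 10);
    builder.setBolt("exclaim1", new ExclamationBolt("!!"), 3)
           .shuffleGrouping("word");
    builder.setBolt("exclaim2", new ExclamationBolt("**"), 2)
           .shuffleGrouping("word")        // assumed: also reads raw words
           .shuffleGrouping("exclaim1");   // and the "!!"-suffixed stream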
39
Apache Storm
[Diagram: spout 'word' emits John, Bob, Rice; bolt 'exclaim1' emits John!!, Bob!!, Rice!!; bolt 'exclaim2' emits John**, Bob**, Rice** and John!!**, Bob!!**, Rice!!**]
40
References
1. Apache Hive, https://hive.apache.org/
2. Design - Apache Hive, https://cwiki.apache.org/confluence/display/Hive/Design
3. Apache Storm, https://storm.apache.org/
41
Spark: Fast, Interactive, Language-Integrated Cluster Computing
42
Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
[Diagram: acyclic data flow: Input → Map → Reduce → Output]
43
Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data: iterative algorithms (machine learning, graphs) and interactive data mining tools (R, Excel, Python).
With such frameworks, apps reload data from stable storage on each query.
44
Solution: Resilient Distributed Datasets (RDDs)
Allow apps to keep working sets in memory for efficient reuse.
Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability.
Support a wide range of applications: batch, query processing, stream processing, graph processing, machine learning.
45
RDD Operations
Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
collect, reduce, count, save, lookupKey
46
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.
lines = sc.textFile("hdfs://...")              // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count     // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker reads its HDFS block, caches its partition of messages, and returns results to the driver]
Action
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
47
RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions:

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

[Diagram: HDFS File → filter(func = _.contains(...)) → Filtered RDD → map(func = _.split(...)) → Mapped RDD]
48
Performance
[Chart: logistic regression performance]
https://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html
49
Fault Recovery
Ran K-means on a 75-node cluster; each iteration consists of 400 tasks working on 100 GB of data. The lost RDD is reconstructed using lineage.
Recovery overhead: 24 s (≈30%); lineage graph: ≤10 KB.
Matei Zaharia et al., Resilient Distributed Datasets, NSDI '12
50
Generality
Various types of applications can be built atop RDDs.
They can be combined in a single application and run on the Spark runtime.
http://spark.apache.org
51
Interactive Analytics
An interactive shell is provided: programs return results directly, so you can run ad-hoc queries.
52
Demo
WordCount in Scala API
Show the result on the shell: counts.saveAsTextFile() → counts.collect()
53
Conclusion
Performance: fast, due to caching data in memory.
Fault tolerance: fast recovery using lineage history.
Programmability: multiple languages supported; a simple, integrated programming model.
54
Up-to-date Trends
55
Up-to-date Trends
Batch + Real-time Analytics
Big-Data-as-a-Service
56
Trend 1: Batch + Real-time Analytics
Lambda Architecture
1. Data: dispatched to both the batch layer and the speed layer.
2. Batch layer: manages the master dataset (an immutable, append-only set of raw data) and pre-computes the batch views.
57
Trend 1: Batch + Real-time Analytics
Lambda Architecture
3. Serving layer: indexes the batch views so they can be queried in a low-latency, ad-hoc way.
4. Speed layer: deals with recent data only (the serving layer's update cost is high).
5. Queries are answered by merging results from the batch views and the real-time views.
58
Trend 2: Big-Data-as-a-Service
Big-Data-as-a-Service: big data analytics systems are provided as a cloud service, with a programming API and a monitoring interface; the infrastructure can also be provided as a service.
No need to worry about distributing data, resource optimization, resource provisioning, etc.; users can focus on the data itself.
59
Trend 2: Big-Data-as-a-Service
Google Cloud Dataflow
[Screenshots: Programming API and Monitoring UI]
60
References
1. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI '12
2. Apache Spark, http://spark.apache.org
3. Databricks, http://www.databricks.com
4. Lambda Architecture, http://lambda-architecture.net
5. Google Cloud Dataflow http://cloud.google.com/dataflow
61
Thank you! Questions?