Map Reduce & Hadoop
June 3, 2015
HS Oh, HR Lee, JY Choi, YS Lee, SH Choi
2
Outline
Part 1: Introduction to Hadoop; MapReduce Tutorial with Simple Example; Hadoop v2.0: YARN
Part 2: MapReduce; Hive; Stream Data Processing: Storm; Spark; Up-to-date Trends
3
MapReduce
Overview; Task flow; Shuffle configurables; Combiner; Partitioner; Custom Partitioner Example; Number of Maps and Reduces; How to write MapReduce functions
4
MapReduce Overview
http://www.micronautomata.com/big_data
5
MapReduce Task flow
http://grepalex.com/2012/09/10/sorting-text-files-with-mapreduce/
6
MapReduce Shuffle Configurables
http://grepalex.com/2012/11/26/hadoop-shuffle-configurables/
7
Combiner: a "mini-reducer", functionally the same as the reducer. It runs on each map task (locally), reducing communication cost. Use a combiner only when the reduce function is both commutative and associative.
http://www.kalyanhadooptraining.com/2014_07_01_archive.html
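As a concrete sketch (assuming the old mapred API and a word-count-style job whose reducer sums integer counts, making it commutative and associative; the class names are illustrative):

    // Hypothetical driver fragment: reuse the sum-reducer as the combiner.
    JobConf conf = new JobConf(WordCount.class);
    conf.setReducerClass(Reduce.class);
    conf.setCombinerClass(Reduce.class);  // runs locally on each map task's output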
8
Partitioner
Divides the map output (key, value) pairs among reducers by a rule; the default strategy is hashing.
HashPartitioner
public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {
  public void configure(JobConf job) {}

  public int getPartition(K2 key, V2 value, int numReduceTasks) {
    // Mask the sign bit so the hash is non-negative before taking the modulo
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
9
Custom Partitioner Example
Input records contain name, age, sex, and score; map outputs are divided into partitions by age range.
public static class AgePartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    String[] nameAgeScore = value.toString().split("\t");
    String age = nameAgeScore[1];
    int ageInt = Integer.parseInt(age);

    // This is done to avoid performing mod with 0
    if (numReduceTasks == 0)
      return 0;

    // If the age is <= 20, assign partition 0
    if (ageInt <= 20) {
      return 0;
    }
    // Else if the age is between 21 and 50, assign partition 1
    if (ageInt > 20 && ageInt <= 50) {
      return 1 % numReduceTasks;
    }
    // Otherwise assign partition 2
    else
      return 2 % numReduceTasks;
  }
}
http://hadooptutorial.wikispaces.com/Custom+partitioner
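To take effect, the partitioner has to be registered on the job. A minimal sketch with the new mapreduce API (the job name and reducer count are illustrative assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "age-partitioned-scores");  // hypothetical job name
    job.setPartitionerClass(AgePartitioner.class);
    job.setNumReduceTasks(3);  // one reducer per age bucket: <=20, 21-50, >50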
10
Number of Maps and Reduces
The number of maps = the number of DFS blocks; adjust the DFS block size to adjust the number of maps. The right level of parallelism for maps is about 10~100 maps per node. The mapred.map.tasks parameter is just a hint.
The number of reduces: suggested values:
Set the number of reduce tasks a little less than the total number of reduce slots.
Aim for a task time between 5 and 15 minutes.
Create the fewest files possible.
conf.setNumReduceTasks(int num)
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
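A minimal sketch of setting this from a driver (the job class and the slot count of 20 are illustrative assumptions):

    // Hypothetical: a cluster with 20 reduce slots; use slightly fewer reducers
    // so all reduce tasks run in a single wave, with headroom for failures.
    JobConf conf = new JobConf(MyJob.class);
    conf.setNumReduceTasks(18);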
11
How to write MapReduce functions [1/2]
Java Word Count Example
private final static IntWritable one = new IntWritable(1);  // reusable constant count
private Text word = new Text();

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    output.collect(word, one);  // emit (word, 1)
  }
}

public void reduce(Text key, Iterator<IntWritable> values,
                   OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += values.next().get();  // sum the counts for this word
  }
  output.collect(key, new IntWritable(sum));
}
http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
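For completeness, a driver in the style of the same Hadoop 1.2.1 tutorial (assuming the map and reduce functions above are declared in inner classes Map and Reduce of a WordCount class):

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);  // the reducer doubles as a combiner
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }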
12
How to write MapReduce functions [2/2]
Python Word Count Example
Mapper.py
#!/usr/bin/python
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print "%s\t%d" % (word, 1)
How to execute
bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
  -files /home/hduser/Mapper.py,/home/hduser/Reducer.py \
  -mapper /home/hduser/Mapper.py \
  -reducer /home/hduser/Reducer.py \
  -input /input/count_of_monte_cristo.txt \
  -output /output
Reducer.py
#!/usr/bin/python
import sys

current_word = None
current_count = 1

for line in sys.stdin:
    word, count = line.strip().split('\t')
    if current_word:
        if word == current_word:
            current_count += int(count)
        else:
            print "%s\t%d" % (current_word, current_count)
            current_count = 1
    current_word = word

# Flush the last word
if current_word:
    print "%s\t%d" % (current_word, current_count)
http://dogdogfish.com/2014/05/19/hadoop-wordcount-in-python/
13
Hive & Stream Data Processing: Storm
Hadoop Ecosystem
14
The World of Big Data Tools
[Diagram, from Bingjing Zhang: programming models (MapReduce model, DAG model, Graph model, BSP/Collective model) against use cases (for iterations/learning, for query, for streaming), covering Hadoop, MPI, HaLoop, Twister, Spark, Harp, Flink, REEF, Dryad/DryadLINQ, Pig/Pig Latin, Hive, Tez, Spark SQL (Shark), MRQL, S4, Storm, Samza, Spark Streaming, Drill, Giraph, Hama, GraphLab, GraphX]
15
Hive
Data warehousing on top of Hadoop
Designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data
HiveQL statements are automatically translated into MapReduce jobs
16
Advantages
Higher-level query language: simplifies working with large amounts of data.
Lower learning curve than Pig or MapReduce: HiveQL is much closer to SQL than Pig, with less trial and error.
17
Disadvantages
Updating data is complicated, mainly because of using HDFS: you can add records and overwrite partitions.
No real-time access to data; use other means like HBase or Impala.
High latency.
18
Hive Architecture
19
Metastore
20
Compiler: Parser → Semantic Analyzer → Logical Plan Generator → Query Plan Generator
21
Hive Architecture
22
HiveQL
While based on SQL, HiveQL does not strictly follow the full SQL-92 standard.
HiveQL offers extensions not in SQL, including multi-table inserts and create table as select, but only basic support for indexes.
HiveQL lacks support for transactions and materialized views, and has only limited subquery support.
Support for insert, update, and delete with full ACID functionality was made available with release 0.14.
23
Datatypes in Hive
Primitive datatypes: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING
24
HiveQL – Group By
HiveQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
pv_users (pageid, age):
  1      25
  2      25
  1      32
  2      25
  3      27
  1      21
  ...    ...
  18570  30
  18570  26

pageid_age_sum (pageid, age, count):
  1      25  1
  1      32  1
  1      21  1
  2      25  2
  3      27  1
  ...    ...
  18570  30  1
  18570  26  1
25
HiveQL – Group By in MapReduce
Map (each mapper emits key <pageid, age> with value 1):
  Mapper 1 input: (1, 25), (2, 25), (1, 32)  →  output: <1,25>:1, <2,25>:1, <1,32>:1
  Mapper 2 input: (2, 25), (3, 27), (1, 21)  →  output: <2,25>:1, <3,27>:1, <1,21>:1
  Mapper 3 input: (18570, 30), (18570, 26), ...  →  output: <18570,30>:1, <18570,26>:1, ...

Shuffle (pairs with the same key are routed to the same reducer):
  Reducer 1 receives: <1,25>:1, <1,32>:1, <1,21>:1
  Reducer 2 receives: <2,25>:1, <2,25>:1, <3,27>:1
  Reducer 3 receives: <18570,30>:1, <18570,26>:1, ...

Reduce (sum the values for each key):
  Reducer 1 output: (1, 25, 1), (1, 32, 1), (1, 21, 1)
  Reducer 2 output: (2, 25, 2), (3, 27, 1)
  Reducer 3 output: (18570, 30, 1), (18570, 26, 1), ...
26
Stream Data Processing
27
Distributed Stream Processing Engine
Stream data: an unbounded sequence of event tuples, e.g., sensor data, stock trading data, web traffic data, ...
Since large volumes of data flow from many sources, centralized systems can no longer process them in real time.
28
Distributed Stream Processing Engine
General stream processing model: stream processing handles data before storing it.
cf. Batch systems (like Hadoop) process data after storing it.
Processing Element (PE): a processing unit in a stream engine.
Generally, a stream processing engine creates a logical network of processing elements (PEs) connected in a directed acyclic graph (DAG).
29
Distributed Stream Processing Engine
30
DSPE Systems
Apache Storm (current release: 0.10): developed by Twitter; donated to the Apache Software Foundation in 2013; pull-based messaging. http://storm.apache.org/
Apache S4 (current release: 0.6): developed by Yahoo; donated to the Apache Software Foundation in 2011; S4 stands for Simple Scalable Streaming System; push-based messaging. http://incubator.apache.org/s4/
Apache Samza (current release: 0.9): developed by LinkedIn; donated to the Apache Software Foundation in 2013; messaging via a message broker (Kafka). http://samza.apache.org/
31
Apache Storm
System Architecture
32
Apache Storm
Topology: a PE DAG on Storm.
Spout: the starting point of a data stream; it can listen on an HTTP port or pull from a queue.
Bolt: processes incoming stream tuples. A bolt pulls messages from its upstream PE, so bolts don't take on an excessive number of messages.
Stream grouping: shuffle grouping, fields grouping, partial key grouping, all grouping, global grouping, ...
Message processing guarantee: each PE keeps an output message until the downstream PE processes the message and sends an acknowledgement.
33
Apache Storm: Spouts
Source of streams; emits a sequence of tuples into the topology.
34
Apache Storm: Bolts
Processes input streams and produces new streams.
35
Apache Storm: Topology
Network of spouts and bolts
36
Apache Storm: Task
Each spout or bolt executes as many tasks across the cluster.
37
Apache Storm: Stream grouping
Shuffle grouping: pick a random task
Fields grouping: consistent hashing on a subset of tuple fields
All grouping: send to all tasks
Global grouping: pick task with lowest id
38
Apache Storm
Supported languages: Python, Java, Clojure
Tutorial
Bolt 'exclaim1' appends the string "!!" to its input. Bolt 'exclaim2' appends the string "**" to its input.
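A sketch of how this topology might be wired, modeled on storm-starter's ExclamationTopology (the suffix constructor parameter and subscribing 'exclaim2' to both 'word' and 'exclaim1' are assumptions chosen to reproduce the outputs in the figure on the next slide; TestWordSpout stands in for a spout emitting names):

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.testing.TestWordSpout;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import java.util.Map;

    class ExclamationBolt extends BaseRichBolt {
      private final String suffix;        // "!!" or "**" (assumed parameterization)
      private OutputCollector collector;

      ExclamationBolt(String suffix) { this.suffix = suffix; }

      @Override
      public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
      }

      @Override
      public void execute(Tuple tuple) {
        // Emit anchored to the input tuple, then ack it so the upstream PE
        // can release its copy (the message-processing guarantee above).
        collector.emit(tuple, new Values(tuple.getString(0) + suffix));
        collector.ack(tuple);
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
      }
    }

    // In the topology driver:
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("word", new TestWordSpout(), 10);
    builder.setBolt("exclaim1", new ExclamationBolt("!!"), 3)
           .shuffleGrouping("word");
    builder.setBolt("exclaim2", new ExclamationBolt("**"), 2)
           .shuffleGrouping("word")        // assumed: also reads raw words
           .shuffleGrouping("exclaim1");   // and the "!!"-suffixed stream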
39
Apache Storm
[Diagram: spout 'word' emits John, Bob, Rice; bolt 'exclaim1' emits John!!, Bob!!, Rice!!; bolt 'exclaim2' emits John**, Bob**, Rice** and John!!**, Bob!!**, Rice!!**]
40
References
1. Apache Hive, https://hive.apache.org/
2. Design - Apache Hive, https://cwiki.apache.org/confluence/display/Hive/Design
3. Apache Storm, https://storm.apache.org/
41
Spark: Fast, Interactive, Language-Integrated Cluster Computing
42
Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
[Diagram: acyclic data flow: Input → Map → Reduce → Output]
43
Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data: iterative algorithms (machine learning, graphs) and interactive data mining tools (R, Excel, Python).
With such frameworks, apps reload data from stable storage on each query.
44
Solution: Resilient Distributed Datasets (RDDs)
Allow apps to keep working sets in memory for efficient reuse.
Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability.
Support a wide range of applications: batch, query processing, stream processing, graph processing, machine learning.
45
RDD Operations
Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
collect, reduce, count, save, lookupKey
46
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.
lines = sc.textFile("hdfs://...")              // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count     // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker reads its HDFS block, caches its partition of messages, and returns results to the driver]
Action
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
47
RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions:

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

[Diagram: HDFS File → filter(func = _.contains(...)) → Filtered RDD → map(func = _.split(...)) → Mapped RDD]
48
Performance
[Chart: logistic regression performance]
https://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html
49
Fault Recovery
Ran K-means on a 75-node cluster; each iteration consists of 400 tasks working on 100 GB of data. The lost RDD is reconstructed using lineage.
Recovery overhead: 24 s (≈30%); lineage graph: ≤10 KB.
Matei Zaharia et al., Resilient Distributed Datasets, NSDI '12
50
Generality
Various types of applications can be built atop RDDs.
They can be combined in a single application and run on the Spark runtime.
http://spark.apache.org
51
Interactive Analytics
An interactive shell is provided: programs return results directly, so you can run ad-hoc queries.
52
Demo
WordCount in Scala API
Show the result on the shell: counts.saveAsTextFile() → counts.collect()
53
Conclusion
Performance: fast, due to caching data in memory.
Fault tolerance: fast recovery using lineage history.
Programmability: multiple languages supported; a simple, integrated programming model.
54
Up-to-date Trends
55
Up-to-date Trends
Batch + Real-time Analytics
Big-Data-as-a-Service
56
Trend 1: Batch + Real-time Analytics
Lambda Architecture
1. Data: dispatched to both the batch layer and the speed layer.
2. Batch layer: manages the master dataset (an immutable, append-only set of raw data) and pre-computes the batch views.
57
Trend 1: Batch + Real-time Analytics
Lambda Architecture
3. Serving layer: indexes the batch views so they can be queried in a low-latency, ad-hoc way.
4. Speed layer: deals with recent data only (the serving layer's update cost is high).
5. Queries are answered by merging results from the batch views and the real-time views.
58
Trend 2: Big-Data-as-a-Service
Big-Data-as-a-Service: big data analytics systems are provided as a cloud service, with a programming API and a monitoring interface; the infrastructure can also be provided as a service.
No need to worry about distributing data, resource optimization, resource provisioning, etc.; users can focus on the data itself.
59
Trend 2: Big-Data-as-a-Service
Google Cloud Dataflow
[Screenshots: Programming API and Monitoring UI]
60
References
1. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI '12
2. Apache Spark, http://spark.apache.org
3. Databricks, http://www.databricks.com
4. Lambda Architecture, http://lambda-architecture.net
5. Google Cloud Dataflow http://cloud.google.com/dataflow
61
Thank you! Questions?