01-1 © Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Hadoop 101: Writing a Java MapReduce Program
Ian Wrigley, Sr. Curriculum Manager, Cloudera
[email protected] | @iwrigley

Njug presentation


Page 1

Page 2

And, by the way, what is Hadoop?

Why the World Needs Hadoop

Page 3

§ Every day…
 – More than 1.5 billion shares are traded on the NYSE
 – Facebook stores 2.7 billion comments and Likes

§ Every minute…
 – Foursquare handles more than 2,000 check-ins
 – TransUnion makes nearly 70,000 updates to credit files

§ And every second…
 – Banks process more than 10,000 credit card transactions

Volume

Page 4

§ We are generating data faster than ever
 – Processes are increasingly automated
 – People are increasingly interacting online
 – Systems are increasingly interconnected

Velocity

Page 5

§ We're producing a variety of data, including
 – Audio
 – Video
 – Images
 – Log files
 – Web pages
 – Product rating comments
 – Social network connections

§ Not all of this maps cleanly to the relational model

Variety

Page 6

§ One tweet is an anecdote
 – But a million tweets may signal important trends

§ One person's product review is an opinion
 – But a million reviews might uncover a design flaw

§ One person's diagnosis is an isolated case
 – But a million medical records could lead to a cure

Big Data Can Mean Big Opportunity

Page 7

A Scalable Data Processing Framework

MapReduce

Page 8

§ MapReduce is a programming model
 – It's a way of processing data

§ In Hadoop, you supply two functions to process data: Map and Reduce
 – Map: typically used to transform, parse, or filter data
 – Reduce: typically used to summarize results

§ The Map function always runs first
 – The Reduce function runs afterwards
 – The Hadoop framework performs a shuffle and sort to transfer data from the Map function to the Reduce function

§ Each piece is simple, but can be powerful when combined

What is MapReduce?
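The pieces described above can be sketched in plain Java, outside Hadoop entirely. This is a hypothetical word-count illustration of the model (not Hadoop API code): a Map step emits (key, value) pairs, a shuffle-and-sort groups them by key, and a Reduce step summarizes each group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceModel {

    // Map + shuffle-and-sort + Reduce over in-memory data:
    // count how often each word appears, in sorted key order
    public static TreeMap<String, Integer> wordCount(List<String> lines) {
        // Map: each line emits (word, 1) pairs
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                mapped.add(Map.entry(word, 1));
            }
        }

        // Shuffle and sort: group the values by key
        // (a TreeMap keeps keys in sorted order)
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add(e.getValue());
        }

        // Reduce: fold each group of values into a single result
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
        // prints {a=2, b=2, c=1}
    }
}
```

In Hadoop the grouping step is done for you by the framework; you only write the Map and Reduce parts.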

Page 9

§ … in which Ian waves his hands around and attempts to explain the MapReduce flow

MapReduce: An Example

Page 10

§ MapReduce processing in Hadoop is batch-oriented

§ Usually written in Java
 – This uses Hadoop's API directly

§ You can do basic MapReduce in other languages
 – Using the Hadoop Streaming wrapper program
 – Some advanced features require Java code

MapReduce Code for Hadoop

Page 11

§ Some (very) basic concepts:
 – Input and output data is typed
 – The framework passes each input record to the Mapper in turn
 – A record is a (key, value) pair
 – For text files:
  – The key is the byte offset of the start of the line
  – The value is the line itself
 – Output data from the Mapper is transferred to the Reducer via a process known as the shuffle and sort
 – Reducers receive (key, Iterable of values) sets, in sorted key order
 – The job is configured and executed using a driver class

Basic Java API Concepts
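For text files, the (byte offset, line) records can be illustrated with a small standalone sketch. This is a hypothetical helper, not the actual input format code; it assumes single-byte characters and '\n' line endings.

```java
import java.util.LinkedHashMap;

public class TextRecords {

    // Produce (byte offset of line start, line) pairs, in the
    // way a text input hands records to the Mapper
    public static LinkedHashMap<Long, String> records(String text) {
        LinkedHashMap<Long, String> recs = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n", -1)) {
            // skip only the empty fragment after a trailing newline
            if (!line.isEmpty() || offset < text.length()) {
                recs.put(offset, line);
            }
            offset += line.length() + 1;   // +1 for the '\n'
        }
        return recs;
    }

    public static void main(String[] args) {
        System.out.println(records("first line\nsecond\n"));
        // prints {0=first line, 11=second}
    }
}
```

The second line starts at offset 11 because "first line" is 10 bytes plus one byte for the newline.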

Page 12

   

Data Flow

Map input → Map output → (shuffle and sort) → Reduce input → Reduce output

Map input:
Nashville   J. Jones      12.95   2013-07-21
Memphis     S. Smith      66.57   2013-07-21
Nashville   T. Harding    55.35   2013-07-22
Knoxville   S. Warne      10.99   2013-07-22
Kingsport   M. Thompson   99.95   2013-07-22

Map output:
Nashville   12.95
Memphis     66.57
Nashville   55.35
Knoxville   10.99
Kingsport   99.95

Reduce input (after the shuffle and sort):
Kingsport   [99.95]
Knoxville   [10.99]
Memphis     [66.57]
Nashville   [12.95, 55.35]

Reduce output:
Kingsport   99.95
Knoxville   10.99
Memphis     66.57
Nashville   68.30
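The same flow can be traced in plain Java over the records above (a standalone sketch, not Hadoop code): the map step extracts (store, sale value), the shuffle and sort groups the values by store in sorted key order, and the reduce step sums each group, giving Nashville 12.95 + 55.35 = 68.30.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DataFlowTrace {

    // Map + shuffle and sort: extract (store, sale value) from each
    // record and group the values by store, in sorted key order
    public static TreeMap<String, List<Double>> shuffle(List<String[]> records) {
        TreeMap<String, List<Double>> grouped = new TreeMap<>();
        for (String[] r : records) {
            grouped.computeIfAbsent(r[0], k -> new ArrayList<>())
                   .add(Double.parseDouble(r[2]));
        }
        return grouped;
    }

    // Reduce: sum the grouped values for each store
    public static TreeMap<String, Double> reduce(TreeMap<String, List<Double>> grouped) {
        TreeMap<String, Double> totals = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            double sum = 0;
            for (double v : e.getValue()) {
                sum += v;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String[]> input = Arrays.asList(
            new String[]{"Nashville", "J. Jones",    "12.95", "2013-07-21"},
            new String[]{"Memphis",   "S. Smith",    "66.57", "2013-07-21"},
            new String[]{"Nashville", "T. Harding",  "55.35", "2013-07-22"},
            new String[]{"Knoxville", "S. Warne",    "10.99", "2013-07-22"},
            new String[]{"Kingsport", "M. Thompson", "99.95", "2013-07-22"});
        System.out.println(reduce(shuffle(input)));
    }
}
```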

Page 13

Java MR Job Example: Mapper

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StoreSalesMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

Input key and value types: LongWritable, Text
Output key and value types: Text, DoubleWritable

Page 14

Java MR Job Example: Mapper

/*
 * The map method is invoked once for each line of text in the
 * input data. The method receives a key of type LongWritable
 * (which corresponds to the byte offset in the current input
 * file), a value of type Text (representing the line of input
 * data), and a Context object (which allows us to print status
 * messages, among other things).
 */
@Override
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {

Page 15

Java MR Job Example: Mapper

// convert value to a Java String
String line = value.toString();

// ignore empty lines (defensive programming!)
if (line.trim().isEmpty()) {
  return;
}

// split the record into fields
String[] fields = line.split("\t");

// ensure this line is not malformed (even more defensive programming!)
if (fields.length != 4) {
  return;
}

Page 16

Java MR Job Example: Mapper

// extract fields based on position
String storeName = fields[0];
Double saleValue = Double.parseDouble(fields[2]);

// output key and value
context.write(new Text(storeName),
    new DoubleWritable(saleValue));
  }
}
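The Mapper's per-line logic can be exercised outside Hadoop with a standalone replica (a hypothetical `parse` helper, not part of the slide code): it applies the same empty-line and field-count checks and returns the (store name, sale value) pair, or null for a record the Mapper would skip.

```java
import java.util.Arrays;

public class ParseCheck {

    // Mirrors the Mapper body: skip blank or malformed lines,
    // otherwise extract {storeName, saleValue} by position
    public static Object[] parse(String line) {
        if (line.trim().isEmpty()) {
            return null;                       // empty line: ignored
        }
        String[] fields = line.split("\t");
        if (fields.length != 4) {
            return null;                       // malformed record: ignored
        }
        return new Object[] { fields[0], Double.parseDouble(fields[2]) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            parse("Nashville\tJ. Jones\t12.95\t2013-07-21")));
        // prints [Nashville, 12.95]
    }
}
```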

Page 17

Java MR Job Example: Reducer

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer
    extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

Input key and value types: Text, DoubleWritable
Output key and value types: Text, DoubleWritable

Page 18

Java MR Job Example: Reducer

/*
 * The reduce method is invoked once for each key received from
 * the shuffle and sort phase of the MapReduce framework.
 * The method receives a key of type Text (representing the key),
 * a set of values of type DoubleWritable, and a Context object.
 */
@Override
public void reduce(Text key, Iterable<DoubleWritable> values,
    Context context)
    throws IOException, InterruptedException {

Page 19

Java MR Job Example: Reducer

// used to sum up the store sales
double sum = 0;

// add to it for each new value received
for (DoubleWritable value : values) {
  sum += value.get();
}

// our output is the store name (key) and the sum (value)
context.write(key, new DoubleWritable(sum));
  }
}

Page 20

Java MR Job Example: Driver

package com.cloudera.example;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

// The driver is just a regular Java class with a "main" method
public class StoreSales {

  public static void main(String[] args) throws Exception {

Page 21

Java MR Job Example: Driver

// validate command line arguments (we require the user
// to specify the HDFS paths to use for the job; see below)
if (args.length != 2) {
  System.out.printf("Usage: Driver <input dir> <output dir>\n");
  System.exit(-1);
}

// Instantiate a Job object for our job's configuration.
Job job = new Job();

// configure input and output paths based on supplied arguments
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Page 22

Java MR Job Example: Driver

// tells Hadoop to copy the JAR containing this class
// to cluster nodes, as required to run this job
job.setJarByClass(StoreSales.class);

// give the job a descriptive name. This is optional, but
// helps us identify this job on a busy cluster
job.setJobName("Store Sale Aggregator");

// specify which classes to use for the Mapper and Reducer
job.setMapperClass(StoreSalesMapper.class);
job.setReducerClass(SumReducer.class);

Page 23

Java MR Job Example: Driver

// specify the Mapper's output key and value classes
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(DoubleWritable.class);

// specify the job's output key and value classes
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);

// start the MapReduce job and wait for it to finish.
// if it finishes successfully, return 0; otherwise 1.
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
  }
}

Page 24

§ And now… the program actually running on a pseudo-distributed cluster

Demo

Page 25

§ Obviously there's much more to the Hadoop API than this
 – Partitioners
 – Combiners
 – Custom Writables, custom WritableComparables
 – DistributedCache
 – Counters
 – Etc., etc., etc.

§ …but even with just this amount of knowledge, you could write real-world Hadoop applications

Conclusion

Page 26

§ Helps companies profit from all their data
 – Founded by experts from Facebook, Google, Oracle, and Yahoo

§ We offer products and services for large-scale data analysis
 – Software (CDH distribution and Cloudera Manager)
 – Consulting and support services
 – Training and certification

§ Want to attend a training course? Use the code Nashville_15 for 15% off any Cloudera-delivered class

About Cloudera

Page 27