01-1 © Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Hadoop 101: Writing a Java MapReduce Program
Ian Wrigley, Sr. Curriculum Manager, Cloudera
[email protected] | @iwrigley

Njug presentation


Page 1

Page 2

And, by the way, what is Hadoop?

Why the World Needs Hadoop

Page 3

§ Every day…
 – More than 1.5 billion shares are traded on the NYSE
 – Facebook stores 2.7 billion comments and Likes

§ Every minute…
 – Foursquare handles more than 2,000 check-ins
 – TransUnion makes nearly 70,000 updates to credit files

§ And every second…
 – Banks process more than 10,000 credit card transactions

Volume

Page 4

§ We are generating data faster than ever
 – Processes are increasingly automated
 – People are increasingly interacting online
 – Systems are increasingly interconnected

Velocity

Page 5

§ We're producing a variety of data, including
 – Audio
 – Video
 – Images
 – Log files
 – Web pages
 – Product rating comments
 – Social network connections

§ Not all of this maps cleanly to the relational model

Variety

Page 6

§ One tweet is an anecdote
 – But a million tweets may signal important trends

§ One person's product review is an opinion
 – But a million reviews might uncover a design flaw

§ One person's diagnosis is an isolated case
 – But a million medical records could lead to a cure

Big Data Can Mean Big Opportunity

Page 7

A Scalable Data Processing Framework

MapReduce

Page 8

§ MapReduce is a programming model
 – It's a way of processing data

§ In Hadoop, you supply two functions to process data: Map and Reduce
 – Map: typically used to transform, parse, or filter data
 – Reduce: typically used to summarize results

§ The Map function always runs first
 – The Reduce function runs afterwards
 – The Hadoop framework performs a shuffle and sort to transfer data from the Map function to the Reduce function

§ Each piece is simple, but can be powerful when combined

What is MapReduce?
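The pieces described above can be sketched in plain Java, outside Hadoop entirely. This is a hypothetical word-count illustration of the model (not Hadoop API code): a Map step emits (key, value) pairs, a shuffle-and-sort groups them by key, and a Reduce step summarizes each group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceModel {

    // Map + shuffle-and-sort + Reduce over in-memory data:
    // count how often each word appears, in sorted key order
    public static TreeMap<String, Integer> wordCount(List<String> lines) {
        // Map: each line emits (word, 1) pairs
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                mapped.add(Map.entry(word, 1));
            }
        }

        // Shuffle and sort: group the values by key
        // (a TreeMap keeps keys in sorted order)
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add(e.getValue());
        }

        // Reduce: fold each group of values into a single result
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
        // prints {a=2, b=2, c=1}
    }
}
```

In Hadoop the grouping step is done for you by the framework; you only write the Map and Reduce parts.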

Page 9

§ … in which Ian waves his hands around and attempts to explain the MapReduce flow

MapReduce: An Example

Page 10

§ MapReduce processing in Hadoop is batch-oriented

§ Usually written in Java
 – This uses Hadoop's API directly

§ You can do basic MapReduce in other languages
 – Using the Hadoop Streaming wrapper program
 – Some advanced features require Java code

MapReduce Code for Hadoop

Page 11

§ Some (very) basic concepts:
 – Input and output data is typed
 – The framework passes each input record to the Mapper in turn
 – A record is a (key, value) pair
 – For text files:
  – The key is the byte offset of the start of the line
  – The value is the line itself
 – Output data from the Mapper is transferred to the Reducer via a process known as the shuffle and sort
 – Reducers receive (key, Iterable of values) sets, in sorted key order
 – The job is configured and executed using a driver class

Basic Java API Concepts
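For text files, the (byte offset, line) records can be illustrated with a small standalone sketch. This is a hypothetical helper, not the actual input format code; it assumes single-byte characters and '\n' line endings.

```java
import java.util.LinkedHashMap;

public class TextRecords {

    // Produce (byte offset of line start, line) pairs, in the
    // way a text input hands records to the Mapper
    public static LinkedHashMap<Long, String> records(String text) {
        LinkedHashMap<Long, String> recs = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n", -1)) {
            // skip only the empty fragment after a trailing newline
            if (!line.isEmpty() || offset < text.length()) {
                recs.put(offset, line);
            }
            offset += line.length() + 1;   // +1 for the '\n'
        }
        return recs;
    }

    public static void main(String[] args) {
        System.out.println(records("first line\nsecond\n"));
        // prints {0=first line, 11=second}
    }
}
```

The second line starts at offset 11 because "first line" is 10 bytes plus one byte for the newline.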

Page 12

   

Data Flow

Map input → Map output → (shuffle and sort) → Reduce input → Reduce output

Map input:
Nashville   J. Jones      12.95   2013-07-21
Memphis     S. Smith      66.57   2013-07-21
Nashville   T. Harding    55.35   2013-07-22
Knoxville   S. Warne      10.99   2013-07-22
Kingsport   M. Thompson   99.95   2013-07-22

Map output:
Nashville   12.95
Memphis     66.57
Nashville   55.35
Knoxville   10.99
Kingsport   99.95

Reduce input (after the shuffle and sort):
Kingsport   [99.95]
Knoxville   [10.99]
Memphis     [66.57]
Nashville   [12.95, 55.35]

Reduce output:
Kingsport   99.95
Knoxville   10.99
Memphis     66.57
Nashville   68.30
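The same flow can be traced in plain Java over the records above (a standalone sketch, not Hadoop code): the map step extracts (store, sale value), the shuffle and sort groups the values by store in sorted key order, and the reduce step sums each group, giving Nashville 12.95 + 55.35 = 68.30.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DataFlowTrace {

    // Map + shuffle and sort: extract (store, sale value) from each
    // record and group the values by store, in sorted key order
    public static TreeMap<String, List<Double>> shuffle(List<String[]> records) {
        TreeMap<String, List<Double>> grouped = new TreeMap<>();
        for (String[] r : records) {
            grouped.computeIfAbsent(r[0], k -> new ArrayList<>())
                   .add(Double.parseDouble(r[2]));
        }
        return grouped;
    }

    // Reduce: sum the grouped values for each store
    public static TreeMap<String, Double> reduce(TreeMap<String, List<Double>> grouped) {
        TreeMap<String, Double> totals = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            double sum = 0;
            for (double v : e.getValue()) {
                sum += v;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String[]> input = Arrays.asList(
            new String[]{"Nashville", "J. Jones",    "12.95", "2013-07-21"},
            new String[]{"Memphis",   "S. Smith",    "66.57", "2013-07-21"},
            new String[]{"Nashville", "T. Harding",  "55.35", "2013-07-22"},
            new String[]{"Knoxville", "S. Warne",    "10.99", "2013-07-22"},
            new String[]{"Kingsport", "M. Thompson", "99.95", "2013-07-22"});
        System.out.println(reduce(shuffle(input)));
    }
}
```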

Page 13

Java MR Job Example: Mapper

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StoreSalesMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

Input key and value types: LongWritable, Text
Output key and value types: Text, DoubleWritable

Page 14

Java MR Job Example: Mapper

/*
 * The map method is invoked once for each line of text in the
 * input data. The method receives a key of type LongWritable
 * (which corresponds to the byte offset in the current input
 * file), a value of type Text (representing the line of input
 * data), and a Context object (which allows us to print status
 * messages, among other things).
 */
@Override
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {

Page 15

Java MR Job Example: Mapper

// convert value to a Java String
String line = value.toString();

// ignore empty lines (defensive programming!)
if (line.trim().isEmpty()) {
  return;
}

// split the record into fields
String[] fields = line.split("\t");

// ensure this line is not malformed (even more defensive programming!)
if (fields.length != 4) {
  return;
}

Page 16

Java MR Job Example: Mapper

// extract fields based on position
String storeName = fields[0];
Double saleValue = Double.parseDouble(fields[2]);

// output key and value
context.write(new Text(storeName),
    new DoubleWritable(saleValue));
  }
}
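The Mapper's per-line logic can be exercised outside Hadoop with a standalone replica (a hypothetical `parse` helper, not part of the slide code): it applies the same empty-line and field-count checks and returns the (store name, sale value) pair, or null for a record the Mapper would skip.

```java
import java.util.Arrays;

public class ParseCheck {

    // Mirrors the Mapper body: skip blank or malformed lines,
    // otherwise extract {storeName, saleValue} by position
    public static Object[] parse(String line) {
        if (line.trim().isEmpty()) {
            return null;                       // empty line: ignored
        }
        String[] fields = line.split("\t");
        if (fields.length != 4) {
            return null;                       // malformed record: ignored
        }
        return new Object[] { fields[0], Double.parseDouble(fields[2]) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            parse("Nashville\tJ. Jones\t12.95\t2013-07-21")));
        // prints [Nashville, 12.95]
    }
}
```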

Page 17

Java MR Job Example: Reducer

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer
    extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

Input key and value types: Text, DoubleWritable
Output key and value types: Text, DoubleWritable

Page 18

Java MR Job Example: Reducer

/*
 * The reduce method is invoked once for each key received from
 * the shuffle and sort phase of the MapReduce framework.
 * The method receives a key of type Text (representing the key),
 * a set of values of type DoubleWritable, and a Context object.
 */
@Override
public void reduce(Text key, Iterable<DoubleWritable> values,
    Context context)
    throws IOException, InterruptedException {

Page 19

Java MR Job Example: Reducer

// used to sum up the store sales
double sum = 0;

// add to it for each new value received
for (DoubleWritable value : values) {
  sum += value.get();
}

// our output is the store name (key) and the sum (value)
context.write(key, new DoubleWritable(sum));
  }
}

Page 20

Java MR Job Example: Driver

package com.cloudera.example;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

// The driver is just a regular Java class with a "main" method
public class StoreSales {

  public static void main(String[] args) throws Exception {

Page 21

Java MR Job Example: Driver

// validate command line arguments (we require the user
// to specify the HDFS paths to use for the job; see below)
if (args.length != 2) {
  System.out.printf("Usage: Driver <input dir> <output dir>\n");
  System.exit(-1);
}

// Instantiate a Job object for our job's configuration.
Job job = new Job();

// configure input and output paths based on supplied arguments
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Page 22

Java MR Job Example: Driver

// tells Hadoop to copy the JAR containing this class
// to cluster nodes, as required to run this job
job.setJarByClass(StoreSales.class);

// give the job a descriptive name. This is optional, but
// helps us identify this job on a busy cluster
job.setJobName("Store Sale Aggregator");

// specify which classes to use for the Mapper and Reducer
job.setMapperClass(StoreSalesMapper.class);
job.setReducerClass(SumReducer.class);

Page 23

Java MR Job Example: Driver

// specify the Mapper's output key and value classes
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(DoubleWritable.class);

// specify the job's output key and value classes
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);

// start the MapReduce job and wait for it to finish.
// if it finishes successfully, return 0; otherwise 1.
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
  }
}

Page 24

§ And now… the program actually running on a pseudo-distributed cluster

Demo

Page 25

§ Obviously there's much more to the Hadoop API than this
 – Partitioners
 – Combiners
 – Custom Writables, custom WritableComparables
 – DistributedCache
 – Counters
 – Etc., etc., etc.

§ …but even with just this amount of knowledge, you could write real-world Hadoop applications

Conclusion

Page 26

§ Helps companies profit from all their data
 – Founded by experts from Facebook, Google, Oracle, and Yahoo

§ We offer products and services for large-scale data analysis
 – Software (CDH distribution and Cloudera Manager)
 – Consulting and support services
 – Training and certification

§ Want to attend a training course? Use the code Nashville_15 for 15% off any Cloudera-delivered class

About Cloudera

Page 27