COSC 6397
Big Data Analytics
Advanced MapReduce
Edgar Gabriel
Spring 2015
Basic statistical operations
• Calculating minimum, maximum, mean, median, standard deviation
• Data is typically multi-dimensional -> analytics can be based on one or more dimensions of the data
– parallelism can only be exploited on the map side
– a single reducer is often required
Image source: Hadoop MapReduce Cookbook, chapter 5.
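The one-pass aggregation that the single reducer performs can be sketched outside Hadoop. Below is a minimal pure-Java sketch (class and method names are mine, not from the slides) computing min, max, mean and population standard deviation in a single pass over the values:

```java
import java.util.List;

// Sketch of the aggregation a single reducer would perform: one pass
// over all values, computing min, max, mean and standard deviation.
public class BasicStats {
    public static double[] minMaxMeanStd(List<Double> values) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        double sum = 0.0, sumSq = 0.0;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
            sumSq += v * v;   // running sum of squares for the std deviation
        }
        int n = values.size();
        double mean = sum / n;
        // population standard deviation: sqrt(E[x^2] - E[x]^2)
        double std = Math.sqrt(sumSq / n - mean * mean);
        return new double[] { min, max, mean, std };
    }
}
```

In a real job, each mapper would emit partial aggregates (count, sum, sum of squares, local min/max) and the single reducer would merge them; the median is harder, since it cannot be computed from such partial aggregates.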
Group-by operations
• Calculate basic operations by group
– makes it possible to utilize more than one reducer
– grouping is based on the key of the mapper step
• Example: calculate the number of accesses to a web page based on a log file
Image source: Hadoop MapReduce Cookbook, chapter 5.
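The group-by logic can be sketched in plain Java (names are illustrative, not from the slides): the "map" step extracts the URL as key, the "reduce" step sums the ones emitted per key; here both are collapsed into one loop.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Pure-Java sketch of a group-by count: one "reduce" sum per distinct key.
public class GroupByCount {
    public static Map<String, Integer> accessesPerPage(List<String> requestedUrls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : requestedUrls) {
            counts.merge(url, 1, Integer::sum);  // reduce: sum the 1s emitted per URL
        }
        return counts;
    }
}
```

In the real job, records with the same URL key are routed to the same reducer, so several reducers can count disjoint groups in parallel.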
Frequency distributions
• Arrangement of values that one or more variables take
in a sample
• Each entry in the table contains the number of
occurrences of values within a particular group
• Table summarizes the distribution of values in the
sample
• Example:
– Analyze the log file of a web server
– Sort the number of hits received by each URL in
ascending order
– Input example: 205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985
Frequency distributions
• First MapReduce job counts the number of occurrences
of a URL
– Result of the MapReduce job: a file containing the list of
<URL> <no. of occurrences>
• Second MapReduce job
– Use the output of first MapReduce job as input
– Mapper: use <no. of occurrences> as key and <URL> as value
– Reducer: write each <URL> together with its <no. of occurrences> to the output file
• Sorting by the number of occurrences is done implicitly by the Hadoop framework
Example output
Image source: Hadoop MapReduce Cookbook, chapter 5.
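The effect of the second job can be sketched in plain Java (names are mine): treating the count as the key means the framework's shuffle sorts by it; here the sort is done explicitly on the output of the first job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the second job's effect: sort (URL, count) pairs in
// ascending order of the count, as the shuffle would when the count
// is used as the intermediate key.
public class FrequencySort {
    public static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.comparingByValue());  // ascending by count
        return entries;
    }
}
```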
Histograms
• Graphical representation of the distribution of data
• Estimate of the probability distribution of a continuous
variable
• Representation of tabulated frequencies, shown as
adjacent rectangles, erected over discrete intervals
– area proportional to the frequency of the observations in
the interval
• Example:
– Determine the number of accesses to the web server per
hour
Image source: Hadoop MapReduce Cookbook, chapter 5.
Histograms
• Map step uses the hour as the key and ‘one’ as the
value
• Reducer sums up the number of occurrences for each
hour
Image source: Hadoop MapReduce Cookbook, chapter 5.
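The map step's hour extraction and the reducer's summation can be sketched in plain Java. The sketch assumes Common Log Format timestamps of the shape "dd/MMM/yyyy:HH:mm:ss"; class and method names are mine.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the histogram job: parse the hour out of each timestamp
// ("map" emits (hour, 1)), then sum the ones per hour ("reduce").
public class HourHistogram {
    public static int hourOf(String clfTimestamp) {
        // "01/Jul/1995:13:05:12" -> the field after the first ':' is the hour
        return Integer.parseInt(clfTimestamp.split(":")[1]);
    }

    public static Map<Integer, Integer> accessesPerHour(List<String> timestamps) {
        Map<Integer, Integer> histogram = new TreeMap<>();  // sorted by hour
        for (String ts : timestamps) {
            histogram.merge(hourOf(ts), 1, Integer::sum);
        }
        return histogram;
    }
}
```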
Histograms
[Bar chart: number of web-server accesses per hour of the day (1-24), y-axis from 0 to 140,000]
Scatter Plots
• A scatter plot uses Cartesian coordinates to display the values of two variables for a set of data
• Typically used when one variable is under the control of the experimenter
– a parameter that is systematically incremented and/or decremented by the experimenter
• also called the control parameter or independent variable
• typically plotted along the horizontal axis
– The measured or dependent variable is customarily plotted along the vertical axis
Scatter Plots
• Example: analyze the data to find the relationship between the size of a web page and the number of hits it receives
Image source: Hadoop MapReduce Cookbook, chapter 5.
Scatter Plots
Image source: Hadoop MapReduce Cookbook, chapter 5.
Secondary Sorting
• MapReduce sorts intermediate key-value pairs by their keys during the shuffle-and-sort phase
• Sometimes, additional sorting based on the values would be useful
• Example: data from sensors
– Intermediate key-value pair:
key = mi, value = ( tj, ri)
with mi being a sensor id
tj being a time stamp
ri being the actual value
– The order of values for a given key is not guaranteed to be in increasing order of the timestamps
Secondary Sorting (II)
• Solution: encode sensor id and time stamp in the key
key = mi, value = (tj, ri)  ->  key = (mi:tj), value = ri
but we need to ensure that all keys containing mi end up in the same reduce call!
• Implement three classes:
– Partitioner: which keys are sent to which reducers
– SortComparator: decides how map output keys are sorted
– GroupComparator: decides which map output keys go to the
same reduce method call
job.setPartitionerClass(SensorPartitioner.class);
job.setGroupingComparatorClass(KeyGroupingComparator.class);
job.setSortComparatorClass(CompositeKeyComparator.class);
Partitioner
public static class SensorPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numReducers) {
    // key has the form mi:tj -> partition on the sensor id (part before ':') only
    String sensorId = key.toString().split(":")[0];
    // hashCode instead of Integer.parseInt, so non-numeric ids like "s1" also work
    return (sensorId.hashCode() & Integer.MAX_VALUE) % numReducers;
  }
}
• Data from one sensor will end up at the same reducer
• Since the keys are still different, the reduce method
will still be invoked separately for each key = (mi:tj)
SortComparator: determine order in which
keys are presented to the reducer
public class CompositeKeyComparator extends WritableComparator {
  protected CompositeKeyComparator() {
    super(Text.class, true);
  }
  @Override
  public int compare(WritableComparable w1, WritableComparable w2) {
    String[] t1 = w1.toString().split(":");
    String[] t2 = w2.toString().split(":");
    // sorting based on sensor id
    int result = t1[0].compareTo(t2[0]);
    if (result == 0) {
      // sorting based on time stamp
      result = Integer.compare(Integer.parseInt(t1[1]), Integer.parseInt(t2[1]));
    }
    return result;
  }
}
GroupComparator: determine which keys are
grouped together in a single call to a reducer
public class KeyGroupingComparator extends WritableComparator {
  protected KeyGroupingComparator() {
    super(Text.class, true);
  }
  @Override
  public int compare(WritableComparable w1, WritableComparable w2) {
    // compare only the sensor-id part of the composite key, so all
    // (mi:tj) keys with the same mi are grouped into one reduce() call
    String s1 = w1.toString().split(":")[0];
    String s2 = w2.toString().split(":")[0];
    return s1.compareTo(s2);
  }
}
Graphical flow
• Input file:
s1 0800 x1
s1 0805 x2
s2 0920 x3
s2 0910 x4
s1 0715 x5
s3 1005 x6
• map() emits intermediate key-value pairs with composite keys:
s1:0800 x1, s1:0805 x2, s2:0920 x3, s2:0910 x4, s1:0715 x5, s3:1005 x6
• SensorPartitioner sends all keys of one sensor to the same reducer:
reducer 1: s1:0800 x1, s1:0805 x2, s1:0715 x5
reducer 2: s2:0920 x3, s3:1005 x6, s2:0910 x4
• CompositeKeyComparator sorts by sensor id, then by time stamp:
reducer 1: s1:0715 x5, s1:0800 x1, s1:0805 x2
reducer 2: s2:0910 x4, s2:0920 x3, s3:1005 x6
• KeyGroupComparator groups the keys of one sensor into a single reduce() call:
reduce(s1): x5, x1, x2; reduce(s2): x4, x3; reduce(s3): x6
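The sorting part of this secondary-sort flow can be simulated in plain Java (names are mine): a comparator over the composite "sensor:timestamp" keys orders by sensor id first and time stamp second, so each reduce group sees its readings in time order.

```java
import java.util.ArrayList;
import java.util.List;

// Pure-Java simulation of the CompositeKeyComparator ordering over
// composite keys of the form "sensorId:timestamp".
public class SecondarySortDemo {
    public static List<String> sortCompositeKeys(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort((a, b) -> {
            String[] k1 = a.split(":"), k2 = b.split(":");
            int bySensor = k1[0].compareTo(k2[0]);           // primary: sensor id
            return bySensor != 0 ? bySensor
                 : Integer.compare(Integer.parseInt(k1[1]),   // secondary: time stamp
                                   Integer.parseInt(k2[1]));
        });
        return sorted;
    }
}
```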
Joining two Datasets
• Combining input from two (or more) data sets is a common problem in Big Data Analytics
– easy to handle with Pig and Hive
– slightly more complicated with plain MapReduce
• Assumptions:
– data entries having the same key are combined
– inner join vs. outer join possible
Option 1: memory backed join
• Useful if one of the files is relatively small
– at most a few MBytes in size
• Solution:
– provide the smaller file as part of the Distributed Cache
mechanism of Hadoop
– read the file in the Mapper setup() function and store
as a global list/array/hashmap
– in map, extract the key of the data entry provided from
the large file and search in global list/array/hashmap for
an entry with the same key
– generate an intermediate key-value pair whose value combines the entries of both data sets for that key
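The memory-backed join can be sketched in plain Java (names are mine): the small data set is held in a HashMap, as the mapper's setup() would build it from the Distributed Cache, and each record of the large data set is joined against it with a simple lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of a memory-backed (inner) join: in-memory lookup replaces
// the shuffle; records of the large set without a match are dropped.
public class MemoryBackedJoin {
    public static List<String> join(Map<String, String> smallSet, List<String[]> largeSet) {
        List<String> joined = new ArrayList<>();
        for (String[] record : largeSet) {            // record = {key, value}
            String match = smallSet.get(record[0]);   // lookup in the cached small set
            if (match != null) {                      // inner join: keep matches only
                joined.add(record[0] + "\t" + record[1] + "," + match);
            }
        }
        return joined;
    }
}
```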
Option 2: Map-side join
• Useful if
– the input data sets are divided into the same number of partitions
– the input data is sorted by the same key in each source
• Often the case if the data sets are the result of previous MapReduce jobs
– All records for a particular key must reside in the same partition
• A ready-made tool (CompositeInputFormat) is available if you want to join all entries
– for customization you can write your own MapReduce job
• TupleWritable
– Writable type storing multiple Writables
– retrieve the i-th Writable with the get(i) method
– users are encouraged to implement their own serializable types in most cases
• In main()
job.setInputFormatClass(CompositeInputFormat.class);
String joinStatement = CompositeInputFormat.compose("inner",
    KeyValueTextInputFormat.class,
    new Path("/someinput"));
job.getConfiguration().set("mapreduce.join.expr", joinStatement);
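What CompositeInputFormat does for an inner join can be sketched in plain Java (names are mine): because both inputs are sorted by key and identically partitioned, matching records can be paired in a single merge pass without a shuffle. The sketch assumes each key occurs at most once per side.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a map-side merge join over two key-sorted inputs
// (each record is a {key, value} pair, keys unique per side).
public class MapSideMergeJoin {
    public static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp == 0) {           // keys match: emit the joined tuple
                out.add(left.get(i)[0] + "\t" + left.get(i)[1] + "," + right.get(j)[1]);
                i++; j++;
            } else if (cmp < 0) {
                i++;                  // advance the side with the smaller key
            } else {
                j++;
            }
        }
        return out;
    }
}
```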
If you need to customize your join
public class MapSideJoinMapper extends
    Mapper<Text, TupleWritable, Text, Text> {
  Text txtValue = new Text("");
  public void map(Text key, TupleWritable value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().length() > 0) {
      // one comma-separated record per joined data set
      String[] arr1 = value.get(0).toString().split(",");
      String[] arr2 = value.get(1).toString().split(",");
      txtValue.set(arr1[1] + arr1[2] + arr2[0]);
      context.write(key, txtValue);
    }
  }
}
Reducer Side Join
• Most generic but also most expensive case
– Both data files go through mapper and the shuffle/sort step
– Reducer combines the data emitted for the same key
• MultipleInputs
– supports MapReduce jobs that have multiple input paths with a
different InputFormat and Mapper for each path
– To simplify the logic in the reducer, secondary sorting might be required to ensure that data arrives at the reducer in the correct order
MultipleInputs.addInputPath(job, new Path(args[0]),
    TextInputFormat.class, PostsMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]),
    TextInputFormat.class, UsersMapper.class);
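The reducer-side join can be sketched in plain Java (names and the "P:"/"U:" tags are mine): each mapper tags its records with the source, the shuffle groups the tagged values by key, and one reduce call per key then combines them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the shuffle step of a reducer-side join: tagged records
// from both sources are grouped by key; each group corresponds to
// one reduce() call that would combine the sources.
public class ReducerSideJoin {
    public static Map<String, List<String>> shuffle(List<String[]> taggedRecords) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] rec : taggedRecords) {      // rec = {key, "tag:value"}
            groups.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(rec[1]);
        }
        return groups;
    }
}
```

The reducer would then split each group by tag (e.g. the user record vs. the post records for one user id) and emit the joined output; secondary sorting can guarantee that, say, the "U:" record arrives first.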