View
222
Download
2
Category
Preview:
Citation preview
Agenda
• Summarization Patterns• Filtering Patterns• Data Organization Patterns• Joins Patterns• Metapatterns• I/O Patterns• Bloom Filters
Overview
• Top-down summarization of large data sets• Most straightforward patterns• Calculate aggregates over entire data set or
groups• Build indexes
Numerical Summarizations
• Group records together by a field or set of fields and calculate a numerical aggregate per group
• Build histograms or calculate statistics from numerical values
Performance
• Perform well, especially when combiner is used
• Need to be concerned about data skew with from the key
Example
• Discover the first time a StackOverflow user posted, the last time a user posted, and the number of posts in between
• User ID, Min Date, Max Date, Count
public class MinMaxCountTuple implements Writable {private Date min = new Date();private Date max = new Date();private long count = 0; private final static SimpleDateFormat frmt =
new SimpleDateFormat( "yyyy-MM-dd'T'HH:mm:ss.SSS");
public Date getMin() { return min; } public void setMin(Date min) { this.min = min; } public Date getMax() { return max; } public void setMax(Date max) { this.max = max; } public long getCount() { return count; } public void setCount(long count) { this.count = count; }public void readFields(DataInput in) {
min = new Date(in.readLong()); max = new Date(in.readLong()); count = in.readLong();
}public void write(DataOutput out) {
out.writeLong(min.getTime());out.writeLong(max.getTime());out.writeLong(count);
}
public String toString() {return frmt.format(min) + "\t" + frmt.format(max) + "\t" +
count; }
}
public static class MinMaxCountMapper extends Mapper<Object, Text, Text,
MinMaxCountTuple> {
private Text outUserId = new Text();private MinMaxCountTuple outTuple =
new MinMaxCountTuple();
private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-MM-
dd'T'HH:mm:ss.SSS");
public void map(Object key, Text value, Context context) {
Map<String, String> parsed = xmlToMap(value.toString());
String strDate = parsed.get("CreationDate"); String userId = parsed.get("UserId");
Date creationDate = frmt.parse(strDate); outTuple.setMin(creationDate); outTuple.setMax(creationDate)outTuple.setCount(1);outUserId.set(userId);context.write(outUserId, outTuple);
}}
public static class MinMaxCountReducer extends Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple>
{
private MinMaxCountTuple result = new MinMaxCountTuple();
public void reduce(Text key, Iterable<MinMaxCountTuple> values,
Context context) {
result.setMin(null); result.setMax(null); result.setCount(0);
int sum=0;for (MinMaxCountTuple val : values) {
if (result.getMin() == null ||
val.getMin().compareTo(result.getMin()) < 0) { result.setMin(val.getMin());
}if (result.getMax() == null ||
val.getMax().compareTo(result.getMax()) > 0) {result.setMax(val.getMax());
} sum += val.getCount();
}result.setCount(sum); context.write(key, result);
}}
public static void main(String[] args) {
Configuration conf = new Configuration();String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {System.err.println("Usage: MinMaxCountDriver <in>
<out>");System.exit(2);
}
Job job = new Job(conf, "Comment Date Min Max Count");job.setJarByClass(MinMaxCountDriver.class);
job.setMapperClass(MinMaxCountMapper.class);job.setCombinerClass(MinMaxCountReducer.class);job.setReducerClass(MinMaxCountReducer.class);
job.setOutputKeyClass(Text.class);job.setOutputValueClass(MinMaxCountTuple.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);}
Inverted Index
• Generate an index from a data set to enable fast searches or data enrichment
• Building an index takes time, but can greatly reduce the amount of time to search for something
• Output can be ingested into key/value store
Counting with Counters
• Use MapReduce framework’s counter utility to calculate global sum entirely on the map side, producing no output
• Small number of counters only!!
Known Uses
• Count number of records• Count a small number of unique field
instances• Sum fields of data together
Filtering
• Discard records that are not of interest
• Create subsets of your big data sets that you want to further analyze
Known Uses
• Closer view of the data• Tracking a thread of events• Distributed grep• Data cleansing• Simple random sampling
Bloom Filtering
• Keep records that are a member of a large predefined set of values
• Inherent possibility of false positives
Known Uses
• Removing most of the non-watched values• Pre-filtering a data set prior to expensive
membership test
Top Ten
• Retrieve a relatively small number of top K records based on a ranking scheme
• Find the outliers or most interesting records
Distinct
• Remove duplicate entries of your data, either full records or a subset of fields
• That fourth V nobody talks about that much
DATA ORGANIZATION PATTERNS
Structured to Hierarchical, Partitioning, Binning,Total Order Sorting, Shuffling
Structured to Hierarchical
• Transformed row-based data to a hierarchical format
• Reformatting RDBMS data to a more conducive structure
Partitioning
• Partition records into smaller data sets
• Enables faster future query times due to partition pruning
Binning
• File records into one or more categories– Similar to partitioning, but the implementation is
different
• Can be used to solve similar problems to Partitioning
Total Order Sorting
• Sort your data set in parallel
• Difficult to apply “divide and conquer” technique of MapReduce
Shuffling
• Set of records that you want to completely randomize
• Instill some anonymity or create some repeatable random sampling
JOIN PATTERNS
Join Refresher, Reduce-Side Join w/ and w/o Bloom Filter,Replicated Join, Composite Join, Cartesian Product
Join Refresher
• A join is an operation that combines records from two or more data sets based on a field or set of fields, known as a foreign key
• Let’s go over the different types of joins before talking about how to do it in MapReduce
How to implement?
• Reduce-Side Join w/ and w/o Bloom Filter• Replicated Join• Composite Join
• Cartesian Product stands alone
Reduce Side Join
• Two or more data sets are joined in the reduce phase
• Covers all join types we have discussed– Exception: Mr. Cartesian
• All data is sent over the network– If applicable, filter using Bloom filter
Performance
• Need to be concerned about data skew• 2 PB joined on 2 PB means 4 PB of network
traffic
Replicated Join
• Inner and Left Outer Joins• Removes need to shuffle any data to the
reduce phase
• Very useful, but requires one large data set and the remaining data sets to be able to fit into memory of each map task
Performance
• Fastest type of join• Map-only
• Limited based on how much data you can safely store inside JVM
• Need to be concerned about growing data sets
• Could optionally use a Bloom filter
Composite Join
• Leverages built-in Hadoop utilities to join the data
• Requires the data to be already organized and prepared in a specific way
• Really only useful if you have one large data set that you are using a lot
Performance
• Good performance, join operation is done on the map side
• Requires the data to have the same number of partitions, partitioned in the same way, and each partition must be sorted
Cartesian Product
• Pair up and compare every single record with every other record in a data set
• Allows relationships between many different data sets to be uncovered at a fine-grain level
Performance
• Massive data explosion!• Can use many map slots for a long time
• Effectively creates a data set size O(n2)– Need to make sure your cluster can fit what you
are doing
Job Chaining
• One job is often not enough• Need a combination of patterns discussed to
do your workflow
• Sequential vs Parallel
Chain Folding
• Each record can be submitted to multiple mappers, then a reducer, then a mapper
• Reduces amount of data movement in the pipeline
Customizing I/O
• Unstructured and semi-structured data often calls for a custom input format to be developed
Generating Data
• Generate lots of data in parallel from nothing
• Random or representative big data sets for you to test your analytics with
External Source Output
• You want to write MapReduce output to some non-native location
• Direct loading into a system instead of using HDFS as a staging area
Known Uses
• Write directly out to some non-HDFS solution– Key/Value Store– RDBMS– In-Memory Store
• Many of these are already written
External Source Input
• You want to load data in parallel from some other source
• Hook other systems into the MapReduce framework
Known Uses
• Skip the staging area and load directly into MapReduce
• Key/Value store• RDBMS• In-Memory store
Partition Pruning
• Abstract away how the data is stored to load what data is needed based on the query
Known Uses
• Discard unneeded files based on the query• Abstract data storage from query, allowing for
powerful middleware to be built
References
• “MapReduce Design Patterns” – O’Reilly 2012
• www.github.com/adamjshook/mapreducepatterns
• http://en.wikipedia.org/wiki/Bloom_filter
Recommended