
Cascading on Starfish

Fei Dong
Duke University

[email protected]

December 10, 2011

1 Introduction

Hadoop [6] is a software framework installed on a cluster to permit large scale distributed data analysis. It provides the robust Hadoop Distributed File System (HDFS) as well as a Java-based API that allows parallel processing across the nodes of the cluster. Programs employ a Map/Reduce execution engine which functions as a fault-tolerant distributed computing system over large data sets.

In addition to Hadoop, which is a top-level Apache project, there are related sub-projects that address workflow on Hadoop, such as Hive [8], a data warehouse framework used for ad hoc querying (with an SQL-style query language); Pig [9], a high-level data-flow language and execution framework whose compiler produces sequences of Map/Reduce programs for execution within Hadoop; and Cascading [2], an API for defining and executing fault-tolerant data processing workflows on a Hadoop cluster. All of these projects simplify work for developers, allowing them to write more traditional procedural or SQL-style code that, under the covers, creates a sequence of Hadoop jobs. In this report, we focus on Cascading as the main data-parallel workflow choice.

1.1 Cascading Introduction

Cascading is a Java application framework that allows you to more easily write scripts to access and manipulate data inside Hadoop. There are a number of key features provided by this API:

• Dependency-Based 'Topological Scheduler' and MapReduce Planning - Two key components of the Cascading API are its ability to schedule the invocation of flows based on their dependencies, with the execution order being independent of construction order, often allowing for concurrent invocation of portions of flows and cascades, and its intelligent conversion of the steps of the various flows into map-reduce invocations against the Hadoop cluster.

• Event Notification - The various steps of the flow can perform notifications via callbacks, allowing the host application to report and respond to the progress of the data processing.

• Scriptable - The Cascading API has scriptable interfaces for Jython, Groovy, and JRuby.

Although Cascading provides the above benefits, we still need to consider the balance between performance and productivity when using it. Marz [5] gives some rules for optimizing Cascading flows, and experienced Cascading users can gain some performance improvement by following those high-level principles. One interesting question is whether there are ways to improve workflow performance without expert knowledge; in other words, we want to optimize workflows at the physical level. In Starfish [7], the authors demonstrate the power of self-tuning jobs on Hadoop, and Herodotou has successfully applied this optimization technology to Pig. This report discusses auto-optimization of Cascading with the help of Starfish.

2 Terminology

First, we introduce some concepts widely used in Cascading.

• Stream: data input and output.

• Tuple: a stream is composed of a series of Tuples. Tuples are sets of ordered data.

• Tap: an abstraction on top of Hadoop files. Source - a source tap is read from and acted upon; actions on source taps result in pipes. Sink - a sink tap is a location to be written to. A sink tap can later serve as a source in the same script.

• Operations: define what to do on the data, e.g. Each(), Group(), CoGroup(), Every().

• Pipe: ties Operations together. When an operation is executed upon a Tap, the result is a Pipe. In other words, a flow is a pipe with data flowing through it. Pipes can use other Pipes as input, thereby wrapping themselves into a series of operations.

• Filter: records pass through it to remove useless ones, e.g. RegexFilter(), And(), Or().


• Aggregator: a function applied after a group operation, e.g. Count(), Average(), Min(), Max().

• Step: a logical unit in a Flow. It represents a Map-only or MapReduce job.

• Flow: a combination of a Source, a Sink, and Pipes.

• Cascade: a series of Flows. (A minimal code sketch tying these concepts together follows this list.)
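To make the terminology concrete, the following minimal sketch (not taken from the report) assembles a simple word-count Flow with the Cascading 1.x Java API and wraps it in a Cascade: a source Tap feeds a Pipe assembly (Each, GroupBy, Every), and a Flow binds the assembly to a sink Tap. The class name WordCountSketch, the token-splitting regex, and the command-line paths are illustrative assumptions.

import java.util.Properties;

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCountSketch {
  public static void main(String[] args) {
    // Taps: abstractions over HDFS paths (the source is read, the sink is written)
    Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
    Tap sink = new Hfs(new TextLine(), args[1], true);

    // Pipe assembly: split each line into words, group by word, count each group
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, Fields.GROUP, new Count());

    // A Flow is the combination of a source Tap, a sink Tap and the Pipe assembly
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, WordCountSketch.class);
    Flow flow = new FlowConnector(properties).connect(source, sink, assembly);

    // A Cascade is a series of Flows, scheduled by their data dependencies
    Cascade cascade = new CascadeConnector().connect(flow);
    cascade.complete();
  }
}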

3 Cascading Structure

Figure 1: A Typical Cascading Structure.

In Figure 1 we can clearly see the Cascading structure. The top level is called a Cascade, which is composed of several Flows. Each Flow defines a source Tap, a sink Tap, and Pipes. We also note that one Flow can have multiple Pipes performing data operations such as filtering, grouping, and aggregation.

Internally, a Cascade is constructed through the CascadeConnector class, by building an internal graph that makes each Flow a 'vertex' and each file an 'edge'. A topological walk on this graph will touch each vertex in order of its dependencies. When a vertex has all its incoming edges available, it will be scheduled on the cluster. Figure 2 gives an example whose goal is to compute per-second and per-minute request counts from Apache logs. The dataflow is represented as a graph. The first step is to import and parse the source data; two following steps then process the "second" and "minute" counts respectively.

The execution order for Log Analysis is:

1. Calculate the dependency between flows, which gives Flow1 → Flow2.

2. Start Flow1:
   2.1 initialize the "import" flowStep and construct Job1;
   2.2 submit the "import" Job1 to Hadoop.

3. Start Flow2:
   3.1 initialize the "minute and second statistics" flowSteps and construct Job2 and Job3;
   3.2 submit Job2 and Job3 to Hadoop.

The complete code is attached in the Appendix.


Figure 2: Workflow Sample: Log Analysis.

4 Cascading on Starfish

4.1 Change to the New Hadoop API

We notice that the current Cascading is based on the old Hadoop API. Since Starfish only works with the new API, the first task is to connect these heterogeneous systems. Herodotos worked on supporting the old Hadoop API in Starfish, while I worked on replacing the old API in Cascading with the new one. Although the Hadoop community recommends the new API and provides some upgrade advice [11], the translation still took considerable effort. One reason is the system's complexity (40K lines of code); we sacrificed some advanced features such as S3fs, TemplateTap, ZipSplit, stats reports, and Strategy to make the change work. Finally, we provide a revised version of Cascading that only uses the new Hadoop API. In the meantime, Herodotos recently updated Starfish to support the old API as well, but this report only considers the new-API version of Cascading.
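To illustrate the kind of mechanical translation this porting involves, the sketch below (not taken from the modified Cascading source) contrasts the old org.apache.hadoop.mapred job-submission pattern, summarized in the leading comments, with the new org.apache.hadoop.mapreduce API that Starfish expects. The identity-copy job and the class name NewApiSketch are assumptions made purely for the example.

// Old API (org.apache.hadoop.mapred): mappers and reducers implement interfaces,
// jobs are configured through JobConf and submitted with JobClient.runJob(conf).
// New API (org.apache.hadoop.mapreduce): mappers and reducers extend abstract
// classes, jobs are configured through Job and submitted with waitForCompletion().
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "identity-copy");         // replaces: JobConf conf = new JobConf(...)
    job.setJarByClass(NewApiSketch.class);
    job.setMapperClass(Mapper.class);                 // identity mapper from the new API
    job.setReducerClass(Reducer.class);               // identity reducer from the new API
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1); // replaces: JobClient.runJob(conf)
  }
}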

4.2 Cascading Profiler

First, we need to decide when to capture the profiles. Since the modified Cascading uses the new Hadoop API, the place to enable the Profiler is the same as for a single MapReduce job. We choose the return point of blockTillCompleteOrStopped in cascading.flow.FlowStepJob to collect job execution files when a job completes. When all jobs are finished and the execution files are collected, we build a profile graph to represent the dataflow dependencies among the jobs. In order to build the job DAG, we decouple the hierarchy of Cascades and Flows. As we saw before, the Log Analysis workflow has two dependent Flows and ultimately submits three MapReduce jobs to Hadoop. Figure 3 shows the original workflow in Cascading and the translated job graph in Starfish. We propose the following algorithm to build the job DAG.

Algorithm 1 Build Job DAG Pseudo-Code

procedure BuildJobDAG(flowGraph)
  for flow ∈ flowGraph do                          ▷ Iterate over all flows
    for flowStep ∈ flow.flowStepGraph do           ▷ Add the job vertices
      Create the jobVertex from the flowStep
    end for
    for edge ∈ flow.flowStepGraph.edgeSet do       ▷ Add the job edges within a flow
      Create the corresponding edge in the jobGraph
    end for
  end for
  for flowEdge ∈ flowGraph.edgeSet do              ▷ Iterate over all flow edges (source → target)
    sourceFlowSteps ← flowEdge.sourceFlow.getLeafFlowSteps
    targetFlowSteps ← flowEdge.targetFlow.getRootFlowSteps
    for sourceFS ∈ sourceFlowSteps do
      for targetFS ∈ targetFlowSteps do
        Create the job edge from corresponding source to target
      end for
    end for
  end for
end procedure
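The following Java sketch renders Algorithm 1 in code. It is not the actual Starfish implementation: the FlowStep and Flow classes below are simplified placeholders that mirror the pseudo-code's accessors (the step graph within a flow, plus its leaf and root steps), and the job DAG is kept as a plain adjacency list.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class JobDagBuilder {

  // Simplified stand-ins for Cascading's FlowStep and Flow (placeholders, not the real classes).
  static class FlowStep {
    final String name;
    FlowStep(String name) { this.name = name; }
    public String toString() { return name; }
  }

  static class Flow {
    final List<FlowStep> steps = new ArrayList<FlowStep>();
    final List<FlowStep[]> stepEdges = new ArrayList<FlowStep[]>(); // {source, target} pairs within the flow

    List<FlowStep> rootFlowSteps() {                   // steps with no incoming edge
      Set<FlowStep> targets = new HashSet<FlowStep>();
      for (FlowStep[] e : stepEdges) targets.add(e[1]);
      List<FlowStep> roots = new ArrayList<FlowStep>();
      for (FlowStep s : steps) if (!targets.contains(s)) roots.add(s);
      return roots;
    }

    List<FlowStep> leafFlowSteps() {                   // steps with no outgoing edge
      Set<FlowStep> sources = new HashSet<FlowStep>();
      for (FlowStep[] e : stepEdges) sources.add(e[0]);
      List<FlowStep> leaves = new ArrayList<FlowStep>();
      for (FlowStep s : steps) if (!sources.contains(s)) leaves.add(s);
      return leaves;
    }
  }

  // Builds the job DAG as an adjacency list: one job vertex per FlowStep, the edges of each
  // flow's internal step graph, plus edges from the leaf steps of a source flow to the root
  // steps of each flow that depends on it.
  static Map<FlowStep, List<FlowStep>> buildJobDag(List<Flow> flows, List<Flow[]> flowEdges) {
    Map<FlowStep, List<FlowStep>> jobGraph = new HashMap<FlowStep, List<FlowStep>>();
    for (Flow flow : flows) {
      for (FlowStep step : flow.steps)                 // add the job vertices
        jobGraph.put(step, new ArrayList<FlowStep>());
      for (FlowStep[] edge : flow.stepEdges)           // add the job edges within a flow
        jobGraph.get(edge[0]).add(edge[1]);
    }
    for (Flow[] flowEdge : flowEdges) {                // iterate over all flow edges (source -> target)
      for (FlowStep sourceFS : flowEdge[0].leafFlowSteps())
        for (FlowStep targetFS : flowEdge[1].rootFlowSteps())
          jobGraph.get(sourceFS).add(targetFS);        // connect leaf steps to root steps
    }
    return jobGraph;
  }

  public static void main(String[] args) {
    // Log Analysis example: Flow1 has the "import" step, Flow2 has two independent statistics steps.
    Flow flow1 = new Flow();
    flow1.steps.add(new FlowStep("import"));

    Flow flow2 = new Flow();
    flow2.steps.add(new FlowStep("minute stats"));
    flow2.steps.add(new FlowStep("second stats"));

    List<Flow> flows = new ArrayList<Flow>();
    flows.add(flow1);
    flows.add(flow2);
    List<Flow[]> flowEdges = new ArrayList<Flow[]>();
    flowEdges.add(new Flow[] { flow1, flow2 });        // Flow1 -> Flow2

    // Prints: import -> [minute stats, second stats]; the two statistics jobs have no edge between them
    System.out.println(buildJobDag(flows, flowEdges));
  }
}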

4.3 Cascading What-if Engine and Optimizer

The What-if Engine predicts the behavior of a workflow W. To achieve that, the DAG profiles, data model, cluster description, and DAG configurations are given as parameters. Building the configuration graph shares the same idea as building the job graph. We capture the return point of initializeNewJobMap in cascading.cascade, where we process what-if requests and exit the program afterwards.


Figure 3: Log Analysis. (a) Cascading representation; (b) dataflow translation.

For the Cascading optimizer, I make use of the Starfish data-flow optimizer and feed it through the related interface. When running the Optimizer, we keep the default optimizer mode as crossjob + dynamic.

4.4 Program Interface

The usage of Cascading on Starfish is simple and user-friendly. Users do not need to change their source code or import a new package. Some example invocations are listed below.

profile cascading jar loganalysis.jar
Profiler: collect task profiles while running a workflow and generate the profile files in PROFILER_OUTPUT_DIR.

execute cascading jar loganalysis.jar
Execute: only run the program, without collecting profiles.

analyze cascading details workflow_20111017205527
Analyze: list basic or detailed statistical information regarding all jobs found in PROFILER_OUTPUT_DIR.

whatif details workflow_20111018014128 cascading jar loganalysis.jar
What-if Engine: ask a hypothetical question about a particular workflow and return the predicted profiles.

optimize run workflow_20111018014128 cascading jar loganalysis.jar
Optimizer: execute a MapReduce workflow using the configuration parameter settings automatically suggested by the cost-based Optimizer.


5 Evaluation

5.1 Experiment Environment

In the experimental evaluation, we used Hadoop clusters running on Amazon EC2. The detailed setup is as follows.

• Cluster Type: 10 m1.large nodes. Each node has 7.5 GB memory, 2 virtual cores, and 850 GB storage, and is configured to run 3 map tasks and 2 reduce tasks concurrently.

• Hadoop Version: 0.20.203.

• Cascading Version: modified v1.2.4 (using the new Hadoop API).

• Data Sets: 20 GB TPC-H [10], 10 GB random text, 10 GB page graphs for PageRank, 5 GB paper-author pairs.

• Optimizer Type: cross-job and dynamic.

5.2 Description of Data-parallel Workflows

We evaluate the end-to-end performance of the optimizer on seven representative workflows used in different domains.

Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF calculates weights representing the importance of each word to a document in a collection. The workflow contains three jobs: 1) count the total terms in each document; 2) calculate the number of documents containing each term; 3) calculate tf * idf. Job 2 depends on Job 1 and Job 3 depends on Job 2.
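For reference, the weight computed in job 3 follows the standard formulation weight(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the frequency of term t in document d (derived from the per-document totals of job 1), df(t) is the number of documents containing t (from job 2), and N is the total number of documents.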

Top 20 Coauthor Pairs: Suppose you have a large dataset of papers and authors, and you want to know whether there is any correlation between being collaborative and being a prolific author. This takes three jobs: 1) group authors by paper; 2) generate co-authorship pairs (map) and count them (reduce); 3) sort by count. Job 2 depends on Job 1 and Job 3 depends on Job 2.

Log Analysis: Given an Apache log, parse it with the specified format, compute the per-minute and per-second request counts separately, and dump each result. There are three jobs: 1) import and parse the raw log; 2) group by minute and count; 3) group by second and count. Job 2 and Job 3 depend on Job 1.

PageRank: The goal is to rank web pages. The algorithm can be implemented as an iterative workflow containing two jobs: 1) join the two datasets on pageId; 2) calculate the new ranking of each web page. Job 2 depends on Job 1.
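For reference, job 2 typically applies the standard PageRank update PR(p) = (1 − d)/N + d · Σ PR(q)/out(q), where the sum runs over the pages q that link to p, d is the damping factor (commonly 0.85), N is the number of pages, and out(q) is the number of outgoing links of page q; the join in job 1 brings each page's current rank together with its adjacency list.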

TPC-H: We use the TPC-H benchmark as a representative example of a complex SQL query. Query 3 is implemented as a four-job workflow: 1) join the order and customer tables, with filter conditions on each table; 2) join the lineitem table with the result of job one; 3) calculate the volume by discount; 4) compute the sum after grouping by some keys. Job 2 depends on Job 1, Job 3 depends on Job 2, and Job 4 depends on Job 3.

HTML Parser and WordCount: The workflow processes a collection of web source pages. It has three jobs: 1) parse the raw data with an HTML SAX parser; 2) count the number of words per URL; 3) aggregate the total word count. Job 2 depends on Job 1 and Job 3 depends on Job 2.

User-defined Partition: It splits the dataset into three parts by key range, and some statistics are collected on each part. It runs as three jobs, each responsible for one part of the dataset. There is no dependency between the three jobs, which means they can run in parallel.

The source code for the experiments has been submitted to the Starfish repository.

5.3 Speedup with Starfish Optimizer

Figure 4 shows the timeline for the TPC-H Query 3 workflow. When using the profiler, the workflow takes 20% more time, while the final cross-job optimizer yields a 1.3x speedup. Figure 5 analyzes the speedup for six workflows. The optimizer is effective for most workflows, with the only exception being the user-defined partition. One possible reason is that this workflow generates three jobs in parallel which compete with each other for the limited cluster resources (30 available map slots and 20 available reduce slots).

Figure 4: Running TPC-H Query 3 with no Optimizer, with the Profiler, and with the Optimizer.


Figure 5: Speedup with the Starfish Optimizer. (a) Log Analysis; (b) Coauthor Pairs; (c) PageRank; (d) User-defined Partition; (e) TF-IDF; (f) HTML Parser and WordCount.

5.4 Overhead on Profiler

Figure 6 shows the profiling overhead by comparing against the same job run with profiling turned off. On average, profiling consumes 20% of the running time.

5.5 Comparison with Pig

We are also very interested in comparing various workflow frameworks on the same datasets. We ran the identical workflows written by Harold Lim. Figure 7 shows that the performance of Pig overwhelms that of Cascading even when Cascading is optimized. We see several possible reasons:

• Cascading does not support Combiners. One article [4] discusses hand-rolled combiner optimizations as a workaround.

• Pig does much optimization work at the physical and logical layers, while Cascading does not optimize its planner as well. In the user-defined partition workflow, Pig uses only one MapReduce job while Cascading generates three jobs.

Figure 6: Overhead of profiling. (a) Log Analysis; (b) Coauthor Pairs; (c) PageRank; (d) User-defined Partition; (e) TF-IDF; (f) HTML Parser and WordCount.

Figure 7: Pig versus Cascading on performance.

• Cascading only uses a custom InputFormat and InputSplit, called MultiInputFormat and MultiInputSplit, regardless of whether it runs a single job or reads a single input source.

• Cascading's CoGroup() join is not meant to be used with large data files.

• Using RegexSplit to parse files into tuples is not efficient.

• Compression is disabled.


6 Conclusion

Cascading aims to help developers build powerful applications quickly and simply, through a well-reasoned API, without needing to think in MapReduce, while leaving the heavy lifting of data distribution, replication, distributed process management, and liveness to Hadoop.

With the Starfish Optimizer, we can speed up the original Cascading programs by 20% to 200% without modifying any source code. We also observe that although Cascading workflows are similar to Pig scripts in how they are expressed, the experiments show distinct differences in results: Pig's performance is much better than Cascading's in most cases.

Considering code size, learning cost, and performance, we recommend Pig as the more suitable and performant choice for simple queries. We also find that Cascalog [3], a data processing and querying library for Clojure, is another option for writing workflows on Hadoop.

We notice that Cascading 2.0 [1] is about to be released, which promises large performance improvements and fixes for bugs in the previous version. As future work, once Starfish supports the old API, we can import the latest version of Cascading and rerun the experiments.

7 Acknowledgement

I would like to thank Herodotos Herodotou, the lead contributor of Starfish, who gave me a great deal of help on the system design and Hadoop's internal mechanisms. This report could not have been done without him. I also want to thank Harold Lim, who supported me on the benchmarks.

I also thank Professor Shivnath Babu for his help, for supervising this work, and for holding meetings for us to exchange ideas.

References

[1] Cascading 2.0 Early Access. http://www.cascading.org/2011/10/cascading-20-early-access.html.

[2] Cascading. http://www.cascading.org/.

[3] Cascalog. https://github.com/nathanmarz/cascalog.

[4] Pseudo Combiners in Cascading. http://blog.rapleaf.com/dev/2010/06/10/pseudo-combiners-in-cascading/.

[5] Tips for Optimizing Cascading Flows. http://nathanmarz.com/blog/tips-for-optimizing-cascading-flows.html.

[6] Apache Hadoop. http://hadoop.apache.org/.

[7] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In CIDR, 2011.


[8] Hive. http://hadoop.apache.org/hive/.

[9] Pig. http://hadoop.apache.org/pig/.

[10] TPC. TPC Benchmark H Standard Specification, 2009. http://www.tpc.org/tpch/spec/tpch2.9.0.pdf.

[11] Upgrading to the New Map Reduce API. http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api.

8 Appendix

8.1 Complete Source Code of Log Analysis

Listing 1: LogAnalysis.java

package loganalysis;

import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.util.*;

import cascading.cascade.*;
import cascading.flow.*;
import cascading.operation.aggregator.Count;
import cascading.operation.expression.ExpressionFunction;
import cascading.operation.regex.RegexParser;
import cascading.operation.text.DateParser;
import cascading.pipe.*;
import cascading.scheme.TextLine;
import cascading.tap.*;
import cascading.tuple.Fields;

public class LogAnalysis extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // set the Hadoop parameters
    Properties properties = new Properties();
    Iterator<Map.Entry<String, String>> iter = getConf().iterator();
    while (iter.hasNext()) {
      Map.Entry<String, String> entry = iter.next();
      properties.put(entry.getKey(), entry.getValue());
    }

    FlowConnector.setApplicationJarClass(properties, LogAnalysis.class);
    FlowConnector flowConnector = new FlowConnector(properties);
    CascadeConnector cascadeConnector = new CascadeConnector();

    String inputPath = args[0];
    String logsPath = args[1] + "/logs/";
    String arrivalRatePath = args[1] + "/arrivalrate/";
    String arrivalRateSecPath = arrivalRatePath + "sec";
    String arrivalRateMinPath = arrivalRatePath + "min";

    // create an assembly to import an Apache log file and store on DFS
    // declares: "time", "method", "event", "status", "size"
    Fields apacheFields = new Fields("ip", "time", "method", "event", "status", "size");
    String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$";
    int[] apacheGroups = { 1, 2, 3, 4, 5, 6 };
    RegexParser parser = new RegexParser(apacheFields, apacheRegex, apacheGroups);
    Pipe importPipe = new Each("import", new Fields("line"), parser);

    // create tap to read a resource from the local file system, if not an
    // url for an external resource
    // Lfs allows for relative paths
    Tap logTap = new Hfs(new TextLine(), inputPath);
    // create a tap to read/write from the default filesystem
    Tap parsedLogTap = new Hfs(apacheFields, logsPath);

    // connect the assembly to source and sink taps
    Flow importLogFlow = flowConnector.connect(logTap, parsedLogTap, importPipe);

    // create an assembly to parse out the time field into a timestamp
    // then count the number of requests per second and per minute

    // apply a text parser to create a timestamp with 'second' granularity
    // declares field "ts"
    DateParser dateParser = new DateParser(new Fields("ts"), "dd/MMM/yyyy:HH:mm:ss Z");
    Pipe tsPipe = new Each("arrival rate", new Fields("time"), dateParser, Fields.RESULTS);

    // name the per second assembly and split on tsPipe
    Pipe tsCountPipe = new Pipe("tsCount", tsPipe);
    tsCountPipe = new GroupBy(tsCountPipe, new Fields("ts"));
    tsCountPipe = new Every(tsCountPipe, Fields.GROUP, new Count());

    // apply expression to create a timestamp with 'minute' granularity
    // declares field "tm"
    Pipe tmPipe = new Each(tsPipe, new ExpressionFunction(new Fields("tm"),
        "ts - (ts % (60 * 1000))", long.class));

    // name the per minute assembly and split on tmPipe
    Pipe tmCountPipe = new Pipe("tmCount", tmPipe);
    tmCountPipe = new GroupBy(tmCountPipe, new Fields("tm"));
    tmCountPipe = new Every(tmCountPipe, Fields.GROUP, new Count());

    // create taps to write the results to the default filesystem, using the given fields
    Tap tsSinkTap = new Hfs(new TextLine(), arrivalRateSecPath, true);
    Tap tmSinkTap = new Hfs(new TextLine(), arrivalRateMinPath, true);

    // a convenience method for binding taps and pipes, order is significant
    Map<String, Tap> sinks = Cascades.tapsMap(Pipe.pipes(tsCountPipe, tmCountPipe),
        Tap.taps(tsSinkTap, tmSinkTap));

    // connect the assembly to the source and sink taps
    Flow arrivalRateFlow = flowConnector.connect(parsedLogTap, sinks,
        tsCountPipe, tmCountPipe);

    // optionally print out the arrivalRateFlow to a graph file for import
    // into a graphics package
    // arrivalRateFlow.writeDOT( "arrivalrate.dot" );

    // connect the flows by their dependencies, order is not significant
    Cascade cascade = cascadeConnector.connect(importLogFlow, arrivalRateFlow);

    // execute the cascade, which in turn executes each flow in dependency order
    cascade.complete();
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new LogAnalysis(), args);
    System.exit(res);
  }
}
