IS470 Guided Research in Information Systems
Distributed Data Analysis Using Map-Reduce
Student: Shahfik Amasha ([email protected])
Faculty Supervisor: Asst Prof Jason Woodard ([email protected])



Abstract: Hadoop clusters are relatively easy to set up and manage, making them an invaluable tool for researchers to quickly get up and running on crunching large datasets. The Hadoop benchmarks show a linear trend in processing time across different sizes of data for the same number of nodes, and a leveling off in throughput performance. This is useful for knowing the optimum number of nodes to provision in a cluster. Finally, there is huge potential for furthering the project into interactive use cases and real-time data analytics.

1 Introduction Increasingly, businesses are collecting data about their operations, often without even being aware of it. From customer sales data to employee performance data, the typical business is not taking full advantage of the deep insights that its data can provide. However, there is a silver lining, as businesses have recently recognized the value of the data that has been sitting within their organizations. IDC estimates that the worldwide market for business analytics is worth $25 billion in 2009, a growth of 4 percent over 2008 (IBM, 2009). IT service providers who are attuned to their customers' needs are picking up on the industry's growing interest by providing solutions for the various functional areas of a business. Companies are using analytics in functions such as supply chain, customer relationship management and pricing; Amazon is a good example of a company that has leveraged the power of data analytics to provide customized recommendations to its customers.

In the recent Global CIO Study 2009 by IBM, a survey of global CIOs, eighty-three percent of respondents identified business intelligence and analytics as a competitive advantage for their organizations. In addition, IBM CIO Pat Toole commented that CIOs are investing in business analytics capabilities to help them improve decision making, which can be key to new growth markets (IBM, 2009). In The McKinsey Quarterly (2007), the consulting firm identified analytics as one of the eight business technology trends to watch in the next decade. Davenport (2006) highlighted that as firms in many industries offer similar products using comparable technologies, business processes are among the last remaining points of differentiation, and analytics allows competitors to get the most value from those processes by making the best decisions at every level of the firm.
Now that businesses are waking up to this new reality, they are looking for solutions that could help them build this capability. MapReduce is one framework for doing analytics on large amounts of data, and Hadoop is a solution that implements this framework. Other solutions include building data warehouses and implementing database clusters, both of which require a large capital investment and high operational costs. In this paper, we use Hadoop, an Apache Foundation project based on the Java programming language, for doing data analytics on a large dataset.

The foundation of this paper is a collaboration begun two years ago between the Singapore Management University School of Information Systems (SMU SIS) and a large taxi company. The goal was to use data analytics to gain better insights into the dynamics of a taxi network from the GPS location data that the company provides. Leveraging the ongoing efforts with the company, this project uses the MapReduce model to accelerate the study of these dynamics. The current process for discovering these dynamics is to query a large database, which is extremely slow due to the size of the dataset: it stands at a couple of hundred gigabytes at the moment, and a typical query covers only a day's worth of data yet takes about half a day to complete. MapReduce is one of the methods used in distributed data analysis, first made popular through its use by Google for crawling and indexing the web. In 2004, Google published a paper titled “MapReduce: Simplified Data Processing on Large Clusters” which sparked off a series of events that led to the formation of Hadoop. Hadoop forms the basis of the data analytics used in this project.

2 Hadoop Hadoop has saved tremendous amounts of time and money in several use cases. The New York Times used 100 machines on the Amazon Elastic Compute Cloud to convert 4 terabytes of scanned archives into PDFs within 24 hours. In a contest of speed, Hadoop broke the world record for the fastest sort of a terabyte of data with a time of 209 seconds in 2008; the following year, it took just 62 seconds to perform the same feat. The Hadoop project consists of various sub-projects (Figure 1), each with a specific role around MapReduce. The use cases introduced later make use of the Hadoop Core sub-project, which allows a developer to create programs that do data crunching on the Hadoop cluster.

Figure 1: Hadoop sub-projects. Source: Hadoop: The Definitive Guide

The MapReduce API (Figure 2) sets up the mapping and reduction process for the developer. All the developer has to do is write code within the mapping and reduction functions to determine the operations that will be executed on the incoming data. This will become clearer in the use cases later.

Figure 2: The MapReduce API. Source: Hadoop: The Definitive Guide
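Hadoop's actual API is Java (the developer subclasses Mapper and Reducer), but the map-shuffle-reduce contract it sets up can be sketched in plain Python. The sketch below is illustrative only: it counts records per taxi status, reusing the sample log line from the dataset section, with the other two lines made up for the example.

```python
from collections import defaultdict

def map_fn(byte_offset, line):
    """Mapper: receives each line as a <key, value> pair and emits (status, 1)."""
    fields = [f.strip() for f in line.split(",")]
    yield (fields[-1], 1)          # last field is the taxi status

def reduce_fn(key, values):
    """Reducer: sums the counts for one key."""
    yield (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle phase: group mapper output by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for offset, line in enumerate(records):
        for k, v in map_fn(offset, line):
            groups[k].append(v)
    out = {}
    for k in sorted(groups):
        for rk, rv in reduce_fn(k, groups[k]):
            out[rk] = rv
    return out

logs = [
    "01/03/2009 00:00:00, SH1234S, 1809481, 103.94063, 1.32617, 0, PAYMENT",
    "01/03/2009 00:00:05, SH5678T, 1809482, 103.85000, 1.30000, 40, FREE",
    "01/03/2009 00:00:10, SH1234S, 1809481, 103.94100, 1.32700, 10, FREE",
]
print(run_mapreduce(logs, map_fn, reduce_fn))  # {'FREE': 2, 'PAYMENT': 1}
```

The developer only writes the bodies of `map_fn` and `reduce_fn`; the framework owns the grouping and distribution in between.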


Once a program is developed and the dataset uploaded to the HDFS shared filesystem, the job is submitted to Hadoop (Figure 3), which manages its distribution across the cluster. In the diagram below, the client node is the developer's machine, the jobtracker node is the master node in the Hadoop cluster, and the tasktracker nodes are the slave nodes; a Hadoop cluster can be made up of one or more slave nodes. Each tasktracker continuously sends heartbeats to the jobtracker to tell it that it is still alive, and the jobtracker automatically stops assigning jobs to a tasktracker after a user-defined period of inactivity. While a tasktracker is working on a job, it continues sending heartbeats with progress information. When a job finishes, the client is informed and the client JVM exits.

Figure 3: Job submission. Source: Hadoop: The Definitive Guide
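The liveness bookkeeping described above can be modeled in a few lines. This is a toy sketch, not Hadoop's actual implementation: the class, method names, and tick-based clock are invented for illustration, and the real expiry interval is a configurable wall-clock timeout.

```python
class JobTracker:
    """Toy model of the jobtracker's heartbeat bookkeeping (illustrative only)."""
    def __init__(self, timeout=3):
        self.timeout = timeout      # ticks of silence before a tracker is considered dead
        self.last_heartbeat = {}    # tasktracker id -> last tick it was heard from
        self.clock = 0

    def tick(self):
        self.clock += 1

    def heartbeat(self, tracker_id, progress=None):
        # A tasktracker reports in, optionally with progress information.
        self.last_heartbeat[tracker_id] = self.clock

    def is_alive(self, tracker_id):
        # The jobtracker stops assigning work to trackers silent past the timeout.
        return self.clock - self.last_heartbeat.get(tracker_id, -10**9) <= self.timeout

jt = JobTracker(timeout=3)
jt.heartbeat("tt1"); jt.heartbeat("tt2")
for _ in range(5):                  # five ticks pass; only tt1 keeps reporting
    jt.tick()
    jt.heartbeat("tt1")
print(jt.is_alive("tt1"), jt.is_alive("tt2"))  # True False
```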

Hadoop provides an administration interface (Figure 4) to query the status and details of a job. The detailed view includes various performance indicators, such as bytes read from the HDFS shared filesystem. It also keeps a history of all the jobs that have been submitted previously.


Figure 4: The admin interface

3 Use Cases Two use cases will be presented: the first generates secondary data from the primary data source, and the second generates a frequency analysis of the GPS locations. These two use cases were chosen because their results require a sweep of the entire dataset, something Hadoop is particularly good at. This raises the question of when to use a traditional RDBMS and when to use MapReduce. Below is a summary of the characteristics of each approach.

            Traditional RDBMS           MapReduce
Data size   Gigabytes                   Petabytes
Access      Interactive and batch       Batch
Updates     Read and write many times   Write once, read many times
Structure   Static schema               Dynamic schema
Integrity   High                        Low
Scaling     Nonlinear                   Linear

Table 1: Comparison between RDBMS and MapReduce. Source: Hadoop: The Definitive Guide

In both of these use cases, we simply need to output a single file containing the results, which fits the write-once, read-many update characteristic. As you will see later, the raw data is cleaned and processed before being operated on, though this is not strictly necessary, as Hadoop has a flexible dynamic schema and operates best on text files.

3.1 The dataset As mentioned in the introduction, faculty at the SMU School of Information Systems (SIS) have been collaborating with the taxi company to analyze GPS traces and trip data from its fleet of about 15,000 taxis. This effort has resulted in a dataset of over 4 billion GPS observations from 150 million trips (about 300 GB in uncompressed form). A major bottleneck in the analysis is the time required to run algorithms over the entire dataset, which can take weeks on a single machine. As a result, the published results have been limited to analyses of a day's or week's worth of data at a time. Thus, we explored the possibility of a distributed systems approach to break this bottleneck, and chose Hadoop.


In the diagram below, Hermes-1 is a single machine and Beowulf-5 is a 6-node cluster (1 master and 5 slaves). They represent the resources available to this paper within the university; the use cases discussed here are executed on Beowulf-5. Cirrus-x is part of the Open Cirrus Cloud Computing Testbed initiative by the Infocomm Development Authority of Singapore (IDA); Cirrus-15 and Cirrus-60 are theoretical projections of 15-node and 60-node clusters respectively. Distributing the computation using Hadoop should allow the analysis of larger subsets of the data in less time – potentially even the full year's worth of data in hours, or a day's worth in seconds.

Figure 5: Time required versus scale of analysis that can be done

The raw dataset provided by the taxi company is in the following format: Date time, vehicle no., driver ID, long, lat, speed, status

E.g.:

01/03/2009 00:00:00, SH1234S, 1809481,103.94063, 1.32617, 0, PAYMENT

This dataset is cleaned for errors and processed for anonymity; the cleaned dataset is provided to us and its preparation is not part of this paper. The final record format that the use cases operate on is as follows: LogSerialNo,Datetime,vehicleID,driverID,long,lat,speed,status,week,DayOfWeek,day,hour

E.g.:

20090301000000000,2009/03/01 00:00:00,454,1809481,103.94063,1.32617,0,3,1520,0,01,00
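Since the records are plain comma-separated text, pulling named fields out of a line is a one-step split. The helper below is a hypothetical sketch (the field names come from the format above; `parse_record` is not part of the actual project code):

```python
# Field names as listed in the processed record format above.
FIELDS = ["LogSerialNo", "Datetime", "vehicleID", "driverID",
          "long", "lat", "speed", "status", "week", "DayOfWeek", "day", "hour"]

def parse_record(line):
    """Split one processed log line into a dict of named string fields."""
    values = line.strip().split(",")
    assert len(values) == len(FIELDS), "malformed record"
    return dict(zip(FIELDS, values))

rec = parse_record("20090301000000000,2009/03/01 00:00:00,454,1809481,"
                   "103.94063,1.32617,0,3,1520,0,01,00")
print(rec["vehicleID"], rec["lat"])  # 454 1.32617
```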

The hardware configuration for each node in the Beowulf-5 cluster is the same and is as follows:

Item        Description
Processor   2 x AMD Opteron 250
RAM         4 GB

3.2 Use Case 1 The first use case generates a file which shows the start and end time of a particular taxi in a particular state. The algorithm is illustrated as follows:


Figure 6: Use case 1 algorithm


An input split is a section of the original dataset; in this use case, splits are 64MB in size. Tasktrackers are fed splits, and the developer's program is executed on them.

1. Mapper. Each line in the file is read and presented as a <key, value> pair to the Mapper. In this instance, the key is the byte position of the start of the line and the value is the line itself. I parse the line and extract the vehicle ID and log serial number, then build a custom key combining the two pieces of information. A <VehicleSnPair, value> pair is then written out, where the value is the original line.

2. The GroupComparator sorts the records by vehicle ID. The output is the same key value pair as above.

3. The SortComparator then sorts by log serial number within each vehicle ID. The output is the same key value pair as above.

4. The output from the SortComparator usually goes to the Reducer. But in this use case, a custom Partitioner is implemented to determine which reducer, and hence which machine, a given vehicle ID goes to.

5. The Reducer writes the relevant information to the output.

The algorithm is executed on varying input file sizes, from 16MB to 30GB. The results are shown in two separate graphs below, covering 16MB to 1GB (Figure 7) and 1GB to 30GB (Figure 8). In Figure 7, there is an increasing marginal return as the file size increases from 16MB to 64MB, due to the file split size of 64MB. Beyond 64MB, the time taken is roughly linear in file size, which is also observed in Figure 8. The jobs are executed on a configuration of 5 slave nodes (denoted n=5). Each job is executed three times and the average time is taken.
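The secondary-sort pattern in steps 1-4 can be simulated in plain Python. The sketch below is illustrative only (the vehicle IDs, serial numbers, and line contents are made up): records sharing a vehicle ID end up grouped together, ordered by log serial number, mirroring the SortComparator and GroupComparator roles.

```python
from itertools import groupby

# Each record: (vehicle_id, log_serial, original_line) -- made-up sample data.
records = [
    ("SH1234S", 3, "line-c"),
    ("SH5678T", 1, "line-d"),
    ("SH1234S", 1, "line-a"),
    ("SH1234S", 2, "line-b"),
]

def partition(vehicle_id, n_reducers):
    # Custom Partitioner: all records of one vehicle go to the same reducer.
    return hash(vehicle_id) % n_reducers

# Shuffle: sort by the composite (vehicle, serial) key (the SortComparator's
# job), then group by vehicle ID alone (the GroupComparator's job).
records.sort(key=lambda r: (r[0], r[1]))
result = {veh: [line for _, _, line in grp]
          for veh, grp in groupby(records, key=lambda r: r[0])}
print(result)  # {'SH1234S': ['line-a', 'line-b', 'line-c'], 'SH5678T': ['line-d']}
```

The Reducer then sees each vehicle's records already in serial-number order, so emitting start and end times per state is a single pass.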

Figure 7: Average job completion time on 5 slave nodes for file sizes between 16MB and 1024MB

[Figure 7 data: average completion times of 29.00, 37.67, 44.00, 45.00, 48.67, 61.00, 87.00 and 98.67 seconds across file sizes from 16MB to 1024MB; x-axis: size in MB, y-axis: seconds; n=5.]


Figure 8: Average job completion time on 5 slave nodes for file sizes between 1GB and 30GB

Figures 9 and 10 show the same job executed with 3 and 4 slave nodes, to understand how the number of nodes affects job completion time. Figure 9 shows the overall trend, which is relatively linear. Figure 10 shows a smaller range of file sizes. There is no significant change in the pattern, other than the increased time taken with fewer slave nodes.

Figure 9: Average job completion time over 3, 4 and 5 nodes for file sizes 16MB to 8096MB

[Figure 8 data: average completion times of 98.67, 325.00, 671.67, 1,022.33, 1,314.67, 1,717.00 and 2,792.67 seconds across file sizes from 1GB to 30GB; x-axis: size in MB, y-axis: seconds; n=5.]

[Figure 9 data: average completion time versus file size (16MB to 8096MB) for n = 3, 4 and 5 slave nodes; x-axis: size in MB, y-axis: seconds.]


Figure 10: Average job completion time over 3, 4 and 5 nodes for file sizes 16MB to 1024MB

From the above graphs we note that there is an increasing marginal return as the file size increases. The point at which the increase stops tells us the optimum file size to work on. Figure 11 shows that from a file size of 4096MB onward, throughput is constant, with no further gain in marginal returns from larger file sizes. The results are, however, limited: we would normally expect to see a point of diminishing marginal returns. Given more time, it may be possible to find the point at which that happens, though Figure 8 does show that completion time scales almost linearly up to 30GB.
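The throughput plotted in Figure 11 is simply file size divided by average completion time. As a worked example, pairing the largest use case 1 run (30GB) with its reported average completion time of 2,792.67 seconds:

```python
def throughput_mb_per_s(size_mb, seconds):
    """Throughput as plotted in Figure 11: megabytes processed per second."""
    return size_mb / seconds

# Largest run in use case 1: 30 GB (= 30 * 1024 MB) in an average of
# 2792.67 s on the 5-slave-node configuration.
tp = throughput_mb_per_s(30 * 1024, 2792.67)
print(round(tp, 1))  # 11.0
```

At around 11 MB/s for the largest run, this is consistent with the leveled-off region of the throughput curve.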

Figure 11: Throughput over 3, 4 and 5 nodes for file sizes 16MB to 8096MB

As the size of the analysis required for gathering insights into the taxi data increases, Hadoop has been shown to scale linearly to a month's worth of data. Further experiments will be needed to determine whether Hadoop can scale to a year's worth of data, because patterns observed over that period would prove useful for answering research questions.

[Figure 10 data: average completion time versus file size (16MB to 1024MB) for n = 3, 4 and 5 slave nodes; x-axis: size in MB, y-axis: seconds.]

[Figure 11 data: throughput versus file size (16MB to 8096MB) for n = 3, 4 and 5 slave nodes; x-axis: size in MB, y-axis: MB/s.]


3.3 Use Case 2 The second use case uses the GPS data to plot a colored frequency map of Singapore. The taxi company has divided Singapore into 86 different zones, most of them bordering neighborhood estates. These colored maps help identify the zones with the highest frequency. In this use case, the color represents the number of log records falling in each zone. The following illustration (Figure 12) shows a representation of the algorithm.

Figure 12: Use case 2 algorithm

This use case is much simpler than the previous one, consisting of only a Mapper and a Reducer. In addition, however, it uses the Java Topology Suite (JTS) for simple GIS queries.

1. Mapper. For each record, it goes through each zone to check whether the record's location falls within the zone. The zone array consists of polygon definitions for each of the 86 zones (shown in Figure 13). If a containing zone is found, the Mapper writes a <key, value> pair where the key is the zone number; if no zone is found, possibly due to anomalous data, the key is -1.

2. Reducer. It simply counts the number of records for each zone. The output file is then fed into a JSP web application, where it is parsed and colored using the Google Maps API (Figure 14).
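The two steps above can be sketched end to end. The real implementation uses JTS in Java for the containment test; the Python stand-in below uses the classic ray-casting point-in-polygon algorithm, and the two rectangular "zones" are made up for illustration (the real 86 zone polygons are not reproduced here).

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting point-in-polygon test (stand-in for the JTS query)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Toggle on each polygon edge a rightward ray from (x, y) crosses.
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

# Two made-up rectangular zones standing in for the 86 real polygons
# (vertices as (longitude, latitude) pairs).
zones = {
    1: [(103.90, 1.30), (103.95, 1.30), (103.95, 1.35), (103.90, 1.35)],
    2: [(103.80, 1.28), (103.86, 1.28), (103.86, 1.33), (103.80, 1.33)],
}

def map_fn(lon, lat):
    # Mapper: emit the zone containing the point, or -1 for anomalous data.
    for zone_id, poly in zones.items():
        if point_in_polygon(lon, lat, poly):
            return zone_id
    return -1

points = [(103.94063, 1.32617), (103.85, 1.30), (104.5, 1.0)]
counts = {}
for lon, lat in points:             # Reducer: count records per zone key.
    z = map_fn(lon, lat)
    counts[z] = counts.get(z, 0) + 1
print(counts)  # {1: 1, 2: 1, -1: 1}
```

The resulting per-zone counts are exactly what the JSP front end consumes to color the map.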


Figure 13: Map of Singapore divided into the 86 zones

Figure 14: Map of Singapore with colors representing the frequency

This use case displays the potential for using Hadoop not simply as an end in itself, but as a means to an end, where that end could be the visualization of data. Colored maps showing historical passenger demand could prove useful to taxi drivers, as they could influence behavior and thus increase the overall efficiency of the taxi system.

4 Alternatives The alternatives to Hadoop include parallel databases such as those by Oracle and IBM. Increasingly, column-oriented databases are an excellent alternative for similar workload profiles, and various research papers have been published comparing these alternatives. There are also other MapReduce-style solutions under research, such as Dryad by Microsoft Research and Clustera by the University of Wisconsin-Madison; these frameworks are still under heavy development.

5 Related Work Prior work on MapReduce and Hadoop has mostly focused on their inner mechanisms, such as schedulers, performance debugging and file systems. “Improving MapReduce Performance in Heterogeneous Environments” (Zaharia, Konwinski, Joseph, Katz, & Stoica, 2008) looks at improving the scheduler for heterogeneous compute environments. As mentioned in the previous section on comparisons between MapReduce and other alternatives, Pavlo et al. (2009) compared two approaches to large-scale data analysis: the MapReduce model and parallel databases. They tested Hadoop against two parallel DBMSs, Vertica and another system from a major relational DB vendor. Other work studying the use of Hadoop on datasets includes Loebman et al. (2009), who compared Hadoop and a commercial relational database for astrophysical simulations, and Cary et al. (2009), who explored the use of MapReduce for solving spatial problems.

6 Conclusion With the results from the use cases above, the taxi project now has the capability of analyzing data that spans multiple months or years and receiving results much more quickly than before. This saves faculty members time by getting answers to their questions faster, and could result in better quality research and better funding. In addition, the findings and efficiencies could flow back to the taxi company and its drivers, reaping economic and environmental benefits such as reduced fuel costs, increased driver revenue and lower carbon emissions. The beginning of this paper made extensive reference to the business use of analytics and how it creates a new dimension for competition. Hadoop has democratized data analytics; however, there is still some way to go before it is easy enough for businesses to take advantage of its capabilities. With heavy development still continuing, there is certainly potential for Hadoop to be improved upon for business analytics.

7 Future Work We have seen in the first use case that the scaling of completion time with dataset size is largely linear past the input split size. However, where does it end? Storage space limitations on the Beowulf-5 cluster caused us to stop at a 30GB file size; with the dataset at a couple hundred gigabytes, there is enough data to experiment on if those limitations are eased. One of the ongoing efforts is to get approval to use the Open Cirrus cloud computing testbed for this project, enabling faster and larger analyses. The second use case provides a glimpse of what could become an interactive visual representation of the data on a web front end, driven by a Hadoop back-end. Even though the typical job is batch in nature, higher-level projects such as HBase, Pig and Hive give the developer the ability to drive real-time analytics while making the developer's life easier.

Bibliography

1. Abadi, D. J., Madden, S. R., & Hachem, N. (2008). Column-stores vs. row-stores: how different are they really? SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (pp. 967-980).

2. Cary, A., Sun, Z., Hristidis, V., & Rishe, N. (2009). Experiences on Processing Spatial Data with MapReduce. 21st International Conference on Scientific and Statistical Database Management (pp. 302-319). Springer.

3. Davenport, T. H. (2006). Competing on Analytics. Harvard Business Review, 84(1), 98-107.


4. IBM. (2009, July 28). IBM to Acquire SPSS Inc. to Provide Clients Predictive Analytics Capabilities. Retrieved Nov 2009, from IBM: http://www-03.ibm.com/press/us/en/pressrelease/27936.wss

5. IBM. (2009, Sep 10). New IBM Study Highlights Analytics As Top Priority For Today’s CIO. Retrieved November 2009, from IBM: http://www-03.ibm.com/press/us/en/pressrelease/28314.wss

6. Loebman, S., Nunley, D., Kwon, Y., Howe, B., Balazinska, M., & Gardner., J. P. (2009). Analyzing Massive Astrophysical Datasets: Can Pig/Hadoop or a Relational DBMS Help? Workshop on Interfaces and Architectures for Scientific Data Storage.

7. Manyika, J. M., Roberts, R. P., & Sprague, K. L. (2007). Eight Business Technology Trends to Watch. The McKinsey Quarterly , 1-11.

8. Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., et al. (2009). A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD '09: Proceedings of the 35th SIGMOD International Conference on Management of Data (pp. 165-178). New York.

9. Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., et al. (2005). C-store: a column-oriented dbms. VLDB '05: Proceedings of the 31st international conference on Very large data bases (pp. 553-564). VLDB Endowment.

10. Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R., & Stoica, I. (2008). Improving mapreduce performance in heterogeneous environments. 8th Symposium on Operating Systems Design and Implementation (pp. 29-42). USENIX Association.