Study of Map Reduce and related technologies
Applied management Research Project
(January 2011 - April 2011)
Vinod Gupta School of Management
IIT Kharagpur
Submitted By
Renjith Peediackal
Roll no. 09BM8040
MBA, Batch of 2009-11
Under the guidance of
Prof. Prithwis Mukherjee
Project Guide
VGSOM, IIT Kharagpur
Certificate of Examination
This is to certify that we have examined the summer project report on “Study of Map Reduce
and related technologies” and accept this in partial fulfillment for the MBA degree for which it
has been submitted. This approval does not endorse or accept every statement made, opinion
expressed or conclusion drawn, as recorded in this internship report. It is limited to the
acceptance of the project report for the applicable purpose.
Examiner (External) ____________________________________________________
Additional Examiner ____________________________________________________
Supervisor (Academic) __________________________________________________
Date: ________________________________________________________________
UNDERTAKING REGARDING PLAGIARISM
This declaration is for the purpose of the Applied Management Research Project on Study of
Map Reduce and related technologies, which is in partial fulfillment for the degree of Master
of Business Administration.
I hereby declare that the work in this report is authentic and genuine to the best of my knowledge
and I would be liable for appropriate punitive action if found otherwise.
Renjith Peediackal
Roll No. 09BM8040
Vinod Gupta School of Management
IIT Kharagpur
ACKNOWLEDGEMENT
I am highly thankful to Professor Prithwis Mukherjee, esteemed professor at Vinod
Gupta School of Management, IIT Kharagpur, for giving me the chance to carry out my work
on Map Reduce technology as a review paper for this project. I am also thankful to him
for the regular interactions and guidance he provided over the course of this project. I also
thank my classmates in the course ‘IT for BI’ for their support and feedback on my work.
Renjith Peediackal,
VGSOM,IIT Kharagpur
Page | 5
UNDERTAKING REGARDING PLAGIARISM 3
ACKNOWLEDGEMENT 4
EXECUTIVE SUMMARY 7
INTRODUCTION 8
THE CASE FOR MAP REDUCE 9
The complexity of input data 9
Information is sparse in the humungous data 9
Appetite for personalization 9
EXAMPLE OF A PROGRAMME USING INTERNET DATA 9
Recommendation systems 9
Some of the problems with recommendation systems 10
Popularity 11
Integration of expert opinion 12
Integration of internet data 12
WHAT IS DATA IN FLIGHT? 14
Map and reduce. 14
HOW DOES A MAP REDUCE PROGRAM WORK 15
Map 16
Shuffle or sort 17
Reduce 17
How is this new way helpful in accommodating complex internet data for our recommendation system? 18
Brute power 18
Fault tolerance 19
Criticisms 21
Why is it still valuable? 22
An example 23
Sequential web access-based recommendation system 24
PARALLEL DB VS MAPREDUCE 29
SUMMARY OF ADVANTAGES OF MR 29
WHAT IS ‘HADOOP’? 30
HYBRID OF MAP REDUCE AND RELATIONAL DATABASE 30
PIG 31
WHAT DOES HIVE DO? 32
NEW MODELS: CLOUD 33
PATENT CONCERNS 33
SCOPE OF FUTURE WORK 34
REFERENCES 34
Executive Summary
The data explosion on the internet has forced technologists to redesign the
tools and techniques used to handle that data. ‘Map Reduce’, ‘Hadoop’ and ‘data in flight’
have been buzzwords in the last few years. Much of the information available about these
technologies is, on the one hand, absolutely technology oriented or, on the other hand, drawn
from the marketing documents of the companies which produce tools based on the technology. While
technological language is difficult for a normal business user to understand, marketing
documents tend to be superficial. We therefore saw the need for an article which explains
the technology in a lucid and simple way, so that the importance and future prospects of this
movement can be understood by business users, and adopted this mission as part of the
Applied Management Research Project in the 4th semester.
Web data available today is mostly dynamic, sparse and unstructured, and processing
it with existing techniques like parallel relational database management systems is
nearly impossible. The programming effort needed to do this the older way is humungous,
and it is doubtful whether anybody could do it in practice. New technologies like ‘Hadoop’
create a platform for parallel data processing in an easier and more cost-effective way. In
this research paper, the author tries to explain ‘Hadoop’ and related technologies in a
simple way. In future, the author can continue similar efforts to demystify many of the
tools and techniques coming from the Hadoop community and make them friendly to the
business community.
Introduction
Web data available today is mostly dynamic, sparse and unstructured, and most of the
people who generate or use this content are customers or prospects for business organizations.
The management community, and especially marketers, is therefore eager to
analyze the information coming out of this internet data so as to personalize their products and
services. Processing this data can be vital for governments and social organizations too.
Earlier systems like parallel relational databases were not efficient at processing
web data. ‘Data in flight’ refers to the technology of capturing and processing dynamic data
as and when it is produced and consumed. Google introduced the Map Reduce paradigm for
efficiently processing web data. The Apache organization, following in Map Reduce's footsteps,
gave birth to the open source movement ‘Hadoop’ to develop programs and tools to process
internet data.
In this review paper, a humble attempt has been made to explain this technology in a simple
and clear manner. I hope the readers can appreciate the effort.
The case for Map Reduce
The complexity of input data
In today's content-rich world, the enormity and complexity of the data available to decision
makers is huge. When we add the internet to the picture, the volume explodes. And this
data is unstructured, and is produced and consumed by random people in different parts of
the globe.
Information is sparse in the humungous data
The information is actually hidden in the sea of data produced and consumed.
Appetite for personalization
The perpetual challenge for marketers is to personalize their products to attract and
retain customers. They look to information that can be derived from the internet to
understand their customers better.
Example of a programme using internet data
Recommendation systems
Take the example of a recommendation system. Customer Y buys product X from an
e-commerce site after going through a number of products X1, X2, X3, X4. And this behavior is
repeated by 100 customers! And there are customers who drop out of shopping after going
through X1 to X3.
Should we have suggested X to those people before they exited the site? But if we
recommend the wrong product, how will it affect the buying decision?
Questions like these also arise: “Why am I selling X number of figurines, and not Y [a
higher number] number of figurines? What pages are the people who buy things reading? If I'm
running promotions, which websites are sending me traffic with people who make purchases?”
[Avinash Kaushik, analytics expert]. And also: what kind of content do I need to develop on my
site so as to attract the right set of people? On what kinds of sites should your URL be present
so that you get the maximum number of referrals? How many visitors quit after seeing the
homepage? What different designs might make them go forward? Are users clicking on the right
links in the right fashion on your website (site overlay)? What is the bounce rate? How can we
save money on PPC schemes? And how can we get data on competitors' marketing activities and
how they influence customers?
Many people use the net to sell, or at least market, their products and services.
How can we analyze user behavior in such cases to benefit our organization? An example:
using the information from something like a Facebook shopping page.
And the most critical question for the analytics professional is: can we build robust
algorithms to predict such behavior? How much data do we need to process? Is point-of-sale
history enough?
Some of the problems with recommendation systems
Popularity
A recommendation system's basic logic is to process historical data on customer behaviour
and predict future behavior. So naturally the most “popular” products top the list of
recommendations. Here popular means previously favored the most by the same segment of
customers, for a particular product category. This is dangerous to companies in many ways:
1. Customers will not be satisfied perpetually by the same products.
2. Companies have to create niche products and up-sell and cross-sell them to customers to
satisfy and retain them, and thus to be successful in the market. A popularity-based
system ruins these possibilities of exploration!
3. The opportunity of selling a product is lost!
4. Lack of personalization leads to broken relations.
Integration of expert opinion
How do we integrate the inferences given by experts in a particular business domain
into the process of statistical analysis? This is a mix of art and science, and nobody knows the
right blend. Will it undermine the basic concept of data-driven decision making?
Integration of internet data
To attack the current deficiencies, one approach is to use more data to increase
the scope and relevance of the analysis. The users of mail, social networking groups, or niche
groups for specific purposes are all consumers and potential customers. Bloggers and
information providers can also play the role of opinion leaders. So if we are able to integrate
the information scattered over the internet into our decision-making models, we can better
understand different segments of customers and their current and future needs.
But the fact is that the data available on the internet is not at all friendly to statistical
analysis using current DBMS technology at a normal level of cost. It is dynamic, sparse and
mostly unstructured. And it is hard to say who could program such a system to handle web data
using current technology.
Growth of data: mindboggling.
– Published content: 3-4 Gb/day
– Professional web content: 2 Gb/day
– User generated content: 5-10 Gb/day
– Private text content: ~2 Tb/day (200x more)
(Ref: Raghu Ramakrishnan
http://www.cs.umbc.edu/~hillol/NGDM07/abstracts/slides/Ramakrishnan_ngdm07.pdf)
Questions asked of this data, by very demanding business users:
– Can we do Analytics over Web Data / User Generated Content?
– Can we handle TB of text data / GB of new data getting added to the internet each day?
– Structured Queries, Search Queries for unstructured data?
– Can we expect the tools to give answers at “Google-Speed”?
That gives us a strong case for adopting the new technology of data in flight. ‘Map
Reduce’ is a technology developed by Google for similar purposes. These
technologies are explained one by one in the following paragraphs.
What is Data in flight?
Earlier data was at ‘rest’!
In the normal concept of a DBMS, data is at rest; queries hit that static data
and fetch results.
Now data is just flying in!
The new concept of ‘data in flight’ treats the already prepared query as static,
and collects dynamic data as and when it is produced and consumed. But this needs more
powerful tools to handle the enormity, complexity and sparse nature of the information
contained in the data.
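The contrast can be illustrated with a toy “standing query” in Python (invented for this paper, not a real streaming API): the query is fixed up front, and the result is updated as each new record flies in.

```python
def standing_query(stream, predicate):
    # The query is prepared once, before any data arrives
    count = 0
    for record in stream:        # data "flies in" one record at a time
        if predicate(record):
            count += 1
            yield count          # the result is updated on every match

# A small stream of page-view events
clicks = iter(["home", "product", "home", "checkout"])
results = list(standing_query(clicks, lambda page: page == "home"))
print(results)  # [1, 2]
```

Here the data moves past a stationary query, rather than a query being run against stored, stationary data.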
Map and reduce.
A map operation is needed to translate the sparse information available in numerous
formats into a form which can be processed easily by an analytical tool. Once the
information is in a simpler and structured form, it can be reduced to the required results.
A standard example:
Word count!
Given a document, how many times does each word occur?
But in real world it can be:
• Given our search logs, how many people click on result 1?
• Given our Flickr photos, how many cat photos are there by users in each
geographic region?
• Given our web crawl, what are the 10 most popular words?
Word count and twitter
• Tweets can be used to get early warnings of epidemics like swine flu.
• Tweets can be used to understand the ‘mood’ of people in a region, for
different purposes, even subliminal marketing.
• Software created by Dr Peter Dodds and Dr Chris Danforth of the University
of Vermont collects sentences from blogs and ‘tweets’, zeroing in on the
happiest and saddest days of the last few years.
• Can it prevent social crises?
How does a Map Reduce program work?
The programmer has to specify two methods, Map and Reduce; the rest is taken care of by the
platform.
Map
map (k, v) -> <k', v'>*
1. Specify a map function that takes a key (k) / value (v) pair.
a. key = document URL, value = document contents
b. “document1”, “to be or not to be”
2. The output of map is (potentially many) key/value pairs: <k', v'>*
3. In our case, output (word, “1”) once per word in the document:
a. “to”, “1”
b. “be”, “1”
c. “or”, “1”
d. “to”, “1”
e. “not”, “1”
f. “be”, “1”
Shuffle or sort
Items which are logically near are brought physically near to each other.
– “to”, “1”
– “to”, “1”
– “be”, “1”
– “be”, “1”
– “not”, “1”
– “or”, “1”
Reduce
reduce (k', <v'>*) -> <k', v'>*
• The reduce function combines the values for a key:
– “be”, “2”
– “not”, “1”
– “or”, “1”
– “to”, “2”
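The whole pipeline above, map, shuffle and reduce, can be imitated in a few lines of Python (a toy single-machine simulation written for this paper; the function names are the author's own, not Hadoop's API, and a real platform runs the phases across many machines):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(docs):
    # Emit (word, 1) once per word in each document
    for url, contents in docs.items():
        for word in contents.split():
            yield (word, 1)

def shuffle(pairs):
    # Bring logically near items physically near: sort, then group pairs by key
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # Combine the values for each key
    return {word: sum(count for _, count in group) for word, group in grouped}

docs = {"document1": "to be or not to be"}
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The platform's real contribution is that each phase can be spread over thousands of machines without the programmer changing the two functions.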
For different use cases the functions within map and reduce differ, but the architecture
and the supporting platform remain the same. For example, our recommendation system can
use a map function that turns any meaningful association between two product names X and Y in
any online discussion into a (product pair, count) record. Later, in the reduce step, based on
the count of combinations arising, the information can be reduced into a suggestion of X given Y.
This is a very simplified explanation of the algorithms; refer to papers on algorithms such as
parallel affinity propagation to understand the complexity of such a process. This is a new way
of writing algorithms. Let us see how it helps to accommodate humungous internet data.
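A hedged sketch of that pairing idea in Python (the discussions and the pairing logic are invented for illustration; real text mining over online discussions is far more involved):

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for online discussions, each tagged with the products it mentions
discussions = [
    {"camera", "tripod"},
    {"camera", "tripod", "bag"},
    {"camera", "bag"},
]

# Map step: emit ((X, Y), 1) for every product pair mentioned together
pairs = [
    (tuple(sorted(combo)), 1)
    for mentioned in discussions
    for combo in combinations(mentioned, 2)
]

# Reduce step: sum the counts for each product pair
counts = Counter()
for pair, one in pairs:
    counts[pair] += one

print(counts[("camera", "tripod")])  # seen together in 2 discussions
```

Pairs with a high enough count become candidates for “customers interested in X may also like Y” suggestions.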
How is this new way helpful in accommodating complex internet data for our
recommendation system?
Brute power
One view could be that map reduce algorithms are effective because they:
– use the brute power of many machines to map a huge chunk of sparse data
into a small table of dense data;
– do the complex and time-consuming part of the “task” on the new, small
and dense data in the reduce part;
– in other words, separate the huge data from the time-consuming part of the
algorithm, albeit at the cost of a lot of disk space.
Fault tolerance
There are two aspects of fault tolerance:
1. Transactions should not be affected by machine failure. One approach is to make
the machine less prone to failure and to ask the user to restart the transaction from
scratch in case of failure. Permanently saving transaction data somewhere (to save
some of the steps) would reduce the speed of the transaction. Normal DB operation
emphasizes this kind of fault tolerance in its implementation.
Fault tolerance: RDBMS school of thought
2. Complex query execution. Here the only point of fault tolerance is that the query
should be resumed from whichever state it can be restarted. Saving data in
intermediate forms can be desirable. Map Reduce emphasizes this kind of
tolerance in its implementation; the extra time taken to save intermediate data
permanently is offset by parallelization.
Fault tolerance: MR school of thought
Hierarchy of Parallelism:
Effectively, the Map Reduce philosophy of fault tolerance can be summarized with the following
diagram.
Cycle of brute force fault tolerance
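The MR school of thought can be sketched as a toy scheduler (written for this paper, not Hadoop's actual implementation; the failure pattern is hard-coded for illustration). Because each task's output is saved when it finishes, a failed task is simply re-run on another machine and completed work is never repeated:

```python
def run_tasks(tasks, fails_on_first_attempt):
    completed = {}   # persisted intermediate output: finished work is never redone
    tried = set()
    pending = list(tasks)
    attempts = 0
    while pending:
        task = pending.pop(0)
        attempts += 1
        if task in fails_on_first_attempt and task not in tried:
            # The machine running this task "dies"; reassign the task to
            # another free machine instead of restarting the whole job
            tried.add(task)
            pending.append(task)
            continue
        completed[task] = task.upper()   # stand-in for the real map work
    return completed, attempts

done, attempts = run_tasks(["t1", "t2", "t3"], fails_on_first_attempt={"t2"})
print(attempts)  # 4 attempts: three tasks plus one retry of t2
```

The cost of a failure is one task, not one job, which is what makes millions of cheap, unreliable machines usable.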
Criticisms
1. A giant step backward in the programming paradigm for large-scale data-intensive
applications.
2. A sub-optimal implementation, in that it uses brute force instead of indexing.
3. Not novel at all; it represents a specific implementation of well-known techniques
developed 25 years ago.
4. Missing most features of current DBMSs.
5. Incompatible with all of the tools DBMS users have come to depend on.
Why is it still valuable?
The biggest reason is that it helps programmers to logically manage the complexity of the
data.
Also, the intermediate permanent writes enable two wonderful features we
need for our recommendation system.
1. It raises fault tolerance to such a level that we can employ millions
of cheap computers to get our work done.
Searching entire social networks for any mention of our product by different
segments of customers can be assigned to numerous computers and done fast enough.
2. It brings dynamism and load balancing. If one of the social networks has
denser data and our set of machines takes a very long time to process it, we can
reassign the task to many more free machines to speed up the process.
This is essential when we don't know the nature of the data we are going to
process.
So the parallelism achieved by a parallel DBMS (normally done by dividing an execution plan
across a limited number of processors) pales in comparison with what is achieved by map
reduce.
• At large scales, super-fancy reliable hardware still fails, albeit less often,
so software still needs to be fault-tolerant.
• Commodity machines without fancy hardware give better performance per dollar.
• Using more memory to speed up querying has its own implications for
tolerance and cost.
An example
This example invites you into the complex world of the sequential web access-based
recommendation system. Its working is explained in the following diagram:
Sequential web access-based recommendation system
It goes through web server logs, mines the patterns in the sequences and then creates a pattern
tree. The pattern tree is continuously modified using data from different servers. [Zhou
et al]
System architecture of sequential web access system [Zhou et al]
The target is to create a pattern tree.
And when a particular user has to be catered with a suggestion:
– his access pattern is compared with the entire tree of patterns;
– the portions of the tree most similar to the user's pattern
are selected; and
– the branches of those nodes are suggested.
Some details of the algorithm
• Let E be a set of unique access events, representing the web resources accessed by
users, i.e. web pages, URLs, topics or categories.
• A web access sequence S = e1e2...en is an ordered collection (sequence) of access events.
Suppose we have a set of web access sequences over the event set E = {a, b, c, d, e, f}; a
sample database would look like this:
Session ID Web access sequence
1 abdac
2 eaebcac
3 babfae
4 abbacfc
Access events can be classified into frequent and infrequent, based on whether their
frequency crosses a threshold level, and a tree consisting of the frequent access events
can be created.
Length of sequence Sequential web access patterns with support
1 a:4, b:4, c:3
2 aa:4, ab:4, ac:3, ba:4, bc:3
3 aac:3, aba:4, abc:3, bac:3
4 abac:3
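The support counts can be re-derived from the sample database with a short brute-force sketch (a toy in-memory version written for illustration; a real system would mine these patterns with map reduce jobs over the logs):

```python
from itertools import product

sessions = ["abdac", "eaebcac", "babfae", "abbacfc"]

def occurs_in(pattern, session):
    # True if pattern appears in session as an ordered
    # (not necessarily contiguous) subsequence of access events
    events = iter(session)
    return all(e in events for e in pattern)

def support(pattern):
    # Number of sessions in which the pattern occurs
    return sum(occurs_in(pattern, s) for s in sessions)

# Frequent single events are a, b and c (support >= 3);
# enumerate candidate patterns over them up to length 4
frequent = {}
for length in range(1, 5):
    for candidate in product("abc", repeat=length):
        pattern = "".join(candidate)
        s = support(pattern)
        if s >= 3:
            frequent[pattern] = s

print(frequent["abac"])  # 3
```

Real miners prune the candidate space instead of enumerating it, but the support definition is the same.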
Finally the tree is formed:
The Map and reduce
1. A map job can be designed to process the logs and create the pattern tree.
2. The task is divided among thousands of cheap machines using the Map Reduce platform.
3. The dynamic-data, static-query model of data in flight is very helpful for modifying
the main tree.
4. The tree structure can be stored efficiently by altering the physical storage through
sorting and partitioning.
5. Then, based on the user's access pattern, we have to select a few parts of the tree. This
can be designed as a reduce job which runs across the tree data.
DBMS for the same case?
• Map
– A huge database of access logs would have to be uploaded to a DB and then
updated at regular intervals to reflect changes in site usage.
– Then a query has to be written to extract a tree-like data structure out of this data
behemoth, which changes shape continuously!
– An execution plan, simplistic and non-dynamic in nature, has to be made. This is
ineffective.
– The plan would have to be divided among many parallel engines, and this requires
expertise in parallel programming.
• Reduce
– During reduce phase the entire tree has to be searched for the existence of
resembling patterns.
– This also will be ineffective in an execution plan driven model as explained
above.
• With the explosion of data, and the increased need for personalization in
recommendations, Map Reduce becomes the most suitable pattern.
Parallel DB vs MapReduce
• An RDBMS is good when:
– the application is query-intensive,
– whether the data is semi-structured or rigidly structured.
• MR is effective for:
– ETL and “read once” data sets,
– complex analytics,
– semi-structured and unstructured data,
– quick-and-dirty analyses,
– limited-budget operations.
Summary of advantages of MR
• Storage-system independence
• Automatic parallelization
• Load balancing
• Network and disk transfer optimization
• Handling of machine failures
• Robustness
• Improvements to the core library benefit all users of the library!
• Ease for programmers!
What is ‘Hadoop’?
Based on the Map Reduce paradigm, the Apache foundation has given rise to a program for
developing tools and techniques on an open source platform. This program and the resultant
technology are referred to as ‘Hadoop’.
The icon of Hadoop
Hybrid of Map Reduce and Relational Database
As pointed out earlier, RDBMS technology is suitable when the data to be processed is
structured. Many of the intermediate forms produced while processing web data can also be
structured. For that reason, many researchers have come up with hybrid
systems, where the unstructured data is taken care of by the Map Reduce platform and the
structured part is handled by an RDBMS. HadoopDB is one such proposed system.
Pig
If we have established a Hadoop infrastructure, can we use it for processing repetitive
tasks? How can we use it as efficiently as a relational database handles such tasks? The
answer is a programming language called ‘Pig’. Pig helps us write an execution plan, just like
the ones we have in a relational database.
• Pig allows the programmer to control the flow of data with prewritten code. It is suitable
when the tasks are repetitive and the plans can be envisaged early on.
What does hive do?
Users of databases are not often technology masters. They may be familiar with existing
platforms, and these platforms tend to generate SQL-like queries. We need a program to
convert these traditional SQL queries into MapReduce jobs, and the one created by the Hadoop
movement is Hive. So from the outside it looks like an SQL query, or a button click which
generates an SQL query, but the underlying processing is done the Map Reduce way.
New models: cloud
1. Many services are available on the cloud, like Amazon Web Services (Amazon Elastic
Compute Cloud - http://aws.amazon.com/ec2/).
2. The user gets MR services by entering the input text or site name, the required output,
etc., without going into the technical details.
3. Almost infinite scalability.
4. New, more efficient business models.
Patent Concerns
Excerpts from a Slashdot comment on Jan 19, 2011
“But the very public complaints didn't stop Google from demanding a patent for
MapReduce; nor did it stop the USPTO from granting Google's request (after four rejections).
On Tuesday, the USPTO issued U.S. Patent No. 7,650,331 to Google for inventing Efficient Large-
Scale Data Processing.”
• Will Google enforce the patent?
• If it does, it will hamper the growth of the Hadoop community.
Scope of future work
Hadoop is an open source movement, so the author can contribute to the Hadoop user
community by studying the tools and techniques further and sharing the findings on the internet.
Implementing Hadoop in a lab and understanding the detailed aspects of a map reduce
environment can also be done in the future; this is essential to get a complete picture of
these new technologies. New use cases can also be developed to utilize the
power of Hadoop to solve business problems.
References
1. MapReduce and Parallel DBMSs: Friends or Foes?
Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew
Pavlo, and Alexander Rasin
Communications of the ACM, January 2010
2. Web warehousing: Web technology meets data warehousing
Xin Tan, David C. Yen, Xiang Fang
3. Clouds, big data, and smart assets: Ten tech-enabled business trends to watch
McKinsey Quarterly
4. Large Scale of E-learning Resources Clustering with Parallel Affinity Propagation
Wenhua Wang, Hanwang Zhang, Fei Wu, and Yueting Zhuang
City University of Hong Kong, Hong Kong, August 2008
5. Experiences with Map Reduce, an Abstraction for Large-Scale Computation
Jeff Dean Google, Inc.
6. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for
Analytical Workloads
Azza Abouzeid, Kamil Bajda-Pawlikowski, August 2009
7. Parallel Collection of Live Data Using Hadoop
Kyriacos Talattinis, Aikaterini Sidiropoulou, Konstantinos Chalkias, and George
Stephanides
Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece
8. Hive – A Petabyte Scale Data Warehouse Using Hadoop
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning
Zhang, Suresh Antony, Hao Liu and Raghotham Murthy
9. Massive Structured Data Management Solution
Ullas Nambiar, Rajeev Gupta, Himanshu Gupta and Mukesh Mohania
IBM Research – India
10. Situational Business Intelligence
Alexander Löser, Fabian Hueske, and Volker Markl, TU Berlin
Database System and Information Management Group
11. An Intelligent Recommender System using Sequential Web Access Patterns
Alexander Löser
http://user.cs.tu-berlin.de/~aloeser
12. http://hadoop.apache.org/
13. http://en.wikipedia.org
14. http://cloudera.com
15. Slashdot.org
16. Amazon.com