© 2016, IJARCSSE All Rights Reserved Page | 368
Volume 6, Issue 8, August 2016 ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper
Available online at: www.ijarcsse.com
Survey Paper on Big Data
C. Lakshmi*, V. V. Nagendra Kumar
MCA Department, RGMCET, Nandyal,
Andhra Pradesh, India
Abstract— Big data is the term for any collection of datasets so large and complex that it becomes difficult to process using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy. Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from datasets that are diverse, complex, and of a massive scale. A big data environment is used to acquire, organize and analyze these various types of data. Data that is so large in volume, so diverse in variety, or moving with such velocity is called big data. Analyzing big data is a challenging task, as it involves large distributed file systems that must be fault tolerant, flexible and scalable. The technologies used by big data applications to handle massive data include Hadoop, Map Reduce, Apache Hive, NoSQL and HPCC. First, we present the definition of big data and discuss big data challenges. Next, we present a systematic framework that decomposes big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics; together these four modules form a big data value chain. Following that, we present a detailed survey of materials and methods used in the research and industry communities. In addition, we present the prevalent Hadoop framework for addressing big data. Finally, we outline big data system architecture and present key research challenges and directions for big data systems.
Keywords— Big Data, Hadoop, Map Reduce, Apache Hive, NoSQL, HPCC
I. INTRODUCTION
Big data is one of the biggest buzz phrases in the IT domain. New personal-communication technologies are driving the big data trend, and the Internet population grows day by day, though it has not yet reached 100% penetration. The need for big data arose at large companies such as Facebook, Yahoo, Google and YouTube, which must analyze enormous amounts of data in unstructured or even structured form. Google alone holds a vast amount of information, so there is a need for big data analytics, the processing of complex and massive datasets. This data differs from structured data in terms of five parameters: variety, volume, value, veracity and velocity (the 5 V's). These five V's are the challenges of big data management:
1. Volume:
Data of all types is ever-growing day by day, from kilobytes and megabytes through terabytes, petabytes, zettabytes and yottabytes of information, and it accumulates into very large files. Excessive volume is the main storage issue, addressed chiefly by reducing storage cost. Data volumes are expected to grow 50 times by 2020.
2. Variety:
Data sources are extremely heterogeneous. Files arrive in various formats and of any type; they may be structured or unstructured, such as text, audio, video, log files and more. The varieties are endless, and the data enters the network without having been quantified or qualified in any way.
3. Velocity:
Data arrives at high speed. Sometimes one minute is too late, so big data is time sensitive. For some organisations, data velocity is the main challenge: social media messages and credit card transactions are completed in milliseconds, and the data they generate must be put into databases just as quickly.
4. Value:
Value is the most important V in big data, because it is what makes it worthwhile for businesses and IT infrastructure systems to store such large amounts of data in databases.
5. Veracity:
Veracity refers to the increase in the range of values typical of a large dataset. When dealing with high volume, velocity and variety, not all of the data will be 100% correct; there will be dirty data. Big data and analytics technologies are built to work with these types of data.
Huge volumes of data, both structured and unstructured, must be managed by organizations through administration and governance. Unstructured data is data that is not held in a database; it may be textual, verbal, or in another form. Textual unstructured data includes PowerPoint presentations, email messages, Word documents, and instant messages. Data in other formats includes .jpg and .png images and audio files.
Lakshmi et al., International Journal of Advanced Research in Computer Science and Software Engineering 6(8),
August- 2016, pp. 368-381
Fig. 1 Parameters of Big Data
Fig. 2 illustrates a general big data network model with MapReduce. Distinct applications in the cloud place demanding requirements on the acquisition, transportation and analytics of structured and unstructured data.
The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy
violations. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of
related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to
"spot business trends, prevent diseases, and combat crime and so on". Scientists regularly encounter limitations due to
large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and
biological and environmental research. The limitations also affect Internet search, finance and business informatics. Data
sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices,
aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification
(RFID) readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes (2.5×10^18 bytes) of data were created every day.
The challenge for large enterprises is determining who should own big data initiatives that straddle the entire
organization.
Fig. 2 General Framework of Big Data Networking
The history of big data can be roughly split into the following stages:
Megabyte to Gigabyte: In the 1970s and 1980s, historical business data introduced the earliest "big data" challenge in moving from megabyte to gigabyte sizes. The urgent need at that time was to house that data and run relational queries for business analyses and reporting. Research efforts gave birth to the "database machine", which featured integrated hardware and software to solve the problem. The underlying philosophy was that such integration would provide better performance at lower cost. After a period of time, it became clear that hardware-specialized database machines could not keep pace with the progress of general-purpose computers. Thus, the descendant database systems are software systems that impose few constraints on hardware and can run on general-purpose computers.
Gigabyte to Terabyte: In the late 1980s, the popularization of digital technology caused data volumes to expand to several gigabytes or even a terabyte, beyond the storage and/or processing capabilities of a single large computer system. Data parallelization was proposed to extend storage capabilities and to improve performance by distributing data and related tasks, such as building indexes and evaluating queries, across disparate hardware. Based on this idea, several types of parallel databases were built, including shared-memory, shared-disk, and shared-nothing databases, each induced by its underlying hardware architecture; of the three, the shared-nothing architecture scaled best on clusters of commodity machines.
Terabyte to Petabyte: During the late 1990s, when the database community was admiring its "finished" work on the parallel database, the rapid development of Web 1.0 led the whole world into the Internet era, along with massive semi-structured or unstructured web pages holding terabytes or petabytes (PBs) of data. The resulting need for search companies was to index and query the mushrooming content of the web. Unfortunately, although parallel databases handle structured data well, they provide little support for unstructured data. Additionally, system capacities were limited to a few terabytes.
Petabyte to Exabyte: Under current development trends, data stored and analyzed by big companies will undoubtedly reach PB to exabyte magnitude soon. However, current technology still handles only terabyte-to-PB scale data; no revolutionary technology has yet been developed to cope with larger datasets.
II. LITERATURE SURVEY
1 Hadoop Map Reduce is a large-scale, open source software framework dedicated to scalable, distributed, data-intensive computing. The framework breaks large data into smaller parallelizable chunks and handles scheduling: it maps each piece to an intermediate value, reduces the intermediate values to a solution, and supports user-specified partition and combiner options. It is fault tolerant and reliable, and supports thousands of nodes and petabytes of data. If an algorithm can be rewritten as MapReduce jobs, and the problem can be broken into small pieces solvable in parallel, then Hadoop's Map Reduce is the way to go for a distributed problem-solving approach to large datasets; it is tried and tested in production and offers many implementation options. The authors present the design and evaluation of a data-aware cache framework that requires minimal change to the original Map Reduce programming model in order to provision incremental processing for big data applications using the Map Reduce model [4].
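The data-aware caching idea of [4] can be sketched as follows: map outputs are cached per input chunk, so when new data is appended only the unseen chunks are re-mapped. The Python sketch below is an illustrative simplification under that assumption, not the actual Dache implementation; the function names and chunk format are invented for the example.

```python
from collections import Counter

map_cache = {}  # chunk -> cached map output (the data-aware cache)

def map_chunk(chunk):
    # The (expensive) map work runs once per distinct chunk; repeat
    # requests for the same chunk are served from the cache
    if chunk not in map_cache:
        map_cache[chunk] = Counter(chunk.split())
    return map_cache[chunk]

def incremental_word_count(chunks):
    # Reduce phase: merge the per-chunk counts (cached or fresh)
    total = Counter()
    for chunk in chunks:
        total += map_chunk(chunk)
    return total

counts = incremental_word_count(["a b a", "b c"])
# Appending one chunk re-maps only the new chunk; the first two hit the cache
counts = incremental_word_count(["a b a", "b c", "c c"])
print(counts["c"])  # 3
```

When the input grows incrementally, only the appended chunks incur map-phase cost; this is the essence of reusing intermediate results for incremental processing.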
2 The author [2] discussed the importance of some of the technologies that handle big data, such as Hadoop, HDFS and Map Reduce, described the various schedulers used in Hadoop along with its technical aspects, and also focused on the importance of YARN, which overcomes the limitations of Map Reduce.
3 The author [3] surveyed various technologies for handling big data and their architectures, and also discussed the challenges of big data (volume, variety, velocity, value, veracity) and the advantages and disadvantages of these technologies. The paper discussed an architecture using Hadoop HDFS for distributed data storage, real-time NoSQL databases, and MapReduce for distributed data processing over a cluster of commodity servers. The main goal of that paper was to survey big data handling techniques that manage a massive amount of data from different sources and improve the overall performance of systems.
4 The author [1] continues with the big data definition and enhances the definition given in [3] to include the 5V big data properties, volume, variety, velocity, value and veracity, and suggests other dimensions for big data analysis and taxonomy, in particular comparing and contrasting big data technologies in e-Science, industry, business, social media and healthcare. With a long tradition of working with constantly increasing volumes of data, modern e-Science can offer industry its scientific analysis methods, while industry can bring advanced and rapidly developing big data technologies and tools to science and the wider public [1].
5 The author [6] stated that the need to process enormous quantities of data has never been greater. Not only are terabyte- and petabyte-scale datasets rapidly becoming commonplace, but there is consensus that great value lies buried in them, waiting to be unlocked by the right computational tools. In the commercial sphere, business intelligence is driven by the ability to gather data from a dizzying array of sources, and big data analysis tools such as Map Reduce over Hadoop and HDFS promise to help organizations better understand their customers and the marketplace, hopefully leading to better business decisions and competitive advantages [3].
6 The author [5] stated that there is a need to maximize returns on BI investments and to overcome difficulties. The problems and new trends mentioned in that article, and the solutions found by combining advanced tools, techniques and methods, would help readers in BI projects and implementations. BI vendors are struggling and making a continuous effort to deliver technical capabilities and to provide complete out-of-the-box solutions with sets of tools and techniques. In 2014, due to the rapid change in BI maturity, BI teams faced a tough time building infrastructure with less-skilled resources. Consolidation and convergence are ongoing, and the market is coming up with a wide range of new technologies; still, the ground is immature and in a state of rapid evolution.
7 The author [8] gave some important emerging framework model designs for big data analytics and a 3-tier architecture model for big data in data mining. The proposed 3-tier architecture model is more scalable when working in different environments and also helps overcome the main issues in big data analytics of storing, analyzing, and visualizing. The framework model is given for Hadoop HDFS distributed data storage, real-time NoSQL databases, and MapReduce distributed data processing over a cluster of commodity servers.
8 A big data framework needs to consider complex relationships between samples, models and data sources, along with their evolution over time and other possible factors. High-performance computing platforms are required to support big data mining. With big data technologies [3] we will hopefully be able to provide the most relevant and most accurate social-sensing feedback to better understand our society in real time [7].
9 There are many scheduling techniques available to improve job performance, but all of them have some limitations, so no single technique can overcome every parameter affecting the performance of the whole system, such as data locality, fairness, load balance, the straggler problem and deadline constraints. Each technique has advantages over the others, so if we combine or interchange techniques, the result will be even better than any individual scheduling technique [10].
10 The author [9] describes the concept of big data along with its 3 Vs: volume, velocity and variety. The paper also focuses on big data processing problems. These technical challenges must be addressed for efficient and fast processing of big data. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. These technical challenges are common across a large variety of application domains, and are therefore not cost-effective to address in the context of one domain alone. The paper describes Hadoop, open source software used for processing big data.
11 The system proposed by [12] is based on an implementation of online aggregation for Map Reduce in Hadoop for efficient big data processing. Traditional MapReduce implementations materialize the intermediate results of mappers and do not allow pipelining between the map and the reduce phases. This approach has the advantage of simple recovery in the case of failures; however, reducers cannot start executing tasks before all mappers have finished. As MapReduce Online is a modified version of Hadoop Map Reduce, it supports online aggregation and stream processing, while also improving utilization and reducing response time.
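The pipelining and online-aggregation behaviour described above can be sketched with a Python generator that emits a running aggregate after every input record, instead of waiting for all map work to finish. This is a single-process illustration of the idea, not MapReduce Online's actual code.

```python
from collections import Counter

def online_word_count(lines):
    # Pipelined map -> reduce: intermediate pairs flow into the running
    # aggregate immediately, yielding an early-result snapshot per line
    running = Counter()
    for line in lines:
        for word in line.split():   # map phase, emitted incrementally
            running[word] += 1      # reduce phase, applied eagerly
        yield dict(running)         # online-aggregation snapshot

snapshots = list(online_word_count(["big data", "big big"]))
print(snapshots[0])   # {'big': 1, 'data': 1} -- available before input ends
print(snapshots[-1])  # {'big': 3, 'data': 1}
```

The first snapshot is available as soon as the first record is processed, which is exactly what a materializing, batch-only implementation cannot offer.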
12 The author [11], learning from application studies, explores the design space for supporting data-intensive and compute-intensive applications on large data-center-scale computer systems. Traditional data processing and storage approaches face many challenges in meeting the continuously increasing computing demands of big data. This work focused on Map Reduce, one of the key enabling approaches for meeting big data demands by means of highly parallel processing on a large number of commodity nodes.
III. TECHNOLOGIES AND METHODS
Big data is a new concept for handling massive data, so the architectural description of this technology is very new. The different technologies all use almost the same approach: distribute the data among various local agents and reduce the load on the main server so that congestion can be avoided. There are endless articles, books and periodicals that describe big data from a technology perspective, so we will instead focus our efforts here on setting out some basic principles and the minimum technology foundation needed to relate big data to the broader IM domain.
A. Hadoop
Hadoop is a framework that can run applications on systems with thousands of nodes and terabytes of data. It distributes files among the nodes and allows the system to continue working in case of a node failure. This approach reduces the risk of catastrophic system failure. An application is broken into smaller parts (fragments or blocks). Apache Hadoop consists of the Hadoop kernel, the Hadoop distributed file system (HDFS) and MapReduce; related projects are ZooKeeper, HBase and Apache Hive. The Hadoop Distributed File System consists of three components: the Name Node, Secondary Name Node and Data Node.
The multilevel secure (MLS) environmental problems of Hadoop can be addressed by using the Security-Enhanced Linux (SELinux) protocol, in which multiple sources of Hadoop applications run at different security levels; this protocol is an extension of the Hadoop distributed file system. Hadoop is commonly used for distributed batch index building, and it is desirable to optimize the indexing capability to near real time. Hadoop provides components for storage and analysis for large-scale processing, and nowadays it is used by hundreds of companies.
The advantages of Hadoop are its distributed storage and computational capabilities; it is extremely scalable, optimized for high throughput with large block sizes, and tolerant of software and hardware failure.
Fig. 3 Architecture of Hadoop
Components of Hadoop:
HBase: It is an open source, distributed, non-relational database system implemented in Java. It runs above the layer of HDFS and can serve the input and output for Map Reduce in a well-structured manner.
Oozie: Oozie is a web application that runs in a Java servlet. Oozie uses a database to store the definition of a workflow, which is a collection of actions; it manages Hadoop jobs in an orderly way.
Sqoop: Sqoop is a command-line interface application that provides a platform for transferring data between relational databases and Hadoop, in either direction.
Avro: It is a system that provides data serialization and data exchange services, used primarily in Apache Hadoop. These services can be used together or independently, according to the data records.
Chukwa: Chukwa is a framework for data collection and analysis, used to process and analyze massive amounts of logs. It is built on top of HDFS and the Map Reduce framework.
Pig: Pig is a high-level platform built over the MapReduce framework and used with Hadoop. It is a high-level data processing system in which data records are analyzed using a high-level language.
Zookeeper: It is a centralized service that provides distributed synchronization and group services, along with maintenance of configuration information and records.
Hive: It is a data warehouse application that provides an SQL-like interface as well as a relational model. The Hive infrastructure is built on top of Hadoop and helps provide summarization and analysis for queries.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Doug Cutting, who was working at Yahoo! at the
time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine
project. Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers.
Hadoop is:
Reliable: The software is fault tolerant; it expects and handles hardware and software failures.
Scalable: Designed for massive scale across processors, memory, and locally attached storage.
Distributed: Handles replication and offers a massively parallel programming model, Map Reduce.
Fig. 4 Hadoop System
Hadoop is an open source implementation of a large-scale batch processing system. It uses the Map-Reduce framework introduced by Google, leveraging the concept of map and reduce functions well known from functional programming. Although the Hadoop framework is written in Java, it allows developers to deploy custom-written programs, coded in Java or any other language, to process data in a parallel fashion across hundreds or thousands of commodity servers. It is optimized for contiguous read requests (streaming reads), where processing consists of scanning all the data. Depending on the complexity of the process and the volume of data, response time can vary from minutes to hours. Hadoop processes data in batch rather than interactively; its key advantage is its massive scalability.
Hadoop is currently being used for indexing web searches, email spam detection, recommendation engines, prediction in financial services, genome manipulation in life sciences, and analysis of unstructured data such as logs, text, and clickstreams. While many of these applications could in fact be implemented in a relational database (RDBMS), the core of the Hadoop framework is functionally different from an RDBMS. Hadoop is particularly useful when:
- Complex information processing is needed.
- Unstructured data needs to be turned into structured data.
- Queries cannot be reasonably expressed using SQL.
- Heavily recursive algorithms are involved.
- Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing.
- Machine learning is required.
- Data sets are too large to fit into database RAM or discs, or require too many cores (tens of TB up to PB).
- The data's value does not justify the expense of constant real-time availability, such as archives or special-interest information, which can be moved to Hadoop and remain available at lower cost.
- Results are not needed in real time.
- Fault tolerance is critical.
- Significant custom coding would otherwise be required to handle job scheduling.
Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into
numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug
Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. The current Apache Hadoop
ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop distributed file system (HDFS) and a number of
related projects such as Apache Hive, HBase and Zookeeper. The Hadoop framework is used by major players including
Google, Yahoo and IBM, largely for applications involving search engines and advertising. The preferred operating
systems are Windows and Linux but Hadoop can also work with BSD and OS X.
HDFS
The Hadoop Distributed File System (HDFS) is the file system component of the Hadoop framework. HDFS is designed and optimized to store data across a large number of low-cost machines in a distributed fashion.
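The storage scheme can be illustrated with a toy placement routine: a file is split into fixed-size blocks, and each block is replicated on distinct nodes. The 128 MB block size and replication factor of 3 are common HDFS defaults, but the round-robin placement below is a deliberate simplification of HDFS's real rack-aware policy, written only to make the idea concrete.

```python
def place_blocks(file_size, nodes, block_size=128 * 1024 * 1024, replicas=3):
    # Split the file into block_size chunks, then assign each block's
    # replicas to `replicas` distinct nodes, round-robin over the cluster
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replicas)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = place_blocks(file_size=300 * 1024 * 1024, nodes=nodes)
print(len(plan))  # 3 blocks for a 300 MB file
print(plan[0])    # ['dn1', 'dn2', 'dn3']
```

A 300 MB file yields three 128 MB-aligned blocks, and losing any single node still leaves two replicas of every block, which is what makes node failure survivable.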
Name Node:
The name node is the master node; it holds the metadata about all the data nodes: their addresses (used to communicate with them), their free space, the data they store, which data nodes are active and which are passive, the task tracker and job tracker, and other configuration such as the replication of data.
Fig. 5 HDFS Architecture
The NameNode records all of the metadata, attributes, and locations of files and data blocks in the DataNodes. The attributes it records are things like file permissions, file modification and access times, and the namespace, which is a hierarchy of files and directories. The NameNode maps the namespace tree to file blocks in DataNodes. When a client node wants to read a file in HDFS, it first contacts the NameNode to receive the locations of the data blocks associated with that file.
A NameNode stores information about the overall system because it is the master of the HDFS with the DataNodes
being the slaves. It stores the image and journal logs of the system, and must always store the most up-to-date image and journal. Basically, the NameNode always knows where the data blocks and replicas are for each file, and it also knows where the free blocks are in the system, so it keeps track of where future files can be written.
Data Node:
The data node is a slave node in Hadoop; it stores the data, and its task tracker tracks the jobs running on the data node, including the jobs coming from the name node.
The DataNodes store the blocks and block replicas of the file system. During startup each DataNode connects and
performs a handshake with the NameNode. The DataNode verifies the namespace ID, and if it does not match, the DataNode automatically shuts down. New DataNodes can join the cluster by simply registering with the NameNode
and receiving the namespace ID. Each DataNode keeps track of a block report for the blocks in its node. Each DataNode
sends its block report to the NameNode every hour so that the NameNode always has an up to date view of where block
replicas are located in the cluster. During normal operation of HDFS, each DataNode also sends a heartbeat to the
NameNode every ten minutes so that the NameNode knows which DataNodes are operating correctly and are available.
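The heartbeat protocol described above can be sketched as the liveness check a NameNode might perform. The ten-minute timeout follows the interval given in the text, and the node names and timestamps are illustrative.

```python
def live_datanodes(last_heartbeat, now, timeout=600):
    # A DataNode counts as available if its most recent heartbeat
    # arrived within `timeout` seconds (ten minutes, per the text)
    return sorted(n for n, t in last_heartbeat.items() if now - t <= timeout)

heartbeats = {"dn1": 1000.0, "dn2": 350.0, "dn3": 990.0}
print(live_datanodes(heartbeats, now=1000.0))  # ['dn1', 'dn3'] -- dn2 timed out
```

A node that misses the window (dn2 here, last heard from 650 s ago) is treated as unavailable, and the NameNode can schedule re-replication of its blocks elsewhere.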
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines,
providing very high aggregate bandwidth across the cluster.
Hadoop MapReduce – a programming model for large scale data processing.
Fig. 6 Hadoop Ecosystem
All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and thus
should be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components
originally derived respectively from Google's MapReduce and Google File System (GFS) papers. "Hadoop" often refers
not to just the base Hadoop package but rather to the Hadoop Ecosystem fig.6 which includes all of the additional
software packages that can be installed on top of or alongside Hadoop, such as Apache Hive, Apache Pig and Apache
Spark.
B. Map Reduce:
Map-Reduce was introduced by Google in order to process and store large datasets on commodity hardware. Map Reduce is a model for processing large-scale data records in clusters. The Map Reduce programming model is based on two functions: a map() function and a reduce() function. Users implement their own processing logic through well-defined map() and reduce() functions. In the map phase, the master node takes the input, divides it into smaller sub-problems and distributes them to slave nodes; a slave node may further divide a sub-problem, leading to a hierarchical tree structure. Each slave node processes its base problem and passes the result back to the master node. The Map Reduce system groups all intermediate pairs by their intermediate keys and passes them to the reduce() function to produce the final output: the master node collects the results from all the sub-problems and combines them to form the output.
map(in_key, in_value) -> list(out_key, intermediate_value)
reduce(out_key, list(intermediate_value)) -> list(out_value)
The parameters of the map() and reduce() functions are as follows:
map(k1, v1) -> list(k2, v2) and reduce(k2, list(v2)) -> list(v2)
A Map Reduce framework is based on a master-slave architecture where one master node handles a number of slave nodes. Map Reduce works by first dividing the input data set into even-sized data blocks for equal load distribution. Each data block is then assigned to one slave node, where it is processed by a map task and a result is generated. A slave node notifies the master node when it is idle, and the scheduler then assigns new tasks to it. The scheduler takes data locality and resources into consideration when it disseminates data blocks.
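The map() and reduce() signatures above can be made concrete with a minimal single-process word-count sketch. This is a toy stand-in for the distributed framework: the "shuffle" is just an in-memory dictionary, and the function names are invented for the example.

```python
from collections import defaultdict

def map_fn(_, line):
    # map(k1, v1) -> list(k2, v2): emit (word, 1) for every word
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # reduce(k2, list(v2)) -> list(v2): sum the partial counts
    return [sum(counts)]

def mapreduce(records):
    # Shuffle step: group intermediate pairs by key, as the framework would
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return {k: reduce_fn(k, vs)[0] for k, vs in sorted(groups.items())}

result = mapreduce([(0, "big data is big"), (1, "data is data")])
print(result)  # {'big': 2, 'data': 3, 'is': 2}
```

In the real framework, the map calls run on many slave nodes in parallel and the grouped pairs are shipped across the network to reduce tasks; the data flow, however, is exactly this.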
Fig. 7 Architecture of Map Reduce
Figure 7 shows the Map Reduce architecture and how it works. The scheduler always tries to allocate a local data block to a slave node; if that effort fails, it assigns a rack-local or random data block to the slave node instead. When the map() tasks complete, the runtime system gathers all intermediate pairs and launches a set of reduce tasks to produce the final output. Large-scale data processing is a difficult task; managing hundreds or thousands of processors and handling parallelization in a distributed environment makes it more difficult. Map Reduce provides a solution to these issues: it supports distributed and parallel I/O scheduling, it is fault tolerant and scalable, and it has built-in processes for status reporting and monitoring of the heterogeneous, large datasets found in big data. It is a way of approaching and solving a given problem; using the Map Reduce framework, the efficiency and the time to retrieve the data are quite manageable. To address the volume aspect, new techniques have been proposed to enable parallel processing using the Map Reduce framework; the data-aware caching (Dache) framework makes a slight change to the original Map Reduce programming model and framework to enhance processing for big data applications using the Map Reduce model.
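The local / rack-local / random fallback order described above can be sketched as a small scheduling function. The node names, rack labels and two-pass preference scan are illustrative assumptions, not Hadoop's actual scheduler code.

```python
def assign_block(idle_node, blocks, node_rack):
    # Prefer a block with a replica on the idle node itself (data-local),
    # then one with a replica on the same rack (rack-local), then any block
    for block, replica_nodes in blocks.items():
        if idle_node in replica_nodes:
            return block, "local"
    for block, replica_nodes in blocks.items():
        if any(node_rack[n] == node_rack[idle_node] for n in replica_nodes):
            return block, "rack-local"
    return next(iter(blocks)), "remote"

racks = {"dn1": "r1", "dn2": "r1", "dn3": "r2"}
pending = {"b7": ["dn2", "dn3"], "b9": ["dn3"]}
print(assign_block("dn1", pending, racks))  # ('b7', 'rack-local')
```

Here dn1 holds no pending replica, but block b7 has a replica on dn2 in the same rack, so the scheduler settles for a rack-local assignment rather than pulling a block across racks.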
The advantage of Map Reduce is that a large variety of problems are easily expressible as Map Reduce computations, and a cluster of machines can handle thousands of nodes with fault tolerance. The disadvantages of Map Reduce are that it offers batch rather than real-time processing, it is not always easy to implement, and data must be shuffled between phases.
Map Reduce Components:
1. Name Node: manages HDFS metadata; doesn't deal with files directly.
2. Data Node: stores blocks of HDFS; the default replication level for each block is 3.
3. Job Tracker: schedules, allocates and monitors job execution on the slaves' Task Trackers.
4. Task Tracker: runs Map Reduce operations.
Map Reduce Framework
Map Reduce is a software framework for distributed processing of large data sets on computer clusters, first developed by Google. Map Reduce is intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster. A typical Hadoop cluster integrates the MapReduce and HDFS layers. In the MapReduce layer, the job tracker assigns tasks to the task trackers: the master node's job tracker assigns tasks to the slave nodes' task trackers (fig. 8).
Fig. 8 MapReduce is based on the Master-Slave architecture
Lakshmi et al., International Journal of Advanced Research in Computer Science and Software Engineering 6(8),
August- 2016, pp. 368-381
© 2016, IJARCSSE All Rights Reserved Page | 376
The master node contains:
Job tracker node (MapReduce layer)
Task tracker node (MapReduce layer)
Name node (HDFS layer)
Data node (HDFS layer)
Each slave node contains:
Task tracker node (MapReduce layer)
Data node (HDFS layer)
The MapReduce layer thus has job and task tracker nodes, while the HDFS layer has name and data nodes.
C. Hive:
Apache Hive is the data warehousing component of the Hadoop ecosystem. It offers a query language called HiveQL
that automatically translates SQL-like queries into MapReduce jobs, so users familiar with relational systems such as
Oracle or IBM DB2 can work with big data using familiar SQL syntax. Its architecture is divided into a MapReduce-
oriented execution engine, a metastore holding metadata about the stored data, and a driver that receives queries from
users or applications for execution.
Fig. 9 Architecture of HIVE
The advantages of Hive are that it is more secure and that its implementations are good and well tuned. The
disadvantages are that it is suited only to ad hoc queries and that its performance is lower than that of Pig.
D. No-SQL:
A NoSQL database is an approach to data management and database design that is useful for very large sets of
distributed data. These databases are generally part of real-time event processing deployed on inbound channels, but can
also be seen as an enabling technology for analytical capabilities such as relative search applications. These are made
feasible by the elastic nature of the NoSQL model, where the dimensionality of a query evolves from the data in scope
and domain rather than being fixed by the developer in advance.
NoSQL is useful when an enterprise needs to access huge amounts of unstructured data. There are more than one
hundred NoSQL systems that specialize in the management of different multimodal data types (from structured to
unstructured), each aiming to solve very specific challenges. Data scientists, researchers, and business analysts in
particular favor this agile approach because it yields earlier insights into data sets that may be concealed or constrained
by a more formal development process. The most popular NoSQL database is Apache Cassandra.
The advantages of NoSQL are that it is open source, horizontally scalable, easy to use, able to store complex data
types, and very fast for adding new data and for simple operations/queries. The disadvantages are immaturity, limited
indexing support, lack of ACID transactions, complex consistency models, and absence of standardization.
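The horizontal scalability mentioned above typically comes from hash-partitioning keys across nodes, as Cassandra-style stores do. The following is a toy, in-memory sketch of that idea, not the API of any real NoSQL product; all class and method names are invented for illustration:

```python
import hashlib

class ShardedKeyValueStore:
    """Toy NoSQL-style store: keys are hash-partitioned across nodes,
    which is the basis of horizontal scalability."""

    def __init__(self, num_nodes=3):
        # Each dict stands in for one node's local storage
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # Stable hash so the same key always routes to the same node
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key, default=None):
        return self.nodes[self._node_for(key)].get(key, default)

store = ShardedKeyValueStore(num_nodes=4)
# NoSQL stores accept complex, schema-free value types
store.put("user:42", {"name": "Ada", "clicks": [1, 2, 3]})
store.put("user:99", {"name": "Lin"})
print(store.get("user:42")["name"])  # Ada
```

Adding capacity then means adding more node dicts; real systems additionally use consistent hashing and replication so that adding a node moves only a fraction of the keys.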
Fig. 10 Architecture of No SQL
E. HPCC:
HPCC is an open-source computing platform that provides services for handling massive big data workflows. The
HPCC data model is defined by the user according to the requirements. The HPCC system was designed to manage the
most complex and data-intensive analytical problems. It is a single platform with a single architecture and a single
programming language used for data processing, and it was designed to analyze gigantic amounts of data for the purpose
of solving complex big data problems. HPCC is based on Enterprise Control Language (ECL), a declarative, non-
procedural programming language. The main components of HPCC are:
HPCC Data Refinery (Thor): a massively parallel ETL engine.
HPCC Data Delivery (Roxie): a massively parallel, structured query and analysis engine.
Enterprise Control Language (ECL): distributes the workload evenly between the nodes.
IV. EXPERIMENTAL ANALYSIS
Big-Data System Architecture:
In this section, we focus on the value chain for big data analytics. Specifically, we describe a big data value chain that
consists of four stages (generation, acquisition, storage, and processing). Next, we present a big data technology map that
associates the leading technologies in this domain with specific phases in the big data value chain and a time stamp.
Big-Data System: A Value-Chain View
A big-data system is complex, providing functions to deal with different phases in the digital data life cycle, ranging
from its birth to its destruction. At the same time, the system usually involves multiple distinct phases for different
applications. In this case, we adopt a systems-engineering approach, well accepted in industry, to decompose a typical
big-data system into four consecutive phases, including data generation, data acquisition, data storage, and data analytics,
as illustrated in the horizontal axis of Fig. 3. Notice that data visualization is an assistive method for data analysis: in
general, one visualizes data to find rough patterns first, and then employs specific data mining methods.
The details for each phase are explained as follows.
Data generation concerns how data are generated. In this case, the term ``big data'' is designated to mean large,
diverse, and complex datasets that are generated from various longitudinal and/or distributed data sources, including
sensors, video, click streams, and other available digital sources.
Normally, these datasets are associated with different levels of domain-specific values. In this paper, we focus on
datasets from three prominent domains, business, Internet, and scientific research, for which values are relatively easy to
understand. However, there are overwhelming technical challenges in collecting, processing, and analyzing these
datasets that demand new solutions to embrace the latest advances in the information and communications technology
(ICT) domain.
Data acquisition refers to the process of obtaining information and is subdivided into data collection, data
transmission, and data pre-processing. First, because data may come from a diverse set of sources, such as websites that
host formatted text, images, and/or videos, data collection refers to dedicated data collection technology that acquires
raw data from a specific data production environment. Second, after collecting raw data, we need a high-speed
transmission mechanism to transmit the data into the proper storage system, sustaining various types of analytical
applications.
Fig. 11 Big data technology map. It pivots on two axes, i.e., data value chain and timeline. The data value chain
divides the data lifecycle into four stages, including data generation, data acquisition, data storage, and data analytics. In
each stage, we highlight exemplary technologies over the past 10 years.
Finally, collected datasets might contain much meaningless data, which unnecessarily increases the required storage
space and affects the subsequent data analysis. For instance, redundancy is common in most datasets collected from
sensors deployed to monitor the environment, and we can use data compression technology to address this issue. Thus,
we must perform data pre-processing operations for efficient storage and mining.
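The sensor-redundancy point can be made concrete with a small experiment. The sketch below uses a simulated stream of mostly-repeated environmental readings (the values and sizes are invented for illustration) and standard lossless compression as the pre-processing step:

```python
import zlib

# Simulated sensor stream: environmental monitors mostly report the same
# reading, so the raw data is highly redundant.
readings = [20.0] * 990 + [20.5] * 10
raw = ",".join(str(r) for r in readings).encode()

compressed = zlib.compress(raw)
ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {ratio:.1f}x")

# Lossless round trip: redundancy is removed without sacrificing any value
assert zlib.decompress(compressed) == raw
```

On highly repetitive streams such as this one, the compression ratio is large, which is exactly why redundancy reduction pays off in storage overhead; for less redundant data the gain shrinks accordingly.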
Data storage concerns persistently storing and managing large-scale datasets. A data storage system can be divided
into two parts: hardware infrastructure and data management. Hardware infrastructure consists of a pool of shared ICT
resources organized in an elastic way for various tasks in response to their instantaneous demand. The hardware
infrastructure should be able to scale up and out and be dynamically reconfigurable to address different types of
application environments. Data management software is deployed on top of the hardware infrastructure to maintain
large-scale datasets. Additionally, to analyze or interact with the stored data, storage systems must provide several
interface functions, fast querying, and other programming models.
Data analysis leverages analytical methods or tools to inspect, transform, and model data to extract value. Many
application fields leverage opportunities presented by abundant data and domain-specific analytical methods to derive
the intended impact. Although various fields pose different application requirements and data characteristics, a few of
these fields may leverage similar underlying technologies. Emerging analytics research can be classified into six critical
technical areas: structured data analytics, text analytics, multimedia analytics, web analytics, network analytics, and
mobile analytics.
Big-Data System: A Layered View:
Alternatively, the big data system can be decomposed into a layered structure, as illustrated in Fig. 12.
Fig. 12 Layered architecture of a big data system. It can be decomposed into three layers, including the infrastructure
layer, computing layer, and application layer, from bottom to top.
The layered structure is divisible into three layers, i.e., the infrastructure layer, the computing layer, and the
application layer, from bottom to top. This layered view only provides a conceptual hierarchy to underscore the
complexity of a big data system. The function of each layer is as follows.
The infrastructure layer consists of a pool of ICT resources, which can be organized by cloud computing
infrastructure and enabled by virtualization technology. These resources are exposed to upper-layer systems in a fine-
grained manner with a specific service-level agreement (SLA). Within this model, resources must be allocated to meet
the big data demand while achieving resource efficiency by maximizing system utilization, energy awareness,
operational simplification, etc.
The computing layer encapsulates various data tools into a middleware layer that runs over raw ICT resources. In
the context of big data, typical tools include data integration, data management, and the programming model. Data
integration means acquiring data from disparate sources and integrating the dataset into a unified form with the necessary
data pre-processing operations. Data management refers to mechanisms and tools that provide persistent data storage and
highly efficient management, such as distributed file systems and SQL or NoSQL data stores. The programming model
implements application-logic abstractions and facilitates the data analysis applications. MapReduce, Dryad, Pregel, and
Dremel exemplify programming models.
The application layer exploits the interface provided by the programming models to implement various data
analysis functions, including querying, statistical analyses, clustering, and classification; then, it combines basic
analytical methods to develop various field-related applications. McKinsey presented five potential big data application
domains: health care, public sector administration, retail, global manufacturing, and personal location data.
Big-Data System Challenges:
Designing and deploying a big data analytics system is not a trivial or straightforward task. As one of its definitions
suggests, big data is beyond the capability of current hardware and software platforms. The new hardware and software
platforms in turn demand new infrastructure and models to address the wide range of challenges of big data. Recent
works have discussed potential obstacles to the growth of big data applications. In this paper, we strive to classify these
challenges into three categories: data collection and management, data analytics, and system issues.
Data collection and management addresses massive amounts of heterogeneous and complex data. The following
challenges of big data must be met:
Data Representation: Many datasets are heterogeneous in type, structure, semantics, organization, granularity, and
accessibility. A competent data representation should be designed to reflect the structure, hierarchy, and diversity of the
data, and an integration technique should be designed to enable efficient operations across different datasets.
Redundancy Reduction and Data Compression: Typically, there is a large amount of redundant data in raw
datasets. Redundancy reduction and data compression without sacrificing potential value are efficient ways to lessen
overall system overhead.
Data Life-Cycle Management: Pervasive sensing and computing are generating data at an unprecedented rate and
scale, far exceeding the comparatively slower advances in storage system technologies. One of the urgent challenges is
that current storage systems cannot host the massive data. In general, the value concealed in big data depends on data
freshness; therefore, we should set up a data-importance principle, tied to the analytical value, to decide which parts of
the data should be archived and which parts should be discarded.
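Such a data-importance principle can be sketched as a simple retention policy. The thresholds and the keep/archive/discard tiers below are illustrative assumptions, not taken from the paper:

```python
def retention_decision(record_age_days, analysis_value,
                       max_hot_days=30, value_threshold=0.5):
    """Toy data-importance policy: fresh data stays online, stale but
    high-value data is archived, and the rest is discarded.
    All thresholds here are hypothetical."""
    if record_age_days <= max_hot_days:
        return "keep"       # fresh: value depends on freshness, keep online
    if analysis_value >= value_threshold:
        return "archive"    # stale but valuable: move to cheap cold storage
    return "discard"        # stale and low-value: reclaim the space

print(retention_decision(5, 0.2))    # keep
print(retention_decision(90, 0.8))   # archive
print(retention_decision(400, 0.1))  # discard
```

A production policy would of course score value per dataset (access frequency, regulatory requirements, lineage), but the shape of the decision is the same.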
Data Privacy and Security: With the proliferation of online services and mobile phones, privacy and security
concerns regarding accessing and analyzing personal information are growing. It is critical to understand what support
for privacy must be provided at the platform level to eliminate privacy leakage and to facilitate various analyses.
There will be a significant impact resulting from advances in big data analytics, including interpretation, modeling,
prediction, and simulation. Unfortunately, massive amounts of data, heterogeneous data structures, and diverse
applications present tremendous challenges, such as the following.
Approximate Analytics: As data sets grow and the real-time requirement becomes stricter, analysis of the entire
dataset is becoming more difficult. One way to potentially solve this problem is to provide approximate results, such as
by means of an approximate query. The notion of approximation has two dimensions: the accuracy of the result and the
groups omitted from the output.
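A minimal form of approximate analytics is answering an aggregate query from a uniform random sample instead of the full dataset. The sketch below uses synthetic data (the dataset and sample sizes are invented for illustration) to show the accuracy/work trade-off:

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Full dataset: in practice too large to scan within a real-time budget
# (here it is merely simulated with synthetic values)
population = [random.gauss(100.0, 15.0) for _ in range(100_000)]

def approximate_mean(data, sample_size=10_000):
    # A uniform random sample trades a bounded accuracy loss for
    # an order of magnitude less work than a full scan
    sample = random.sample(data, sample_size)
    return sum(sample) / len(sample)

exact = sum(population) / len(population)
approx = approximate_mean(population)
print(f"exact={exact:.2f} approx={approx:.2f} error={abs(exact - approx):.3f}")
```

With a 10% sample, the standard error of the mean is roughly the population standard deviation divided by the square root of the sample size, so the answer is close while only a fraction of the data is touched; production systems such as approximate query engines additionally report confidence bounds with each result.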
Connecting Social Media: Social media possesses unique properties, such as vastness, statistical redundancy, and
the availability of user feedback. Various extraction techniques have been successfully used to identify references from
social media to specific product names, locations, or people on websites. By connecting inter-field data with social
media, applications can achieve high levels of precision and distinct points of view.
Deep Analytics: One of the drivers of excitement around big data is the expectation of gaining novel insights.
Sophisticated analytical technologies, such as machine learning, are necessary to unlock such insights. However,
effectively leveraging these analysis toolkits requires an understanding of probability and statistics. In addition, the
potential pillars of privacy and security mechanisms are mandatory access control and secure communication, multi-
granularity access control, privacy-aware data mining and analysis, and secure storage and management.
Finally, large-scale parallel systems generally confront several common issues; however, the emergence of big data
has amplified the following challenges in particular.
Energy Management: The energy consumption of large-scale computing systems has attracted growing concern
from economic and environmental perspectives. Data transmission, storage, and processing will inevitably consume
progressively more energy as data volume and analytics demand increase. Therefore, system-level power control and
management mechanisms must be considered in a big data system, while continuing to provide extensibility and
accessibility.
Scalability: A big data analytics system must be able to support very large datasets created now and in the future.
All the components in big data systems must be capable of scaling to address the ever-growing size of complex datasets.
Collaboration: Big data analytics is an interdisciplinary research field that requires specialists from multiple
professional fields collaborating to mine hidden values. A comprehensive big data cyberinfrastructure is necessary to
allow broad communities of scientists and engineers to access the diverse data, apply their respective expertise, and
cooperate to accomplish the goals of analysis.
V. CONCLUSIONS
In this paper we have surveyed various technologies for handling big data, along with their architectures. We have
also discussed the challenges of big data (volume, variety, velocity, value, veracity) and the advantages and
disadvantages of these technologies. The paper discussed an architecture using Hadoop HDFS for distributed data
storage, real-time NoSQL databases, and MapReduce for distributed data processing over a cluster of commodity
servers. The main goal of our paper was to survey various big data handling techniques that handle massive amounts of
data from different sources and improve the overall performance of systems.