International Journal of Software Engineering and Its Applications
Vol.8, No.8 (2014), pp. 73-82
http://dx.doi.org/10.14257/ijseia.2014.8.8.08
ISSN: 1738-9984 IJSEIA
Copyright © 2014 SERSC
A Study on Big Data Processing Mechanism & Applicability
Byung-Tae Chun¹ and Seong-Hoon Lee²
¹Computer System Institute, Hankyong National University, 327, Chungang-no, Anseong-si, Kyonggi-do, Korea
²Division of Information & Communication, Baekseok University, 115, Anseo-dong, Cheonan, Choongnam, Korea
[email protected]
Abstract
Technologies in the information and communication field continue to progress. Because of IT, our society exhibits two prospective properties: first, the degree of convergence is accelerating; second, the regions of convergence are expanding, and efforts toward convergence will continue. Because of these properties, various types of devices have appeared in our lives, such as smartphones, tablet PCs, and game consoles. Through these many devices, various types of data are produced. In this paper, we describe the applicability of big data and analyze big data processing models.
Keywords: Big data, Applicability, Convergence, Hadoop
1. Introduction
"The quantity of data that a baby born today will produce in a lifetime is 70 times larger than that currently stored in the U.S. Library of Congress." "While one piece of information is stored, there are other pieces of information that are not stored." YouTube video clips are uploaded at a rate of one every 60 seconds. The statements above illustrate the emergence of big data.
Today, the hot issues in the IT industry include big data, cloud computing, and convergence. Gartner, a research and consulting firm, released the 10 major technologies and trends, such as the war of mobile devices and strategic big data, that companies should cope with in 2013. Gartner predicted that in 2013, mobile phones would overtake PCs as the most widely used web access device all over the world, and that by 2015, smartphones would account for more than 80% of all mobile phones sold in advanced countries [1].
Gartner also predicted that personal clouds would replace PCs as the space where individuals store personal content, access preferred services and objects, and center their digital lives. In addition, many organizations would provide their employees with mobile apps through exclusive app stores, and big data would be treated as part of companies' strategic information architectures rather than as mere individual projects.
When the term 'big data' first appeared, its meaning was interpreted differently. One group defines it as "terabyte-scale data," and another defines it as "an architecture for processing a large quantity of data." Since the meaning of the word "big" itself is relative, however, it would not be appropriate to define an absolute standard for the data capacity.
Big data is so large compared to existing data that it is difficult to collect, store, and analyze the structured or unstructured data using existing methods. McKinsey, one of the global consulting firms, defined big data in a 2011 report [2] as "a dataset that exceeds the capacity of existing database management tools in data collection, storage, management, and analysis," stating that "the definition is subjective and
will continue to change." The traditional concept of data and the characteristics of big data, which is now in the spotlight, are compared in Table 1 below:
Table 1. Traditional Data vs. Big Data

  Traditional data                    Big data
  ----------------------------------  --------------------------------
  Gigabytes to terabytes              Petabytes to exabytes
  Centralized                         Distributed
  Structured                          Semi-structured and unstructured
  Stable data model                   Flat schemas
  Known complex interrelationships    Few complex interrelationships

The element technologies of big data address data volume, data input/output velocity, and data variety. Figure 1 shows these three element technologies. 'Volume' refers to data on the order of tens of terabytes or tens of petabytes. 'Velocity' refers to the fast processing and analysis of large-capacity data: in a convergence environment, digital data is produced at high speed, so the system must be capable of collecting, storing, distributing, and analyzing it in real time. 'Variety' indicates that there are various types of data, which can be classified into structured, semi-structured, and unstructured data sets depending on the form of their structure. Table 2 shows the three types of big data.

Table 2. Types of Big Data

  Structured       Data stored in a fixed field; e.g., relational databases and spreadsheets
  Semi-structured  Data not stored in a fixed field but carrying metadata or a schema; e.g., XML and HTML text
  Unstructured     Data not stored in a fixed field; e.g., text documents and image/video/voice data

Figure 1. Three Element Technologies of Big Data (volume, velocity, variety)

It is reported that the quantity of data handled around the world today doubles every two years [3, 5-8]. As IT converges with other industry sectors and a tremendous amount of data is generated, the utilization of big data has become a great issue in addressing desires and demands regarding the quality of life in this changing society.
The most important factors in big data processing are the storage technology to collect various types of gigantic data as mentioned above and the analysis technology to analyze it for meaningful use. In this era of big data, new technologies such as Hadoop have emerged, providing functions to process and analyze data that existing technologies did not have [9]. Figure 2 shows an overview of Hadoop.
Figure 2. Structure of Hadoop
The utilization of big data now goes beyond the area of 'big data management' led by business entities and expands into the area of public services for the general public [4]. Big data is used to improve national competitiveness, not merely corporate competitiveness.
Big data has been discussed mainly in the category of innovative business management, in an attempt to reflect market demands in corporate management by collecting and analyzing the large quantity of data generated from mobile devices and social media. Big data has also been applied to minimizing product defects, by referring to the tremendous amount of data from the production line, as well as to planning systematic distribution tasks. Large global IT companies have released big data solutions in domestic markets, focusing on product promotion and education.
Recently, however, as big data has been introduced into public service sectors, its concept has broadened to cover the whole community as well as corporate management. In other countries, new public service models combining big data and system integration (SI) have already been presented, providing citizens with high-quality services. Big data solutions now function as the brain of the information-based systems of public agencies and play an important role in enhancing the quality of administrative services.
2. Big Data Processing Models
Various types of platforms handling big data basically consist of three elements: a storage system, a handling process, and an analysis mechanism. This study focuses on the platform technology related to the handling process among the three. Parallel DBMSs and NoSQL, two storage systems, are alike in that both adopt a horizontal expansion approach for large-quantity data storage. Besides these, there are existing storage device technologies such as SAN (Storage Area Network) and NAS (Network Attached Storage), cloud file storage systems such as Amazon S3 and OpenStack Swift, and distributed file systems such as GFS (Google File System) and HDFS (Hadoop Distributed File System). These are all designed for large-quantity data storage.
In a big data handling process, the core of parallel processing is 'divide and conquer': dividing data into independent sets and handling them in parallel. Big data processing divides a problem into multiple small operations, executes them, and combines their outputs into one single result. Where operations depend on one another, however, the advantage of parallel operation is lost; reflecting this limitation, proper data storage and processing methods are necessary. One of the well-known large-quantity data processing technologies is the Map-Reduce distributed data processing framework, implemented in systems such as Apache Hadoop. Map-Reduce data processing is illustrated in Figure 3.
Figure 3. Principle of Map-Reduce Data Processing (Source: Amazon Web Services)
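The divide-and-conquer idea above can be sketched in a few lines of Python. This is an illustration only: local threads stand in for a cluster of machines, and the function names are invented, not part of any framework.

```python
# Toy illustration of divide and conquer: split a dataset into
# independent chunks, process each chunk in parallel, then combine
# the partial results into one answer. Real frameworks such as
# Hadoop distribute the chunks across many machines instead.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each chunk is handled independently of all the others.
    return sum(chunk)

def divide_and_conquer(data, n_chunks=4):
    size = max(1, len(data) // n_chunks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)  # combine the partial results

total = divide_and_conquer(list(range(1, 101)))
print(total)  # -> 5050
```

Note that this only works because summation has no dependency between chunks; as the text observes, operations that depend on one another lose the benefit of parallelism.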
The characteristics of the Map-Reduce model are as follows:
- A common embedded hard disk drive in an ordinary computer may be used for the operation, with no need for special storage hardware.
- Because the computers are only weakly coupled to one another, the cluster can be expanded to hundreds or thousands of units.
- Since a large number of computers are involved in the processing, malfunctions of the system, including hardware failures, are assumed to be common rather than exceptional.
- Complicated problems can be solved with the simple, abstract basic operations of Map and Reduce, so even programmers unfamiliar with parallel programming can readily process data in parallel.
- High throughput can be achieved when a number of processors are used simultaneously.
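As a concrete illustration of the model, the classic word-count example can be sketched in plain Python. This is an in-memory sketch only; the function names are illustrative and are not the Hadoop API.

```python
# Minimal in-memory sketch of the Map-Reduce model: a map phase
# emitting (key, value) pairs, a shuffle grouping values by key,
# and a reduce phase combining the values for each key.
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine the values for one key into a single result.
    return key, sum(values)

documents = ["big data big analysis", "data analysis"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # -> {'big': 2, 'data': 2, 'analysis': 2}
```

In a real cluster, the map calls run on the machines holding the data and the shuffle moves pairs across the network, but the programmer still writes only the two simple functions, which is the accessibility advantage noted above.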
Figure 4 shows the concept of Map-Reduce programming.
Figure 4. Concept of Map-Reduce Programming
Dryad is a framework that forms data channels between programs as a graph and processes the resulting data sets in parallel. Map-Reduce developers design Map and Reduce functions, while Dryad developers design the data processing as a graph. Dryad can process data flows in the form of a DAG (Directed Acyclic Graph).
Parallel data processing frameworks such as Map-Reduce and Dryad provide sufficient functions to process big data, but they present barriers to inexperienced developers, data analysts, and data miners. It is therefore necessary to provide a higher level of abstraction so that data can be handled more easily. Apache Pig and Apache Hive, explained below, are two examples of such frameworks.
Apache Pig provides a high-level structure for combining and processing large quantities of data. Apache Pig supports the Pig Latin language, which has the following characteristics:
- High-level structures such as relation, bag, and tuple are provided in addition to basic types such as int, long, and double.
- Relational operations on relations (tables) such as FILTER, FOREACH, GROUP, JOIN, LOAD, and STORE are supported.
- User-defined functions can be written.
A data processing program written in Pig Latin is converted into a logical execution plan, which is in turn converted into a Map-Reduce execution plan.
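Since Pig Latin scripts run on a Hadoop cluster, a rough in-memory Python analogue can illustrate the relational operations named above (LOAD, FILTER, GROUP, FOREACH). The relation, field names, and threshold below are invented for illustration.

```python
# Python analogue of a small Pig Latin pipeline over a "relation"
# of (name, score) tuples. Each step mirrors one Pig operation.
records = [("ann", 85), ("bob", 42), ("ann", 90), ("cho", 77)]  # LOAD

# FILTER: keep only tuples with score >= 60.
passed = [r for r in records if r[1] >= 60]

# GROUP BY name: collect the scores for each name.
grouped = {}
for name, score in passed:
    grouped.setdefault(name, []).append(score)

# FOREACH ... GENERATE: compute the average score per group.
averages = {name: sum(s) / len(s) for name, s in grouped.items()}
print(averages)  # -> {'ann': 87.5, 'cho': 77.0}
```

In actual Pig Latin, each of these steps would be one declarative statement, and the Pig compiler would translate the whole pipeline into the logical and Map-Reduce execution plans described above.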
Apache Pig takes the approach of designing a large-quantity data processing program in the style of procedural programming languages such as C and Java; Sawzall from Google adopts a similar approach. Other technologies adopt declarative data processing methods based on SQL instead of specifying the processing procedure as in a programming language; Apache Hive, Google Tenzing, and Microsoft SCOPE are examples.
Apache Hive is a technology for analyzing large original data sets stored in HDFS or HBase by means of a query language called HiveQL. Architecturally, it can be divided into the Map-Reduce-based execution part, the metadata information on the data storage, and the part that executes queries received from users or applications. To support user extension, user-defined functions can be declared at the level of scalar values, aggregations, and tables.
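Because HiveQL is SQL-like, a small sqlite3 session can illustrate the declarative style Hive offers: the query states what to compute, and the engine plans the execution (Hive compiles such queries into Map-Reduce jobs rather than a local plan). The table and column names below are invented.

```python
# Declarative, SQL-style data processing: the GROUP BY query below
# is comparable in spirit to a HiveQL aggregation over HDFS data,
# but runs here against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (url TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("/home", 120), ("/docs", 45), ("/home", 80)])

# The query says *what* to compute; the engine decides *how*.
rows = conn.execute(
    "SELECT url, SUM(views) FROM page_views GROUP BY url ORDER BY url"
).fetchall()
print(rows)  # -> [('/docs', 45), ('/home', 200)]
```

This is the key contrast with the procedural Pig/Sawzall style: no processing steps are spelled out, only the desired result.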
3. Big Data Applicability
Big data is based on data sets generated in all areas of human activity. Such data may therefore belong to public service areas, companies, or daily life. Accordingly, this chapter divides the utilization of big data into public services and general areas of daily life.
3.1. Public Services
Rio de Janeiro, Brazil, prepares for various kinds of possible urban disasters by utilizing big data, which is a good example of big data being used for national management rather than for the innovative management of a business entity.
In preparation for the World Cup in 2014 and the Olympic Games in 2016, Rio de Janeiro established an urban management and emergency response system through its IOC (Intelligent Operations Center). Data and processes from more than 30 different agencies are integrated into the IOC so that overall activity in the city can be monitored 24 hours a day, 365 days a year.
IOC includes the integrated management system for water resources as well as traffic,
electric power, and natural disasters such as flood and landslide. In particular, the analysis
solution of IBM makes it possible to predict and respond to emergency situations effectively.
The high resolution weather forecast system and hydrologic modeling system provided by
IBM analyze a large quantity of data related to weather and hydrology and predict a heavy
rain 48 hours in advance.
Based on the integrated mathematical model of data extracted from river basin
topographical materials, precipitation statistics, and radar photographs, future precipitation
and sudden floods are predicted. In addition, other situations that affect the city such as heavy
traffic and blackout are also evaluated.
The new automatic alarm system notifies municipal officials and the emergency response team of any changes in flood and landslide predictions over the area of Rio de Janeiro. This alarm system uses instant communication tools such as automatic email notifications and SMS text messages to reach the emergency response team and citizens in emergency situations, unlike previous systems that required manual work for notification, so the response is more prompt and takes less time.
Singapore suffers serious traffic congestion due to the drastic increase in vehicles, and big data is regarded as a new solution among Singapore's administrative agencies. Singapore operates a traffic prediction system called 'TPT' that, based on big data analysis, goes beyond existing real-time traffic information. Singapore's Land Transport Authority (LTA) predicts urban traffic conditions and flows using the i-Transport system and other prediction tools. The prediction system consists of traffic flow analysis and prediction sub-systems. As the LTA traffic control center sends real-time traffic data collected from sensors, the traffic one hour ahead is predicted by modeling the traffic scenario. According to IBM, which provided the solution, the accuracy of the overall prediction is about 85%, and it is even higher in business centers where traffic is heavier.
DC Water, which manages the water and sewage system of Washington, D.C. in the U.S., introduced a big data system for the effective management of its sewage and collection system. The prediction analysis system made it possible to manage assets and facilities such as sewage piping, valves, the public water system, collection piping, manholes, and gauges effectively. Using this system, workers can check the location and status of company assets on detailed maps and promptly look up asset details and total costs, local problems, different types of regional water quality problems, and so forth. In particular, DC Water prevented unexpected service suspensions based on its prediction analysis and established a new rate model based on service demands.
In addition, the enhanced preventive measures and automatic meter reading reduced customer calls by as much as 36% and simplified the process. Currently, 93% (previously 49%) of tasks are handled within 10 minutes.
3.2. Utilization for Living Facilities
What if an electricity bill 'bomb' fell on consumers? There might be no other solution than resolving to save electricity. In particular, electricity and water bills do not itemize the details as credit card statements do, so there is no way to look into specific items; one simply pays the charges every month. Big data, however, is expected to solve such problems.
Shwetak Patel, a MacArthur Fellow and assistant professor in the Department of Computer Science at the University of Washington, has found a more reasonable way of calculating utility bills. His idea is based on the fact that the electricity, water, and gas lines coming into a house carry device-specific digital signals.
He designed a sensor that recognizes such signals through a simple algorithm; installed at points such as the gas piping, electricity wiring, sewage piping, and ventilator, the sensor produces digital signals and transmits them to a tablet PC for real-time checking.
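The signature-recognition idea can be illustrated with a toy sketch. The signatures, sample values, and nearest-match rule below are all invented for illustration and are not Patel's actual algorithm.

```python
# Hypothetical sketch of appliance identification by signal
# signature: each device leaves a characteristic pattern on the
# house wiring, and an observed signal is matched to the closest
# known signature. All numbers here are made up.
KNOWN_SIGNATURES = {
    "refrigerator": [120.0, 118.5, 121.0],
    "pool_pump":    [900.0, 905.0, 898.0],
}

def identify(observed):
    # Pick the appliance whose signature is closest to the observed
    # samples (sum of absolute differences).
    def distance(sig):
        return sum(abs(a - b) for a, b in zip(sig, observed))
    return min(KNOWN_SIGNATURES, key=lambda k: distance(KNOWN_SIGNATURES[k]))

print(identify([899.0, 903.0, 900.0]))  # -> 'pool_pump'
```

A real system would learn far richer signatures from noisy data, but the per-device attribution it enables is exactly what makes itemized utility usage, as described next, possible.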
For instance, it is possible to check how much electricity a certain appliance consumes and how much water or gas is being used. Patel installed the water and electricity measuring sensors at his cousin's place and monitored the usage on an iPad. As a result, he found that 11% of the electricity was consumed by the swimming pool's electric pump.
This innovative energy sensor invented by Shwetak Patel will be utilized in our daily life
in the near future. In 2010, Belkin bought the patent of the energy sensor from Shwetak Patel,
and the commercialization is in progress. This sensor will make it possible to grasp the exact
amount of electricity consumed by certain devices and thus to take proper preventive
measures. Besides, a large amount of information can be gained on the power supply to
hundreds of thousands of households in a future smart city.
Everyone living in Singapore knows that it is difficult to catch a taxi on a rainy day. The Singapore-MIT Alliance for Research and Technology conducted a project comparing, over two months, 830 million GPS records from a taxi operation database with weather satellite data. In 2011, special patterns were found in the middle of this project. Analysis of data from more than 16,000 taxis showed that many taxis would not move in a storm. The GPS records also showed that when it was raining, a number of taxi drivers stopped their cars and no longer picked up passengers.
The reasons were examined based on an analysis of taxi drivers' daily routines. Singapore taxi companies required drivers involved in an accident to deposit 1,000 dollars from their monthly pay, unconditionally, until the examination of the cause was completed. For this reason, taxi drivers would stop the car and wait until the weather turned fine rather than take risks on a rainy day. After the cause was found, the company rules were modified in a way that benefited the taxi companies, drivers, and passengers alike. This is one example of how big data can improve the quality of citizens' lives.
Data analysis is not limited to mere data reading but should involve insight to find
unexpected causes and results.
There is another example showing that big data analysis is an important means of reading new retail trends. In the U.S., the largest sales and shopping events of the year are held on 'Black Friday,' the day after Thanksgiving (the fourth Thursday of November). A big data analysis result, however, shows that 'Cyber Monday' is now attracting more attention as a shopping day than Black Friday. Cyber Monday is the first Monday after Thanksgiving weekend, on which people back at work tend to be absorbed in online shopping, and this trend has caught
attention of distributors. According to one report, online sales on Cyber Monday last year were 29.3% higher than those of the previous year. Mobile traffic and mobile sales were 10.8% and 6.6% respectively, close to the 14.3% and 9.8% recorded on Black Friday. This result is based on terabytes of big data and a database of more than a million transactions a day from 500 major distributors all over the country, analyzed by IBM.
4. Conclusions
Currently in our society, smartphones are commonly used, and various data-producing devices such as tablet PCs, cameras, and game consoles have recently emerged, which have increased traffic drastically. In addition, as the volume and types of data diversify and the rate of data growth accelerates, the era of 'big data' seems to be just ahead. Today, it is reported that the amount of digital information handled around the world doubles every two years. As IT converges with other industry sectors, a large quantity of data is produced every day, and the issue of utilizing big data to address desires and demands for a better quality of life in a changing society is in the spotlight.
Recently, companies have sought better corporate management and more efficient marketing based on big data analysis. Accordingly, system streamlining through big data analysis is being pursued in various areas, including such public sectors as traffic systems, water resource systems, security systems, tax-evasion prevention systems, and medical systems.
Multiple platforms to handle big data basically consist of the storage system, handling
process, and analysis mechanism. This study aims to comparatively analyze platform
technologies related to the handling process among the three elements stated above.
In the future, the interest in and use of big data platforms will continue and expand. The
applicable area too will go beyond pure IT and be expanded to every possible sector. When
such efforts are consistently expanded and developed, the future society will open the door to
a world of infinite possibility.
References
[1] Gartner, "CEO Advisory: 'Big Data' Equals Big Opportunity", (2011).
[2] McKinsey Global Institute, "Big Data: The next frontier for innovation, competition, and productivity", (2011).
[3] P. Warden, "Big Data Glossary", O'Reilly Media, (2011).
[4] J. S. Jung, "3 Factors for Successful Big Data Usage: Resource, Technology, Manpower", Big Data Strategy Forum, (2012).
[5] IDC, "Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO", (2011).
[6] IDC, "New Analytics Strategies in the Big Data Era", (2011).
[7] R. Cattell, "High Performance Scalable Data Stores", (2010).
[8] MIT Sloan Management Review, "Big Data, Analytics and the Path From Insights to Value", (2011).
[9] J. Kelly, "Big Data: Hadoop, Business Analytics and Beyond", Wikibon, (2012).
[10] M. Hilbert and P. Lopez, "The World's technological capacity to store, communicate and compute information", Science, (2011).
[11] M. Choi and N. G. Kim, "Future Smart Device Development Architecture", IJSEIA, vol. 7, no. 3, (2013), pp. 311-322.
[12] R. D. Caytiles and B. J. Park, "A Study on Analysis and Implementation of a Cloud Computing Framework for Multimedia Convergence Services", IJSEIA, vol. 7, no. 2, (2013), pp. 219-226.
[13] H. Sug, "Generating Better Radial Basis Function Network for Large Data Set of Census", IJSEIA, vol. 4, no. 2, (2010), pp. 15-22.
Authors
Byung-Tae Chun (Ph.D. '11). He received his Ph.D. in computer engineering from Korea University in 2011. From 1989 to 1996, he was a researcher at KIST (Korea Institute of Science and Technology). From 1996 to 2004, he was a senior researcher at ETRI (Electronics and Telecommunications Research Institute). Since 2004, he has been a professor at Hankyong National University, Korea. He is a member of IEEK and KIIT.
Seong-Hoon Lee (M.Sc. '95, Ph.D. '98). He received his M.Sc. degree in Computer Science and Engineering from Korea University, Seoul, Korea in 1995, and his Ph.D. degree in Computer Science and Engineering from Korea University, Korea in 1998. Since 1998, he has been a professor in the School of Information and Communication, Baekseok University, Korea. His main research interests include distributed systems, grid computing, web services, etc.