International Journal of Software Engineering and Its Applications
Vol.8, No.8 (2014), pp. 73-82
http://dx.doi.org/10.14257/ijseia.2014.8.8.08
ISSN: 1738-9984 IJSEIA
Copyright © 2014 SERSC
A Study on Big Data Processing Mechanism & Applicability
Byung-Tae Chun¹ and Seong-Hoon Lee²
¹Computer System Institute, Hankyong National University, 327, Chungang-no, Anseong-si, Kyonggi-do, Korea
²Division of Information & Communication, Baekseok University, 115, Anseo-dong, Cheonan, Choongnam, Korea
[email protected]
Abstract
Technologies in the information and communication field continue to progress. Because of IT, our society exhibits two prospective properties: first, the degree of convergence is accelerating; second, the regions of convergence are expanding, and efforts toward convergence will continue. Because of these properties, various types of devices have appeared in our lives, such as smartphones, tablet PCs, and game consoles. Through these many devices, various types of data are produced. In this paper, we describe the applicability of big data and analyze big data processing models.
Keywords: Big data, Applicability, Convergence, Hadoop
1. Introduction
"The quantity of data that a baby born today will produce in a lifetime is 70 times larger than that currently stored in the U.S. Library of Congress." "While one piece of information is stored, there are other pieces of information that are not stored." YouTube video clips are uploaded at a rate of one every 60 seconds. The statements above illustrate the emergence of big data.
Today, the hot issues in the IT industry include big data, cloud computing, and convergence. Gartner, a research and consulting firm, released the 10 major technologies and trends, such as the war of mobile devices and strategic big data, that companies should cope with in 2013. Gartner predicted that in 2013, mobile phones would overtake PCs as the most widely used web access device all over the world, and that by 2015, smartphones would account for more than 80% of all mobile phones sold in advanced countries [1].
Gartner also predicted that personal clouds would replace PCs as the space where individuals store personal content, access preferred services and objects, and center their digital lives. In addition, many organizations would provide their employees with mobile apps through exclusive app stores, and big data would be treated as part of companies' strategic information architectures rather than as mere individual projects.
When the term 'big data' first appeared, its meaning was interpreted differently. One group defines it as "terabyte-scale data," and another defines it as "an architecture for processing a large quantity of data." Since the meaning of the word "big" itself is relative, however, it would not be appropriate to define an absolute standard for the data capacity.
Big data is so large compared to existing data that it is difficult to collect, store, and analyze the structured or unstructured data using existing methods. McKinsey, one of the global consulting firms, defined big data in a 2011 report [2] as "a dataset that exceeds the capacity of existing database management tools in data collection, storage, management, and analysis," stating that "the definition is subjective and
will continue to change." The traditional concept of data and the characteristics of big data, which is now in the spotlight, are compared in Table 1 below:
Table 1. Traditional Data vs. Big Data

  Traditional data                    Big data
  ----------------------------------  --------------------------------
  Gigabytes to terabytes              Petabytes to exabytes
  Centralized                         Distributed
  Structured                          Semi-structured and unstructured
  Stable data model                   Flat schemas
  Known complex interrelationships    Few complex interrelationships

The element technologies of big data address data volume, data input/output velocity, and data variety. Figure 1 shows these three element technologies. 'Volume' refers to data on the order of tens of terabytes or tens of petabytes. 'Velocity' refers to the fast processing and analysis of large-capacity data: in a convergence environment, digital data is produced at high speed, so the system must be capable of collecting, storing, distributing, and analyzing it in real time. 'Variety' indicates that there are various types of data, which can be classified into structured, semi-structured, and unstructured data sets depending on the form of their structure. Table 2 shows the three types of big data.

Table 2. Types of Big Data

  Structured       Data stored in a fixed field; e.g., relational databases and spreadsheets
  Semi-structured  Data not stored in a fixed field but carrying metadata or a schema; e.g., XML and HTML text
  Unstructured     Data not stored in a fixed field; e.g., text documents and image/video/voice data

Figure 1. Three Element Technologies of Big Data (volume, velocity, variety)

It is reported that the quantity of data handled around the world today doubles every two years [3, 5-8]. As IT converges with other industry sectors and a tremendous amount of data is generated, the utilization of big data has become a great issue in addressing desires and demands regarding the quality of life in this changing society.
The most important factors in big data processing are the storage technology to collect various types of gigantic data as mentioned above and the analysis technology to analyze it for meaningful use. In this era of big data, new technologies such as Hadoop have emerged, providing functions to process and analyze data that existing technologies did not have [9]. Figure 2 shows an overview of Hadoop.
Figure 2. Structure of Hadoop
The utilization of big data now goes beyond the area of 'big data management' led by business entities and expands into the area of public services for the general public [4]. Big data is used to improve national competitiveness, not merely corporate competitiveness.
Big data has been discussed mainly in the category of innovative business management, in an attempt to reflect market demands in corporate management by collecting and analyzing the large quantity of data generated from mobile devices and social media. Big data has also been applied to minimizing product defects, by referring to the tremendous amount of data from the production line, as well as to planning systematic distribution tasks. Large global IT companies have released big data solutions in domestic markets, focusing on product promotion and education.
Recently, however, as big data has been introduced into public service sectors, its concept has broadened to cover the whole community as well as corporate management. In other countries, new public service models combining big data and system integration (SI) have already been presented, providing citizens with high-quality services. Big data solutions now function as the brain of the information-based systems of public agencies and play an important role in enhancing the quality of administrative services.
2. Big Data Processing Models
Various types of platforms handling big data basically consist of three elements: a storage system, a handling process, and an analysis mechanism. This study focuses on the platform technology related to the handling process among the three. Parallel DBMSs and NoSQL, two storage systems, are alike in that both adopt a horizontal expansion approach for large-quantity data storage. Besides these, there are existing storage device technologies such as SAN (Storage Area Network) and NAS (Network Attached Storage), cloud file storage systems such as Amazon S3 and OpenStack Swift, and distributed file systems such as GFS (Google File System) and HDFS (Hadoop Distributed File System). These are all designed for large-quantity data storage.
In a big data handling process, the core of parallel processing is 'divide and conquer': dividing data into independent sets and handling them in parallel. Big data processing divides a problem into multiple small operations, executes them, and combines their outputs into one single result. Where operations depend on one another, however, the advantage of parallel operation is lost; reflecting this limitation, proper data storage and processing methods are necessary. One of the well-known large-quantity data processing technologies is the Map-Reduce distributed data processing framework, implemented in systems such as Apache Hadoop. Map-Reduce data processing is illustrated in Figure 3.
Figure 3. Principle of Map-Reduce Data Processing (Source: Amazon Web Services)
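The divide-and-conquer idea above can be sketched in a few lines of Python. This is an illustration only: local threads stand in for a cluster of machines, and the function names are invented, not part of any framework.

```python
# Toy illustration of divide and conquer: split a dataset into
# independent chunks, process each chunk in parallel, then combine
# the partial results into one answer. Real frameworks such as
# Hadoop distribute the chunks across many machines instead.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each chunk is handled independently of all the others.
    return sum(chunk)

def divide_and_conquer(data, n_chunks=4):
    size = max(1, len(data) // n_chunks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)  # combine the partial results

total = divide_and_conquer(list(range(1, 101)))
print(total)  # -> 5050
```

Note that this only works because summation has no dependency between chunks; as the text observes, operations that depend on one another lose the benefit of parallelism.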
The characteristics of the Map-Reduce model are as follows:
- A common embedded hard disk drive in an ordinary computer may be used for the operation, with no need for special storage hardware.
- Because the computers are only weakly coupled to one another, the cluster can be expanded to hundreds or thousands of units.
- Since a large number of computers are involved in the processing, malfunctions of the system, including hardware failures, are assumed to be common rather than exceptional.
- Complicated problems can be solved with the simple, abstract basic operations of Map and Reduce, so even programmers unfamiliar with parallel programming can readily process data in parallel.
- High throughput can be achieved when a number of processors are used simultaneously.
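As a concrete illustration of the model, the classic word-count example can be sketched in plain Python. This is an in-memory sketch only; the function names are illustrative and are not the Hadoop API.

```python
# Minimal in-memory sketch of the Map-Reduce model: a map phase
# emitting (key, value) pairs, a shuffle grouping values by key,
# and a reduce phase combining the values for each key.
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine the values for one key into a single result.
    return key, sum(values)

documents = ["big data big analysis", "data analysis"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # -> {'big': 2, 'data': 2, 'analysis': 2}
```

In a real cluster, the map calls run on the machines holding the data and the shuffle moves pairs across the network, but the programmer still writes only the two simple functions, which is the accessibility advantage noted above.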
Figure 4 shows the concept of Map-Reduce programming.
Figure 4. Concept of Map-Reduce Programming
Dryad is a framework that forms data channels between programs as a graph and processes the resulting data sets in parallel. Map-Reduce developers design Map and Reduce functions, while Dryad developers design the data processing as a graph. Dryad can process data flows in the form of a DAG (Directed Acyclic Graph).
Parallel data processing frameworks such as Map-Reduce and Dryad provide sufficient functions to process big data, but they present barriers to inexperienced developers, data analysts, and data miners. It is therefore necessary to provide a higher level of abstraction so that data can be handled more easily. Apache Pig and Apache Hive, explained below, are two examples of such frameworks.
Apache Pig provides a high-level structure for combining and processing large quantities of data. Apache Pig supports the Pig Latin language, which has the following characteristics:
- High-level structures such as relation, bag, and tuple are provided in addition to basic types such as int, long, and double.
- Relational operations on relations (tables) such as FILTER, FOREACH, GROUP, JOIN, LOAD, and STORE are supported.
- User-defined functions can be written.
A data processing program written in Pig Latin is converted into a logical execution plan, which is in turn converted into a Map-Reduce execution plan.
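Since Pig Latin scripts run on a Hadoop cluster, a rough in-memory Python analogue can illustrate the relational operations named above (LOAD, FILTER, GROUP, FOREACH). The relation, field names, and threshold below are invented for illustration.

```python
# Python analogue of a small Pig Latin pipeline over a "relation"
# of (name, score) tuples. Each step mirrors one Pig operation.
records = [("ann", 85), ("bob", 42), ("ann", 90), ("cho", 77)]  # LOAD

# FILTER: keep only tuples with score >= 60.
passed = [r for r in records if r[1] >= 60]

# GROUP BY name: collect the scores for each name.
grouped = {}
for name, score in passed:
    grouped.setdefault(name, []).append(score)

# FOREACH ... GENERATE: compute the average score per group.
averages = {name: sum(s) / len(s) for name, s in grouped.items()}
print(averages)  # -> {'ann': 87.5, 'cho': 77.0}
```

In actual Pig Latin, each of these steps would be one declarative statement, and the Pig compiler would translate the whole pipeline into the logical and Map-Reduce execution plans described above.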
Apache Pig takes the approach of designing a large-quantity data processing program in the style of procedural programming languages such as C and Java; Sawzall from Google adopts a similar approach. Other technologies adopt declarative data processing methods based on SQL instead of specifying the processing procedure as in a programming language; Apache Hive, Google Tenzing, and Microsoft SCOPE are examples.
Apache Hive is a technology for analyzing large original data sets stored in HDFS or HBase by means of a query language called HiveQL. Architecturally, it can be divided into the Map-Reduce-based execution part, the metadata information on the data storage, and the part that executes queries received from users or applications. To support user extension, user-defined functions can be declared at the level of scalar values, aggregations, and tables.
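Because HiveQL is SQL-like, a small sqlite3 session can illustrate the declarative style Hive offers: the query states what to compute, and the engine plans the execution (Hive compiles such queries into Map-Reduce jobs rather than a local plan). The table and column names below are invented.

```python
# Declarative, SQL-style data processing: the GROUP BY query below
# is comparable in spirit to a HiveQL aggregation over HDFS data,
# but runs here against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (url TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("/home", 120), ("/docs", 45), ("/home", 80)])

# The query says *what* to compute; the engine decides *how*.
rows = conn.execute(
    "SELECT url, SUM(views) FROM page_views GROUP BY url ORDER BY url"
).fetchall()
print(rows)  # -> [('/docs', 45), ('/home', 200)]
```

This is the key contrast with the procedural Pig/Sawzall style: no processing steps are spelled out, only the desired result.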
3. Big Data Applicability
Big data is based on data sets generated in all areas of human activity. Such data may therefore belong to public service areas, companies, or daily life. Accordingly, this chapter divides the utilization of big data into public services and general areas of daily life.
3.1. Public Services
Rio de Janeiro, Brazil, prepares for various kinds of possible urban disasters by utilizing big data, which is a good example of big data being used for national management rather than for the innovative management of a business entity.
In preparation for the World Cup in 2014 and the Olympic Games in 2016, Rio de Janeiro established an urban management and emergency response system through its IOC (Intelligent Operations Center). Data and processes from more than 30 different agencies are integrated into the IOC so that overall activity in the city can be monitored 24 hours a day, 365 days a year.
IOC includes the integrated management system for water resources as well as traffic,
electric power, and natural disasters such as flood and landslide. In particular, the analysis
solution of IBM makes it possible to predict and respond to emergency situations effectively.
The high resolution weather forecast system and hydrologic modeling system provided by
IBM analyze a large quantity of data related to weather and hydrology and predict a heavy
rain 48 hours in advance.
Based on the integrated mathematical model of data extracted from river basin
topographical materials, precipitation statistics, and radar photographs, future precipitation
and sudden floods are predicted. In addition, other situations that affect the city such as heavy
traffic and blackout are also evaluated.
The new automatic alarm system notifies municipal officials and the emergency response team of any changes in flood and landslide predictions over the area of Rio de Janeiro. This alarm system uses instant communication tools such as automatic email notifications and SMS text messages to reach the emergency response team and citizens in emergency situations, unlike previous systems that required manual work for notification, so the response is more prompt and takes less time.
Singapore suffers serious traffic congestion due to the drastic increase in vehicles, and big data is regarded as a new solution among Singapore's administrative agencies. Singapore operates a traffic prediction system called 'TPT' that, based on big data analysis, goes beyond existing real-time traffic information. Singapore's Land Transport Authority (LTA) predicts urban traffic conditions and flows using the i-Transport system and other prediction tools. The prediction system consists of traffic flow analysis and prediction sub-systems. As the LTA traffic control center sends real-time traffic data collected from sensors, the traffic one hour ahead is predicted by modeling the traffic scenario. According to IBM, which provided the solution, the accuracy of the overall prediction is about 85%, and it is even higher in business centers where traffic is heavier.
DC Water, which manages the water and sewage system of Washington, D.C. in the U.S., introduced a big data system for the effective management of its sewage and collection system. The prediction analysis system made it possible to manage assets and facilities such as sewage piping, valves, the public water system, collection piping, manholes, and gauges effectively. Using this system, workers can check the location and status of company assets on detailed maps and promptly look up asset details and total costs, local problems, different types of regional water quality problems, and so forth. In particular, DC Water prevented unexpected service suspensions based on its prediction analysis and established a new rate model based on service demands.
In addition, the enhanced preventive measures and automatic meter reading reduced customer calls by as much as 36% and simplified the process. Currently, 93% (previously 49%) of tasks are handled within 10 minutes.
3.2. Utilization for Living Facilities
What if an electricity bill 'bomb' fell on consumers? There might be no other solution than resolving to save electricity. In particular, electricity and water bills do not itemize the details as credit card statements do, so there is no way to look into specific items; one simply pays the charges every month. Big data, however, is expected to solve such problems.
Shwetak Patel, a MacArthur Fellow and assistant professor in the Department of Computer Science at the University of Washington, has found a more reasonable way of calculating utility bills. His idea is based on the fact that the electricity, water, and gas lines coming into a house carry device-specific digital signals.
He designed a sensor that recognizes such signals through a simple algorithm; installed at points such as the gas piping, electricity wiring, sewage piping, and ventilator, the sensor produces digital signals and transmits them to a tablet PC for real-time checking.
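The signature-recognition idea can be illustrated with a toy sketch. The signatures, sample values, and nearest-match rule below are all invented for illustration and are not Patel's actual algorithm.

```python
# Hypothetical sketch of appliance identification by signal
# signature: each device leaves a characteristic pattern on the
# house wiring, and an observed signal is matched to the closest
# known signature. All numbers here are made up.
KNOWN_SIGNATURES = {
    "refrigerator": [120.0, 118.5, 121.0],
    "pool_pump":    [900.0, 905.0, 898.0],
}

def identify(observed):
    # Pick the appliance whose signature is closest to the observed
    # samples (sum of absolute differences).
    def distance(sig):
        return sum(abs(a - b) for a, b in zip(sig, observed))
    return min(KNOWN_SIGNATURES, key=lambda k: distance(KNOWN_SIGNATURES[k]))

print(identify([899.0, 903.0, 900.0]))  # -> 'pool_pump'
```

A real system would learn far richer signatures from noisy data, but the per-device attribution it enables is exactly what makes itemized utility usage, as described next, possible.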
For instance, it is possible to check how much electricity a certain appliance consumes and how much water or gas is being used. Patel installed the water and electricity measuring sensors at his cousin's place and monitored the usage on an iPad. As a result, he found that 11% of the electricity was consumed by the swimming pool's electric pump.
This innovative energy sensor invented by Shwetak Patel will be utilized in our daily life
in the near future. In 2010, Belkin bought the patent of the energy sensor from Shwetak Patel,
and the commercialization is in progress. This sensor will make it possible to grasp the exact
amount of electricity consumed by certain devices and thus to take proper preventive
measures. Besides, a large amount of information can be gained on the power supply to
hundreds of thousands of households in a future smart city.
Everyone living in Singapore knows that it is difficult to catch a taxi on a rainy day. The Singapore-MIT Alliance for Research and Technology conducted a project comparing, over two months, 830 million GPS records from a taxi operation database with weather satellite data. In 2011, special patterns were found in the middle of this project. Analysis of data from more than 16,000 taxis showed that many taxis would not move in a storm. The GPS records also showed that when it was raining, a number of taxi drivers stopped their cars and no longer picked up passengers.
The reasons were examined based on an analysis of taxi drivers' daily routines. Singapore taxi companies required drivers involved in an accident to deposit 1,000 dollars from their monthly pay, unconditionally, until the examination of the cause was completed. For this reason, taxi drivers would stop the car and wait until the weather turned fine rather than take risks on a rainy day. After the cause was found, the company rules were modified in a way that benefited the taxi companies, drivers, and passengers alike. This is one example of how big data can improve the quality of citizens' lives.
Data analysis is not limited to mere data reading but should involve insight to find
unexpected causes and results.
There is another example showing that big data analysis is an important means of reading new retail trends. In the U.S., the largest sales and shopping events of the year are held on 'Black Friday,' the day after Thanksgiving (the fourth Thursday of November). A big data analysis result, however, shows that 'Cyber Monday' is now attracting more attention as a shopping day than Black Friday. Cyber Monday is the first Monday after Thanksgiving weekend, on which people back at work tend to be absorbed in online shopping, and this trend has caught
attention of distributors. According to one report, online sales on Cyber Monday last year were 29.3% higher than those of the previous year. Mobile traffic and mobile sales were 10.8% and 6.6% respectively, close to the 14.3% and 9.8% recorded on Black Friday. This result is based on terabytes of big data and a database of more than a million transactions a day from 500 major distributors all over the country, analyzed by IBM.
4. Conclusions
Currently in our society, smartphones are commonly used, and various data-producing devices such as tablet PCs, cameras, and game consoles have recently emerged, which have increased traffic drastically. In addition, as the volume and types of data diversify and the rate of data growth accelerates, the era of 'big data' seems to be just ahead. Today, it is reported that the amount of digital information handled around the world doubles every two years. As IT converges with other industry sectors, a large quantity of data is produced every day, and the issue of utilizing big data to address desires and demands for a better quality of life in a changing society is in the spotlight.
Recently, companies have sought better corporate management and more efficient marketing based on big data analysis. Accordingly, system streamlining through big data analysis is being pursued in various areas, including such public sectors as traffic systems, water resource systems, security systems, tax-evasion prevention systems, and medical systems.
Multiple platforms to handle big data basically consist of the storage system, handling
process, and analysis mechanism. This study aims to comparatively analyze platform
technologies related to the handling process among the three elements stated above.
In the future, the interest in and use of big data platforms will continue and expand. The
applicable area too will go beyond pure IT and be expanded to every possible sector. When
such efforts are consistently expanded and developed, the future society will open the door to
a world of infinite possibility.
References
[1] Gartner, "CEO Advisory: 'Big Data' Equals Big Opportunity", (2011).
[2] McKinsey Global Institute, "Big Data: The next frontier for innovation, competition, and productivity", (2011).
[3] P. Warden, "Big Data Glossary", O'Reilly Media, (2011).
[4] J. S. Jung, "3 Factors for Successful Big Data Usage: Resource, Technology, Manpower", Big Data Strategy Forum, (2012).
[5] IDC, "Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO", (2011).
[6] IDC, "New Analytics Strategies in the Big Data Era", (2011).
[7] R. Cattell, "High Performance Scalable Data Stores", (2010).
[8] MIT Sloan Management Review, "Big Data, Analytics and the Path From Insights to Value", (2011).
[9] J. Kelly, "Big Data: Hadoop, Business Analytics and Beyond", Wikibon, (2012).
[10] M. Hilbert and P. Lopez, "The World's technological capacity to store, communicate and compute information", Science, (2011).
[11] M. Choi and N. G. Kim, "Future Smart Device Development Architecture", IJSEIA, vol. 7, no. 3, (2013), pp. 311-322.
[12] R. D. Caytiles and B. J. Park, "A Study on Analysis and Implementation of a Cloud Computing Framework for Multimedia Convergence Services", IJSEIA, vol. 7, no. 2, (2013), pp. 219-226.
[13] H. Sug, "Generating Better Radial Basis Function Network for Large Data Set of Census", IJSEIA, vol. 4, no. 2, (2010), pp. 15-22.
Authors
Byung-Tae Chun (Ph.D. '11). He received his Ph.D. in computer engineering from Korea University in 2011. From 1989 to 1996, he was a researcher at KIST (Korea Institute of Science and Technology). From 1996 to 2004, he was a senior researcher at ETRI (Electronics and Telecommunications Research Institute). Since 2004, he has been a professor at Hankyong National University, Korea. He is a member of IEEK and KIIT.
Seong-Hoon Lee (M.Sc. '95, Ph.D. '98). He received his M.Sc. degree in Computer Science and Engineering from Korea University, Seoul, Korea in 1995, and his Ph.D. degree in Computer Science and Engineering from Korea University, Korea in 1998. Since 1998, he has been a professor in the School of Information and Communication, Baekseok University, Korea. His main research interests include distributed systems, grid computing, web services, etc.