
International Journal of Software Engineering and Its Applications
Vol.8, No.8 (2014), pp. 73-82
http://dx.doi.org/10.14257/ijseia.2014.8.8.08
ISSN: 1738-9984 IJSEIA
Copyright © 2014 SERSC

A Study on Big Data Processing Mechanism & Applicability

Byung-Tae Chun1 and Seong-Hoon Lee2

1Computer System Institute, Hankyong National University, 327, Chungang-no, Anseong-si, Kyonggi-do, Korea
2Division of Information & Communication, Baekseok University, 115, Anseo-dong, Cheonan, Choongnam, Korea
[email protected], [email protected]

Abstract

Technologies in the information and communication field continue to advance. Because of IT, our society shows two prospective properties. First, the degree of convergence is accelerating, and the regions of convergence are expanding; efforts toward convergence will continue. As a result, various types of devices, such as smart phones, tablet PCs, and game consoles, have appeared in our lives, and through these many devices various types of data are produced. In this paper, we describe the applicability of big data and analyze big data processing models.

    Keywords: Big data, Applicability, Convergence, Hadoop

1. Introduction

"The quantity of data that a baby born today will produce is 70 times larger than that currently stored in the U.S. Assembly library." "While one piece of information is stored, there are some pieces of information which are not stored." "YouTube video clips are uploaded every 60 seconds." The statements above imply the appearance of big data.

Today, the hot issues in the IT industry include big data, cloud computing, and convergence. Gartner, a research and consulting agency, released the ten major technologies and trends, such as the war of mobile devices and strategic big data, with which companies should cope in 2013. Gartner predicted that in 2013, mobile phones would overtake PCs as the web access device most widely used all over the world, and that by 2015, smart phones would account for more than 80% of all mobile phones sold in advanced countries [1]. It also predicted that personal clouds would replace PCs as the space where individuals store personal content, access preferred services and objects, and center their digital lives. In addition, many organizations would provide their employees with mobile apps through exclusive app stores, and big data would be considered part of strategic information architectures among companies rather than mere individual projects.

When the term 'big data' first appeared, its meaning was interpreted differently. One group defines it as "terabyte data," and another defines it as "an architecture for processing a large quantity of data." Since the meaning of the term "big" itself is relative, however, it would not be appropriate to define an absolute standard for data capacity. Big data is so large compared to existing data that it is difficult to collect, save, and analyze the structured or unstructured data using existing ways or methods. McKinsey, one of the global consulting agencies, in a report released in 2011 [2], defined big data as "a dataset that exceeds the capacity of existing database management tools in data collection, storage, management, and analysis," stating that "the definition is subjective and


will continue to change." The traditional concept of data and the characteristics of big data, which are now in the spotlight, are compared in Table 1 below.

Table 1. Traditional Data/Big Data

Traditional data                 | Big data
Gigabytes to terabytes           | Petabytes to exabytes
Centralized                      | Distributed
Structured                       | Semi-structured and unstructured
Stable data model                | Flat schemas
Known complex interrelationships | Few complex interrelationships

The element technologies of big data include media-related data volume, data input/output velocity, and data variety. Figure 1 shows these three element technologies. The term 'volume' means a data attribute of tens of terabytes or tens of petabytes in general. 'Velocity' is an attribute referring to the fast processing and analysis of large-capacity data. In a convergence environment, digital data is produced at high speed, and thus the system should be capable of saving, distributing, collecting, and analyzing it in real time.

Figure 1. Three Element Technologies of Big Data (Volume, Velocity, Variety)

'Variety' indicates that there are various types of data, which can be classified into structured, semi-structured, and unstructured data sets depending on the kind of structure. Table 2 shows the three types of big data.

Table 2. Types of Big Data

Structured      | Data stored in a fixed field                                           | Relational database, spreadsheet, etc.
Semi-structured | Data not stored in a fixed field but including metadata, schema, etc.  | XML, HTML text, and so forth
Unstructured    | Data not stored in a fixed field                                       | Text-recognizable documents, image/video/voice data, etc.

It is reported that the quantity of data handled around the world today doubles every two years [3, 5-8]. As IT converges with other industry sectors and a tremendous amount of data is generated, the utilization of big data has become a great issue in addressing desires and demands regarding the quality of life in this changing society.

The most important factors in big data processing are the storage technology to collect the various types of gigantic data mentioned above and the data analysis technology to analyze them for meaningful use. In this era of big data, new technologies such as Hadoop have emerged, providing functions to process and analyze data that existing technologies did not have [9]. Figure 2 shows an overview of Hadoop.

    Figure 2. Structure of Hadoop

The utilization of big data now goes beyond the area of 'big data management' led by business entities and expands into the area of public service for the general public [4]. Big data is made use of for the improvement of national competitiveness, not merely corporate competitiveness.

Big data has been discussed mainly in the category of innovative business management, in an attempt to reflect market demands in corporate management by collecting and analyzing the large quantity of data generated from various mobile devices and social media. Big data has also been considered for minimizing product defects by reference to the tremendous amount of data from the production line, as well as for planning systematic distribution tasks. Global large IT companies have released big data solutions in domestic markets, focusing their corporate management on product promotion and education.

Recently, however, as big data is introduced into public service sectors, its concept has broadened to the area of the whole community as well as that of corporate management. In other countries, new public service models combining big data and system integration (SI) have already been presented, providing citizens with high quality services. Big data solutions now function as the brain of the information-based systems of public agencies and play an important role in enhancing the quality of administrative services.

    2. Big Data Processing Models

Various types of platforms handling big data basically consist of three elements: a storage system, a handling process, and an analysis mechanism. This study focuses on the platform technology related to the handling process among these three elements. Parallel DBMS and NoSQL, two storage systems, are alike in that they adopt a horizontal expansion approach for large-quantity data storage. Besides these, there are existing storage technologies such as SAN (Storage Area Network), NAS (Network Attached Storage), cloud file storage systems such as Amazon S3 and OpenStack Swift, and distributed file systems such as GFS (Google File System) and HDFS (Hadoop Distributed File System). These are all designed for large-quantity data storage.

In a big data handling process, the core of parallel processing is 'divide and conquer': divide data into independent sets and handle them in parallel. Big data processing divides a problem into multiple small operations, performs them, and combines their results into one single result. Where operations depend on one another, however, the advantage of parallel operation is lost. In view of this limitation, a proper data storage and processing method is necessary. One of the well-known large-quantity data processing technologies is the Map-Reduce distributed data processing framework, such as Apache Hadoop. Map-Reduce data processing is illustrated in Figure 3.

Figure 3. Principle of Map-Reduce Data Processing (Source: Amazon Web Services)

The characteristics of the Map-Reduce model are as follows:

- A common embedded hard disk drive in an ordinary computer may be used for the operation, with no need for special storage means. As each computer has quite weak correlation to the others, it is possible to expand the cluster to hundreds or thousands of units.
- Since a number of computers are involved in the processing, it is assumed that malfunctions of the system, including hardware, are not exceptional but common.
- Complicated problems can be solved with the simple and abstract basic operations of Map and Reduce. Even programmers unfamiliar with parallel programs can readily handle data in parallel.
- It can achieve high throughput when a number of processors are used simultaneously.

Figure 4 shows the concept of Map-Reduce programming.


    Figure 4. Concept of Map-Reduce Programming
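The Map and Reduce operations described above can be sketched in plain Python. The word-count example below is only an illustration of the programming model: the three phases simulate, in a single process, what Hadoop would distribute across many machines; it does not use Hadoop's actual API.

```python
from collections import defaultdict
from itertools import chain

# Map phase: each input record (a line of text) is turned into (key, value) pairs.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Shuffle phase: pairs with the same key are grouped together.
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each key's values are combined into a single result.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

def word_count(lines):
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(pairs))

print(word_count(["big data big", "data processing"]))
# {'big': 2, 'data': 2, 'processing': 1}
```

Because each (word, 1) pair is produced independently and each key's reduction depends only on its own group, the map calls and the per-key reductions could run on different machines; this independence is what lets the model scale.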

Dryad is a framework that forms data channels between programs as a graph and handles them as data sets in parallel. Map-Reduce framework developers design Map and Reduce functions, while Dryad developers design data processing as a graph. Dryad can process data flows in the form of a DAG (Directed Acyclic Graph).
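To make the DAG idea concrete, the following minimal sketch (an invented illustration, not Dryad's API) executes a small dataflow graph: each vertex is a function, each edge carries a predecessor's output, and a vertex runs once all of its inputs are ready.

```python
# Minimal DAG dataflow sketch: vertices are functions, edges carry data.
def run_dag(vertices, edges, sources):
    """vertices: name -> function taking a list of inputs.
    edges: list of (src, dst) pairs. sources: name -> initial input list."""
    preds = {v: [] for v in vertices}
    for src, dst in edges:
        preds[dst].append(src)
    results = {}
    # Repeatedly run any vertex whose predecessors are all done (topological order).
    while len(results) < len(vertices):
        progressed = False
        for v, fn in vertices.items():
            if v in results or any(p not in results for p in preds[v]):
                continue
            inputs = sources.get(v, []) + [results[p] for p in preds[v]]
            results[v] = fn(inputs)
            progressed = True
        if not progressed:
            raise ValueError("graph has a cycle; not a DAG")
    return results

# Example: two parallel vertices feeding one merge vertex.
vertices = {
    "double": lambda xs: [2 * x for x in xs[0]],
    "square": lambda xs: [x * x for x in xs[0]],
    "join":   lambda xs: sorted(xs[0] + xs[1]),
}
edges = [("double", "join"), ("square", "join")]
out = run_dag(vertices, edges, {"double": [[1, 2]], "square": [[3]]})
print(out["join"])
# [2, 4, 9]
```

The 'double' and 'square' vertices have no edge between them, so a real engine could schedule them on different machines; only 'join' must wait for both, which is the parallelism a DAG makes explicit.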

Parallel data operation frameworks such as Map-Reduce and Dryad provide sufficient functions to process big data, but they present barriers to inexperienced developers, data analysts, and data miners. It is necessary, therefore, to develop methods at a higher level of abstraction that handle data in an easier manner. Apache Pig and Apache Hive, explained below, are two examples of such frameworks.

Apache Pig provides a high-level structure for combining and processing large quantities of data. Apache Pig supports the Pig Latin language. Pig Latin has the following characteristics:

- High-level structures such as relation, bag, and tuple are provided in addition to basic types such as int, long, and double.
- Relational operations (on relations/tables) such as FILTER, FOREACH, GROUP, JOIN, LOAD, and STORE are supported.
- User-designated functions can be defined.

A data processing program in Pig Latin is converted into a logical execution plan, which is in turn converted into a Map-Reduce execution plan.
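As a rough illustration in Python rather than Pig Latin itself, the FILTER / GROUP / FOREACH pipeline that Pig compiles into a Map-Reduce plan amounts to the following operations over a collection of tuples; the relation and field names here are invented for the example.

```python
from collections import defaultdict

# Hypothetical relation of (user, page, visits) tuples.
records = [
    ("ann", "home", 3),
    ("bob", "home", 1),
    ("ann", "news", 5),
    ("bob", "news", 2),
]

# FILTER: keep only tuples satisfying a predicate.
filtered = [r for r in records if r[2] >= 2]

# GROUP: collect tuples by a key field (here, the page).
groups = defaultdict(list)
for user, page, visits in filtered:
    groups[page].append(visits)

# FOREACH ... GENERATE: derive one output tuple per group.
totals = {page: sum(v) for page, v in groups.items()}

print(totals)
# {'home': 3, 'news': 7}
```

Each of these three steps maps naturally onto map and reduce stages, which is why Pig can translate a declarative-looking script into a Map-Reduce execution plan without the programmer writing those stages by hand.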

Apache Pig adopts an approach of designing large-quantity data processing programs in the style of procedural programming languages such as C and Java; Google's Sawzall adopts a similar approach. Other technologies instead adopt declarative data processing methods such as SQL, rather than specifying data processing procedures as in a programming language; Apache Hive, Google Tenzing, and Microsoft SCOPE are some examples.

Apache Hive is a technology for analyzing large quantities of raw data in stores such as HDFS and HBase by means of a query language called HiveQL. In terms of architecture, it can be divided into a Map-Reduce-based execution part, metadata information on the data storage, and an execution part driven by the queries received from users or applications. To support user extension, user-defined functions can be designated at the level of scalar values, aggregation, and tables.
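HiveQL's declarative style resembles SQL. Purely as an analogy — using Python's built-in sqlite3 module rather than Hive itself, and with an invented table — the grouped aggregation that a HiveQL query would turn into Map-Reduce jobs can be expressed as:

```python
import sqlite3

# In-memory database standing in for a Hive table; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user TEXT, page TEXT, hits INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [("ann", "home", 3), ("bob", "home", 1), ("ann", "news", 5)],
)

# Declarative query: the engine, not the programmer, decides how to
# scan, group, and aggregate - the convenience HiveQL offers over raw Map-Reduce.
rows = conn.execute(
    "SELECT page, SUM(hits) FROM visits GROUP BY page ORDER BY page"
).fetchall()
print(rows)
# [('home', 4), ('news', 5)]
```

The point of the analogy is the division of labor: the user states what result is wanted, and the system plans how to compute it — in Hive's case, by generating Map-Reduce jobs over HDFS or HBase data.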


    3. Big Data Applicability

Big data is based on data sets generated in all areas of human activity. Hence, such data may belong to public service areas, companies, or daily life. This chapter, accordingly, divides the utilization of big data into public service and general areas.

    3.1. Public Services

Rio de Janeiro, Brazil, prepares for various kinds of possible urban disasters by utilizing big data — a good example of big data being utilized for national management rather than for the innovative management of a business entity.

In preparation for the World Cup games in 2014 and the Olympic Games in 2016, Rio de Janeiro established its urban management and emergency response system through the IOC. Data and processes from more than 30 different agencies are integrated into the IOC so that general movement in the city can be monitored 24 hours a day, 365 days a year.

The IOC includes an integrated management system for water resources as well as traffic, electric power, and natural disasters such as floods and landslides. In particular, IBM's analysis solution makes it possible to predict and respond to emergency situations effectively. The high-resolution weather forecast system and hydrologic modeling system provided by IBM analyze a large quantity of weather and hydrology data and can predict a heavy rain 48 hours in advance.

Based on an integrated mathematical model of data extracted from river basin topographical materials, precipitation statistics, and radar photographs, future precipitation and sudden floods are predicted. In addition, other situations that affect the city, such as heavy traffic and blackouts, are also evaluated.

The new automatic alarm system notifies municipal officials and the emergency response team of any changes in flood and landslide predictions over the area of Rio de Janeiro. Unlike previous systems that required manual work for notification, this alarm system uses instant communication tools such as automatic email notification and SMS text messages to reach the emergency response team and citizens in emergency situations, and thus the response is more prompt and requires less time.

Singapore suffers serious traffic congestion due to a drastic increase in vehicles, and big data is regarded as a new solution among Singapore's administrative agencies. Singapore operates a traffic prediction system called TPT that, based on big data analysis, goes beyond existing real-time traffic information. The Singapore traffic bureau (LTA) predicts urban traffic conditions and flows utilizing the i-Transport system and other prediction tools. The prediction system consists of traffic flow analysis and prediction sub-systems. As the LTA traffic control center sends real-time traffic data collected from sensors, the traffic one hour ahead is predicted by modeling the traffic scenario. According to IBM, which provided the solution, the accuracy of the general prediction result is about 85%, and even higher in business centers where the traffic is heavier.

DC Water, which manages the water and sewage system in Washington, D.C., in the U.S., introduced a big data system for effective management of the sewage and collection system. The prediction analysis system made it possible to manage assets and facilities such as sewage piping, valves, the public water system, collection piping, manholes, and gauges effectively. By using this system, workers can check the location and status of company assets on detailed maps and promptly retrieve asset details and total costs, local problems, different types of regional water quality problems, and so forth. In particular, DC Water prevented unexpected service suspensions based on its prediction analysis and established a new rate model based on service demands.


In addition, the enhanced preventive measures and automatic meter reading reduced customer calls by as much as 36% and simplified the process. Currently, 93% (previously 49%) of tasks are handled within 10 minutes.

    3.2. Utilization for Living Facilities

What if an electricity charge 'bomb' fell on consumers? There might be no solution other than making up one's mind to save electricity. In particular, electricity and water bills do not itemize the details as credit card statements do, so there is no way to look into the specific items; one simply pays the charges every month. Big data, however, is expected to solve such problems.

Shwetak Patel, a MacArthur Fellow and assistant professor in the department of computer science at the University of Washington, has found a most reasonable way of calculating utility bills. His idea is based on the fact that all the electricity, water, and gas devices coming into a house emit specific digital signals. He designed a sensor that recognizes such signals through a simple algorithm; installed at points such as gas piping, electricity wiring, sewage piping, and ventilators, it produces and transmits digital signals to tablet PCs for real-time checking. For instance, it is possible to check the amount of electricity consumed by a certain electronic appliance and how much water or gas is being used. Patel connected the water and electricity measuring sensor to an iPad to measure the usage at his cousin's place and found that 11% of the electricity was consumed by the swimming pool pump.

This innovative energy sensor will be utilized in our daily life in the near future. In 2010, Belkin bought the patent for the energy sensor from Patel, and commercialization is in progress. The sensor will make it possible to grasp the exact amount of electricity consumed by certain devices and thus to take proper saving measures. Moreover, a large amount of information can be gained on the power supply to hundreds of thousands of households in a future smart city.

Everyone living in Singapore knows that it is difficult to catch a taxi on a rainy day. The Singapore-MIT technology institution conducted a project comparing 830 million GPS records from a taxi operation database with weather satellite data over two months. In 2011, special patterns were found in the middle of this project. Analysis of data from more than 16,000 taxis showed that many taxis would not move in a storm; GPS records, too, showed that when it was raining, a number of taxi drivers stopped the car and no longer picked up passengers.

Based on an analysis of taxi drivers' daily routine, the reasons were examined. Singapore taxi companies required drivers, upon an accident, to deposit 1,000 dollars from their monthly pay unconditionally until the close examination of the cause was completed. For this reason, taxi drivers would stop the car and wait until the weather turned fine rather than take risks on a rainy day. After the cause was found, the company rules were modified in a way better for the taxi companies, drivers, and passengers. This is one example of how big data can improve the quality of citizens' lives. Data analysis is not limited to mere data reading but should involve the insight to find unexpected causes and results.

There is another example showing that big data analysis is an important means of reading new distribution trends. In the U.S., the largest sales and shopping events of the year are held on 'Black Friday,' the day after Thanksgiving (the fourth Thursday of November). A big data analysis result, however, shows that 'Cyber Monday' is now attracting more attention as a shopping day than Black Friday. Cyber Monday is the first Monday after the Thanksgiving weekend, on which people back at work tend to be absorbed in online shopping, and this trend has caught


the attention of distributors. According to one report, online sales on Cyber Monday last year were 29.3% higher than on the previous Cyber Monday. Mobile traffic and mobile sales were 10.8% and 6.6% respectively, close to the 14.3% and 9.8% recorded on Black Friday. This result is based on terabyte-scale big data and a database of more than a million transactions a day from 500 major distributors all over the country, analyzed by IBM.

    4. Conclusions

Currently in our society, smart phones are commonly used, and various data-producing devices such as tablet PCs, cameras, and game consoles have recently emerged, increasing traffic drastically. In addition, as the volume and types of data diversify and data grows rapidly, the era of 'big data' seems to be just ahead. Today, it is reported that the amount of digital information handled around the world doubles every two years. As IT converges with other industry sectors, a large quantity of data is produced every day, and the issue of utilizing big data to address desires and demands for a better quality of life in the changing society is in the spotlight.

Recently, companies have sought ways of better corporate management and more efficient marketing based on big data analysis. Accordingly, system streamlining through the analysis of big data is being sought in various areas, including public sectors such as traffic systems, water resource systems, security systems, tax evasion prevention systems, and medical systems.

Multiple platforms handling big data basically consist of a storage system, a handling process, and an analysis mechanism. This study comparatively analyzed platform technologies related to the handling process among the three elements stated above.

In the future, interest in and use of big data platforms will continue and expand. The applicable areas, too, will go beyond pure IT and expand to every possible sector. When such efforts are consistently expanded and developed, the future society will open the door to a world of infinite possibility.

References

[1] Gartner, "CEO Advisory: Big Data Equals Big Opportunity", (2011).
[2] McKinsey, "Big Data: The next frontier for innovation, competition, and productivity", McKinsey Global Institute, (2011).
[3] P. Warden, "Big Data Glossary", O'Reilly Media, (2011).
[4] J. S. Jung, "3 Factors for Successful Big Data Usage: Resource, Technology, Manpower", Big Data Strategy Forum, (2012).
[5] IDC, "Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO", (2011).
[6] IDC, "New Analytics Strategies in the Big Data Era", (2011).
[7] R. Cattell, "High Performance Scalable Data Stores", (2010).
[8] MIT Sloan Management Review, "Big Data, Analytics and the Path From Insights to Value", (2011).
[9] J. Kelly, "Big Data: Hadoop, Business Analytics and Beyond", Wikibon, (2012).
[10] M. Hilbert and P. Lopez, "The World's Technological Capacity to Store, Communicate and Compute Information", Science, (2011).
[11] M. Choi and N. G. Kim, "Future Smart Device Development Architecture", IJSEIA, vol. 7, no. 3, (2013), pp. 311-322.
[12] R. D. Caytiles and B. J. Park, "A Study on Analysis and Implementation of a Cloud Computing Framework for Multimedia Convergence Services", IJSEIA, vol. 7, no. 2, (2013), pp. 219-226.
[13] H. Sug, "Generating Better Radial Basis Function Network for Large Data Set of Census", IJSEIA, vol. 4, no. 2, (2010), pp. 15-22.


    Authors

Byung-Tae Chun received his Ph.D. in computer engineering from Korea University in 2011. From 1989 to 1996, he was a researcher at KIST (Korea Institute of Science and Technology). From 1996 to 2004, he was a senior researcher at ETRI (Electronics and Telecommunications Research Institute). Since 2004, he has been a professor at Hankyong National University, Korea. He is a member of IEEK and KIIT.

Seong-Hoon Lee received his M.Sc. in Computer Science and Engineering from Korea University, Seoul, Korea, in 1995, and his Ph.D. in Computer Science and Engineering from Korea University in 1998. Since 1998, he has been a professor in the School of Information and Communication, Baekseok University, Korea. His main research interests include distributed systems, grid computing, and web services.

Copyright of International Journal of Software Engineering & Its Applications is the property of Science & Engineering Research Support soCiety, and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.