
Efficient Spatio-Temporal Network Analytics In Epidemiological Studies Using Distributed Databases

Mohammed Saquib Akmal Khan

Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University

in partial fulfillment of the requirements for the degree of

Master of Science in

Computer Science and Applications

Madhav Marathe, Chair
Sandeep Gupta

B. Aditya Prakash
Anil Vullikanti

November 13, 2014
Blacksburg, Virginia

Keywords: Data Analytics, Data Mining, Distributed Systems, Database Systems
Copyright 2014, Mohammed Saquib Akmal Khan


Efficient Spatio-Temporal Network Analytics In Epidemiological Studies Using Distributed Databases

Mohammed Saquib Akmal Khan

(ABSTRACT)

Real-time Spatio-Temporal Analytics has become an integral part of Epidemiological studies. The size of spatio-temporal data has been increasing tremendously over the years, gradually evolving into Big Data. Processing in such domains is highly data- and compute-intensive. High-performance computing resources are actively being used to handle such workloads over massive datasets. This confluence of high-performance computing and datasets with Big Data characteristics poses great challenges pertaining to data handling and processing. The resource management of supercomputers is in conflict with the data-intensive nature of spatio-temporal analytics. This is further exacerbated by the fact that data management is decoupled from the computing resources. Problems of this nature have driven the growth and development of tools and concepts centered around MapReduce-based solutions. However, we believe that advanced relational concepts can still be employed to provide an effective solution to these issues and challenges.

In this study, we explore distributed databases to efficiently handle spatio-temporal Big Data for epidemiological studies. We propose DiceX (Data Intensive Computational Epidemiology using supercomputers), which couples high-performance, Big Data and relational computing by embedding distributed data storage and processing engines within the supercomputer. It is characterized by scalable strategies for data ingestion, a unified framework to set up and configure various processing engines, and the ability to pause, materialize and restore images of a data session. In addition, we have successfully configured DiceX to support approximation algorithms from the MADlib Analytics Library [54], primarily Count-Min Sketch or CM Sketch [33][34][35].

DiceX enables a new style of Big Data processing, which is centered around the use of clustered databases and exploits supercomputing resources. It can effectively exploit the cores, memory and compute nodes of supercomputers to scale processing of spatio-temporal queries on datasets of large volume. Thus, it provides a scalable and efficient tool for data management and processing of spatio-temporal data. Although DiceX has been designed for computational epidemiology, it can be easily extended to different data-intensive domains facing similar issues and challenges.

We thank our external collaborators and members of the Network Dynamics and Simulation Science Laboratory (NDSSL) for their suggestions and comments. This work has been partially supported by DTRA CNIMS Contract HDTRA1-11-D-0016-0001, DTRA Validation Grant HDTRA1-11-1-0016, NSF - Network Science and Engineering Grant CNS-1011769, and NIH and NIGMS - Models of Infectious Disease Agent Study Grant 5U01GM070694-11.

Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government.


Dedication

I dedicate this work to my parents, siblings and my lovely nieces - Alesha and Ritisha, and nephew, Zayaan.


Acknowledgments

All the praises be to Allah, the Lord of the ’Alamin

I would like to thank my Chair and Committee members for their invaluable guidance and support. Dr. Sandeep Gupta has been an inspirational mentor, and I am highly indebted to him for his time and efforts. His valuable inputs and guidance have been the yardstick of my research. Working with him has been a fruitful and exciting quest for knowledge. I am also equally grateful to Dr. Jiangzhuo Chen for his immense contributions towards shaping my research.

I am thankful to Dr. Madhav Marathe for being a wonderful advisor and giving me the opportunity to work at NDSSL-VBI. Working at NDSSL has been an extraordinary experience. It has provided me great exposure to quality research in varied areas. I am grateful to Dr. Aditya Prakash and Dr. Anil Vullikanti for being on my committee and providing valuable inputs on my research. I would also like to extend my gratitude to Dr. Edward A. Fox and Dr. Sunshin Lee for their support and guidance.

Finally, I express my deepest gratitude to my family and friends for their unconditional love, support and encouragement. My parents - Shahnaz and Ajmal have inspired me each day to work towards my goals. I sincerely thank my siblings - Ahtesham, Faisal, Shazia and Shariq for their constant support and love. I would especially like to thank Faisal for believing in me and supporting my decisions. My sincere love and affection to Sareena, Kamran, Zeba, Alesha, Zayaan, Adeeba, Zaara, Shaheera and Numra for being so kind and supportive.


Contents

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Our Contribution
  1.4 Organization of the Thesis

2 Spatio-Temporal Analytics
  2.1 Data sets in Computational Epidemiology
  2.2 Spatio-Temporal Queries

3 Related Work
  3.1 MapReduce and Hadoop
  3.2 Distributed and Parallel Databases
  3.3 NoSQL Databases
  3.4 Other Approaches

4 Postgres-XC
  4.1 Architecture
    4.1.1 Global Transaction Manager or GTM
    4.1.2 Coordinator
    4.1.3 Datanode
    4.1.4 Single-Compute node and Multiple-Compute node setup
  4.2 Query Planner
  4.3 Performance Tuning

5 Count-min sketch or CM sketch
  5.1 Data structure and Operations
  5.2 MADlib implementation using UDA and UDFs
  5.3 CM Sketch example

6 Hive
  6.1 Architecture and Setup
  6.2 Query Processing
  6.3 Execution Plan

7 DiceX Framework
  7.1 Introduction
  7.2 Architectural Overview
  7.3 Coupling Distributed Databases on Supercomputers
    7.3.1 Resources, Quota, and Allocation Management of Jobs
    7.3.2 Remote vs. Local Persistent Storage
    7.3.3 Managing Port Conflicts
  7.4 Data Management
    7.4.1 Data Storage and Distribution
    7.4.2 Data Ingestion
  7.5 Query Processing
  7.6 Indexing and Clustering
  7.7 CM Sketch
    7.7.1 Implementation on Postgres-XC
    7.7.2 Aggregation Modes
    7.7.3 Query Plan
    7.7.4 Cost Analysis
  7.8 Data Intensive Computing using DiceX

8 Experiments and Analysis
  8.1 Experimental Setup
    8.1.1 Hardware Configurations
    8.1.2 Test Data Setup
  8.2 Query Plan
  8.3 Query Performance
    8.3.1 Single-compute node versus multi-compute nodes
    8.3.2 Indexing and Clustering
    8.3.3 Evaluation of cmsketch on DiceX
    8.3.4 Performance comparison with RDBMS and MapReduce based tools

9 Future Works and Conclusion

Bibliography

Appendix A Notations

Appendix B Table Schema

List of Figures

2.1 Flucaster Front End - provides data visualization support for spatio-temporal queries on epidemiological data.

2.2 County-level Infection - illustrates the county-level infection of a particular gender belonging to a particular age group over a given duration.

4.1 PGXC Architecture.

4.2 Query Handling in Postgres-XC. The query Q is broken down into coordinator operations and Remote Queries, RQ. Each datanode executes the remote query and sends the result to the coordinator. The coordinator consolidates the results from all the datanodes and generates the output of the query.

4.3 Query Life Cycle.

5.1 CM Sketch with Update operations.

5.2 cmsketch count - a scalar User Defined Function (UDF) in the MADlib Library to compute the approximate number of occurrences of a value in a column summarized by a cmsketch.

5.3 Array count[w, d] for CM Sketch.

6.1 Hive Architecture.

6.2 Map and Reduce in Hadoop.

6.3 Stage Plan for Query Q6 on Hive. Each individual stage consists of MapReduce jobs to be executed on the Hadoop cluster.

7.1 DiceX Framework.

7.2 Data Migration - Standalone (Oracle) to Parallel (Postgres-XC) databases.

7.3 Parallel Ingestion using multiple views.

7.4 DiceX allows optimized data ingestion into a table from a file or from a table in a remote database by partitioning the data source and ingesting each partition concurrently. The plot shows performance varying with the number of partitions when the remote data source is a file vs. a table in a database. Although the ingestion scales with the number of partitions, it is interesting to observe that both the file and the remote database have similar costs, even though ingestion from a remote database involves several additional steps and the data travels a larger distance. This is possibly because of the high insertion cost in Postgres-XC itself, which dominates the cost of read and transfer. The Y-axis is in log scale.

7.5 Performance of parallel ingestion using the copy command. The Y-axis is in log scale.

7.6 Two-phased and Three-phased aggregation modes for CM sketch on Postgres-XC.

7.7 Comparison of query plans for the Two-phased and Three-phased aggregation modes for cmsketch (without indexing and clustering).

8.1 Query Plan for Q6.

8.2 Query Plan for Q8 and Q9.

8.3 Single Compute node setup vs. Multiple Compute node setup.

8.4 Query execution with and without index for different Spatio-Levels of State (Q6).

8.5 Query execution with and without index for Spatio-Levels county (Q8) and block (Q9).

8.6 Speedup for different Spatio-levels.

8.7 CM Sketch at coordinator and datanodes for (a) Point Query (b) Range Query.

8.8 Performance of spatio-temporal queries using the DiceX framework compared to Oracle (standalone database) and Hive (MapReduce). The Y-axis is in log scale.

List of Tables

2.1 Epidemiology Datasets

2.2 Structure and Spatio-Level of the Flucaster queries

2.3 Spatio-temporal queries

7.1 Symbols and Definitions

7.2 Basic operations for CM Sketch cost analysis


Chapter 1

Introduction

1.1 Background

Epidemiology can be defined as a formal branch of science, which studies space-time patterns of disease in a population and the factors that contribute to these patterns [19][45][71]. These studies often represent a demographically and geographically diverse population, where spatio-temporal data¹ comes from varied sources. Data size has become an integral feature of large-scale epidemiological studies, and its growth translates directly into storage and computational bottlenecks [29][95].

¹The terms spatio-temporal data and epidemiological data are used synonymously.

Large-scale epidemiological studies require a high degree of computation owing to the nature of the data and the operational complexities. The need for compute-intensive and data-intensive computations, along with technological advancements, led to the development of a multidisciplinary field, Computational Epidemiology. It develops and uses computer models to understand the spatio-temporal diffusion of disease through populations [22][24].

The size of epidemiological data is becoming so large that the resulting datasets are unique and never-to-be-replicated resources [56]. Hernan et al. suggest that scaling up epidemiologic resources requires them to be consolidated into increasingly large clusters. The Angioedema assessment [95] (a part of the Mini-Sentinel program [23]) provides real-world experience of using a distributed network of databases to support large-scale epidemiological studies. As the data size increases, it becomes imperative to manage the data efficiently. Meyer et al. [74] propose the concept of central data management for large-scale medical and epidemiological studies. Storage and management of large volumes of spatio-temporal data is thus crucial for large-scale epidemiological studies.

In this study, we propose a solution for efficient storage and management of large volumes of epidemiological data using distributed databases coupled with high-performance computing resources. It addresses the data management and processing needs of spatio-temporal analytics.

Spatio-temporal analytics in epidemiology is ingrained with the issues and challenges posed by the large volume of epidemiological data. In [73], Meliker et al. provide an interesting review of the critical domains regarding the principles and opportunities in spatio-temporal epidemiological studies. We propose an efficient solution to store and manage large volumes of epidemiological data and to support scalable spatio-temporal analytics on this data.

1.2 Motivation

Due to global connectivity, interactions between populations across the globe have grown significantly over the last couple of decades. This also leads to situations where an epidemic can quickly spread to a very large population in a very short duration of time. The H1N1 outbreak [6] and the recent Ebola virus outbreak [4] are a few examples of rapid disease transmission through direct contacts between people. Hence, there is an increased requirement to model and simulate regions that span several countries. Consider the combined population of the USA and Mexico, which is nearly 446 million (Worldometers [15], as of 12 Nov 2014). A factorial design to simulate a population of this size with different interventions, parameter values, stochasticity and other factors might lead to 10-100 runs. This generates simulation output data several hundred gigabytes in size. Although more complex analytics is desired, even simple spatio-temporal analytics is challenging at this scale. Our goal is to design a scalable system using supercomputing resources to efficiently handle large datasets and provide real-time interaction with them.

One of the challenges in designing such a system is the conflict between the resource management and design of supercomputers and the data- and compute-intensive nature of epidemiological studies. The problem is further exacerbated because the management of data is decoupled from the computing resources: computational requirements are met through the use of supercomputers, while data-intensive computing is still handled via various scripts and programs that connect (directly or indirectly) to databases. This approach is not sustainable for epidemiological studies, where an increasingly large fraction of all computational activities are becoming data-intensive. Existing relational database servers are limited in both computational capacity and bandwidth to handle the entire computational workload. On the other hand, it is too cumbersome to migrate data to supercomputers, perform the computation and ingest derived data back into the database, due to the lack of tools and software available as part of the computational framework. Moreover, much of epidemiology computation requires relational-style processing (see [25]), which is not available on supercomputers.

We investigated and found that if supercomputing resources and distributed databases are used in tandem, they can provide an effective solution to the issues and challenges posed by compute- and data-intensive science. In order to achieve this, we need to align them and exploit their capabilities and capacities. Our prime objective is to build a scalable data-intensive system that applies supercomputer cores to all aspects of spatio-temporal analytics in computational epidemiology. This system should be scalable and robust, capable of providing efficient data management along with satisfying the computational needs of such data-intensive sciences.

DiceX is our toolset that provides a data-intensive computation layer on top of the current supercomputer system using data engines such as Postgres-XC and (multiple) Postgres standalone databases. This layer enables various styles of massively parallel data-intensive computations, in particular those arising in spatio-temporal analytics. The toolset provides efficient utilities for the entire spectrum of data-intensive computations using supercomputers, which include scalable strategies for ingestion of external datasets, a unified framework to set up and configure various processing engines, and the ability to pause, materialize, and restore the image of a data session.

COUNT aggregate (GROUP BY) queries are central to analysis in computational epidemiology. This class of aggregate queries requires a full scan of the data [100], which tends to be very expensive. Traditionally, this was not a problem, since aggregate queries were confined to data warehousing, where response time was not critical. All analytics operations used to be performed overnight and materialized for further processing. Computational epidemiology and other Big Data applications require such analytics to be performed in real to near real time. Real-time or near-real-time response is critical for decision support systems, which require instantaneous analysis for risk mitigation and planning in case of epidemic outbreaks. It can be useful in forecasting and in providing effective support and guidance to policy-makers. DiceX aims to address the scaling and acceleration of these aggregate queries to provide near-real-time solutions. It utilizes supercomputing resources to scale aggregate queries over very large datasets by executing existing sequential relational and sketch operators in parallel.
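To make this concrete, the following is a minimal SQL sketch of the two styles of COUNT aggregation DiceX targets. The table and column names (infection_events, exposure_day, state_fips) are hypothetical placeholders rather than the schema used in this thesis, and the cmsketch/cmsketch_count/cmsketch_rangecount calls follow the MADlib sketch module; exact signatures may vary between MADlib versions.

```sql
-- Exact spatio-temporal COUNT aggregate: infections per day of exposure in one state.
-- This requires a full scan of the (potentially very large) fact table.
SELECT exposure_day, COUNT(*) AS infected
FROM infection_events
WHERE state_fips = '48'              -- hypothetical FIPS code for one state
GROUP BY exposure_day
ORDER BY exposure_day;

-- Approximate counterpart using MADlib's Count-Min sketch: build a sketch over
-- exposure_day, then probe it with a point query and a range query.
SELECT madlib.cmsketch_count(madlib.cmsketch(exposure_day), 42)          AS infected_day_42,
       madlib.cmsketch_rangecount(madlib.cmsketch(exposure_day), 30, 90) AS infected_days_30_90
FROM infection_events
WHERE state_fips = '48';
```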

1.3 Our Contribution

DiceX enables a new style of Big Data processing, which is centered around the use of database technologies and exploits the supercomputing run-time and resources. We show that it can effectively exploit the cores, memory, and compute nodes of supercomputers to scale various kinds of processing, primarily those supporting spatio-temporal analytics. Although it has been used solely for computational epidemiology, we believe that it has an important role to play in all data-intensive sciences that experience variety and veracity in datasets.

Our contributions in this thesis are as follows:

• DiceX, a scalable framework built with a parallel database system within an HPC environment. We demonstrate the use of DiceX for spatio-temporal analytics on large sets of identically structured datasets.

• It supports functionality to pause a running database session and migrate its image to another set of compute nodes. The database session can be resumed later on a new cluster of machines. This feature makes a completely different style of computation feasible, where a data image is quickly made “alive” over a large number of compute nodes to enable a data-intensive analysis or to drive simulations. The fast turnaround time implies effective usage of supercomputing resources.

• We propose and study the performance of parallel data ingestion in DiceX. Using this method, large datasets can be made alive for various epidemiological workloads in a fraction of the overall analysis time.

• We enable DiceX to support approximation algorithms, such as Count-Min Sketch or CM Sketch, for distributed and approximate processing of count queries (Point and Range). We also provide a cost analysis of CM Sketch on DiceX.

• A performance study of existing RDBMS systems compared with DiceX and MapReduce-based approaches.

• Finally, we were able to provide near-real-time spatio-temporal query processing on large datasets. We successfully used DiceX to provide data management support for Flucaster [77], which is a real-time application.

Based on our experience with data-intensive computing within supercomputers, we argue for the following changes in the overall supercomputer setup to make it more accessible for data-driven sciences:

• Persistent storage should be part of the interconnect so that MPI-IO can be used for data transfer among compute nodes.

• Emphasis should be placed on distributed storage (local to the compute node) as opposed to storage shared over the network.

• Data-intensive computation seems to be limited by core count.

• The network stack has many inherent inefficiencies.

1.4 Organization of the Thesis

Spatio-temporal analytics is our area of focus, and Chapter 2 is dedicated to covering various aspects of spatio-temporal analytics. Literature review and related work are covered in Chapter 3, wherein we discuss the various state-of-the-art tools and concepts related to our area of research. The foundation of our work is distributed databases, and Chapter 4 gives an overview of the architecture of Postgres-XC. We discuss the concepts and algorithms of cmsketch in Chapter 5. When Big Data processing is involved, it becomes imperative to discuss MapReduce-based tools; Hive (Chapter 6) is an interesting tool that provides a relational flavor to Hadoop. Finally, we describe our proposed tool, DiceX, in Chapter 7 and its performance evaluation in Chapter 8. Chapter 9 discusses the future and scope of DiceX in spatio-temporal analytics and in other domains.


Chapter 2

Spatio-Temporal Analytics

Spatio-Temporal Analytics deals with the change of spatial attributes of objects over time. The importance of spatio-temporal mining is growing rapidly with the increasing size of datasets. It draws great interest from both industry and academia. We can find its use in varied areas, from Wireless Communication, Transport Systems [37], Situation Awareness Applications [50], Distributed Camera Networks [58], Epidemiological Studies [73], and Geo-Spatial Applications to other such areas involving data-intensive and compute-intensive processing. We discuss a few applications of Spatio-Temporal Analytics in the following paragraphs.

TransDec [37] provides a data-driven framework for efficient management of real-time spatio-temporal data for large-scale transportation systems. It also provides efficient processing of real-time spatio-temporal queries on transportation networks. Hong et al. [58] identified large-scale scenarios as a key problem for the scalability of spatio-temporal analysis on camera networks; to overcome these performance bottlenecks, they proposed a distributed system model using resources in the cloud. [50] extends the use of spatio-temporal analytics to social and behavioral sciences and provides a detailed study of spatio-temporal signatures of robbery, burglary and assault. Spatio-temporal query processing is the driving force for geo-spatial applications. These classes of queries are I/O- and compute-intensive. When they are coupled with large volumes of geo-spatial data, they pose great opportunities along with challenges to current state-of-the-art query processing techniques. [103] proposes HaSTE, an efficient spatio-temporal query processing scheme over big geo-spatial data. HaSTE is a spatio-temporal extension of Hadoop that provides an efficient solution to the aforementioned problems.

2.1 Data sets in Computational Epidemiology

The overall methodology of computational epidemiology consists of network-based models and computer simulations. Primarily, it consists of a model for communicable diseases that captures their important characteristics [71]. A synthetic population generator uses various spatial, census, and survey data to create a realistic demographic. The resulting demographic matches the actual population base as closely as possible on parameters relevant for contagion study, such as the type and geographical location of houses, the inhabitants of these houses, their age, gender, income, and, most importantly, their daily activities. The strategies for computation of contagion are then applied to the given synthetic information and an instance of the disease model. The computation takes an instance of the disease model and population data from the synthetic generator as input. The computational engine performs agent-based, discrete-time-step simulation, where each individual in the population is treated as an agent and the duration is divided into a fixed number of time-steps. In each time-step, the simulation computes the health of each individual based upon the disease model and the activities recorded by the synthetic generator. The output of these simulations provides us with the data sets used in computational epidemiology.

EpiFast [24] is an HPC-based, fast and scalable simulation tool capable of simulating disease dynamics over large dynamic social contact networks. It can be used to study the spatio-temporal diffusion of diseases through populations. EpiSimdemics [22] is another scalable HPC-based framework to simulate the spread of infectious disease in large and realistic social contact networks using individual-based models to compute disease dynamics.

Briefly, the types of data sets used in a computational epidemiology workflow are given below:

Synthetic Population Data: It describes individuals and their activities.

Synthetic Contact Data: It describes activities performed by the individuals, which include travels and visits to locations of work, home, and daily chores. As part of the activity, people come in close contact with other individuals. The contact data captures this information and is central to the study of epidemics.

Model Data: They describe how the virus affects the population and are characterized by attributes like susceptibility, infectivity, symptom severity, prodrome severity, and incapacitation [96]. For some diseases, weather is correlated with the spread of the disease.

Surveillance Data: Describes continuously gathered health data for a given population. Epidemic surveillance data can come from the Centers for Disease Control and Prevention, the World Health Organization, national respiratory and enteric virus surveillance systems, as well as systems like Google Flu Trends.

Social Media Data: Provides information about people and their behavior. Individuals frequently interact, share, and exchange information through social media. Social media posts can help researchers to identify and analyze epidemics.

Simulation Setup Data: Computational epidemiology experiments are divided into multiple cells. Each cell characterizes one set of quantified simulation conditions, and may have many replicates (25 is typical). Numerous intervention actions are applied to the epidemic simulation for decision-making processes.

Simulation Output Data: It describes the spread of disease through a population.

Category                     Data                                  Size      Storage
Synthetic Population         Household, Person, Activity           566 GB    Relational
Social Network and Output    Contact Network, Simulation Output    1.84 TB   File
Experiment                   Experiment                            240 GB    Relational

Table 2.1: Epidemiology Datasets

As is evident from these datasets, agent-based simulation involves large-scale data sets of varying types. Table 2.1 shows the current size (for the U.S., U.K., and a few other small countries) of the synthetic population, the social network, and experiments. Another significant challenge for data management arises from the logistics and incremental nature of the domain science. The datasets are distributed across several different databases, schemas, file systems, and machines, which makes experiment setup, execution, and analysis a challenging task. Handling such large datasets requires supercomputing resources.

2.2 Spatio-Temporal Queries

Spatio-temporal queries are an interesting class of queries, which process data defined by both time and space. They are of significant importance when analyzing epidemiological datasets. Analysis and visualization of large spatio-temporal datasets is one of the key challenges facing computational epidemiology. Flucaster [77] is an in-house application for Disease Forecasting and Situation Assessment using agent-based stochastic simulation [24]. It is an interactive application that allows users to generate their own queries by setting the space and time filters. It offers visualization support for spatio-temporal analytics on the Flucaster dataset. Almost all spatio-temporal queries in Flucaster contain COUNT aggregates on either spatial, demographic, or temporal attributes [77]. Figure 2.1 shows the front end of the Flucaster application.

Figure 2.1: Flucaster Front End - provides data visualization support for spatio-temporal queries on epidemiological data.

Query  Structure                                   Spatio-temporal Level
Q1     Oe ◦ Γe ◦ σe(SES)                           State
Q2     Oe ◦ Γe ◦ ⟦c⟧ ◦ ⋈p ◦ σe(DI, SES)            County
Q3     Oe ◦ Γe ◦ ⟦b⟧ ◦ ⋈p ◦ σe(DI, SES)            Blockgroup
Q4     Oe ◦ Γe ◦ ⟦e⟧ ◦ σe(SES)                     State
Q5     Oe ◦ Γe ◦ ⟦a⟧ ◦ ⋈p ◦ σe(DI, SES)            State
Q6     Oe ◦ Γe ◦ ⟦e, g, a⟧ ◦ ⋈p ◦ σe(DI, SES)      State
Q7     Oe ◦ Γe ◦ ⟦c, e, g, a⟧ ◦ ⋈p ◦ σe(DI, SES)   County
Q8     Oc ◦ Γc ◦ ⋈p ◦ σc(DI, SES)                  County
Q9     Ob ◦ Γb ◦ ⋈p ◦ σb(DI, SES)                  Blockgroup
Q10    Oc ◦ Γc ◦ ⟦e⟧ ◦ ⋈p ◦ σc(DI, SES)            County
Q11    Ob ◦ Γb ◦ ⟦a⟧ ◦ ⋈p ◦ σb(DI, SES)            Blockgroup
Q12    Ob ◦ Γb ◦ ⟦e, g, a⟧ ◦ ⋈p ◦ σb(DI, SES)      Blockgroup

Table 2.2: Structure and Spatio-Level of the Flucaster queries

Spatio-temporal queries in Flucaster fit perfectly well for the performance evaluation of DiceX. Currently, the Flucaster application uses RDBMS systems for data storage and management; this spatio-temporal data is generated by EpiFast [24]. Flucaster is an ideal candidate for our study as it provides an opportunity to compare and evaluate the performance of an existing RDBMS with DiceX without considerable effort. As an example, let us find the count of infected people using Flucaster with the details given below:

State : Texas

Age : Preschool (0-4), School Age (5-18)

Gender : Male

County : Anderson

Duration : 07-20-2014 to 10-20-2014 (3 Months)

Figure 2.2: County-level Infection - illustrates the county-level infection of a particular gender belonging to a particular age group over a given duration.

The Flucaster application transforms these input details into a spatio-temporal query. The query generated in this case is at the spatio-level of county and can be identified as class Q7, as described in Table 2.2. It counts the number of people infected in Anderson county, Texas, belonging to the male population under the age of 18 years in the last three months. Once the query is generated, it is executed and the results are displayed on the application front end [77]. Figure 2.2 gives the infection count for the query along with a visualization of the data. Similarly, we can compute the state-level and block-level infection counts using the Flucaster application.
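A minimal SQL sketch of this Q7-style query is given below. The table and column names (infections, persons, exposure_date, and so on) are hypothetical placeholders used only for illustration; the actual table schema is listed in Appendix B.

```sql
-- Q7-style query: male population, age 18 or under, Anderson county (Texas),
-- infected during a three-month exposure window.
-- Hypothetical schema: infections(pid, exposure_date, state, county),
--                      persons(pid, age, gender)
SELECT COUNT(*) AS infected
FROM infections i
JOIN persons p ON p.pid = i.pid
WHERE i.state  = 'Texas'
  AND i.county = 'Anderson'
  AND p.gender = 'Male'
  AND p.age   <= 18
  AND i.exposure_date BETWEEN DATE '2014-07-20' AND DATE '2014-10-20';
```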

The queries generated by Flucaster can be very effective for studying spatio-temporal analytics on epidemiological datasets. We have selected 12 different classes of queries (see Table 2.3) based on Flucaster and have used them for various experiments. These different classes help us analyze data at different spatio-levels. The structure of these queries is illustrated in Table 2.2 along with their spatio-level.


Query  Description
Q1     How many people got infected in a particular state on different days of exposure?
Q2     How many people got infected in a particular county on different days of exposure?
Q3     How many people got infected in a particular block on different days of exposure?
Q4     How many people got infected in a particular state for a given duration of exposure?
Q5     How many people of a given age group got infected in a particular state on different days of exposure?
Q6     How many people of a given age group and gender got infected in a particular state for a given duration of exposure?
Q7     How many people of a given age group and gender got infected in a particular county for a given duration of exposure?
Q8     How many people got infected in different counties in a particular state?
Q9     How many people got infected in different blocks in a particular state?
Q10    How many people got infected in different counties in a particular state for a given duration of exposure?
Q11    How many people of a given age group and gender got infected in different blocks in a particular state?
Q12    How many people of a given age group and gender got infected in different blocks in a particular state for a given duration of exposure?

Table 2.3: Spatio-temporal queries


Chapter 3

Related Work

Taylor [91] points out that analysis of ultra-large-scale data sets will pose great challenges over the years to come. Spatio-temporal analytics in large-scale epidemiological studies requires support for data-intensive and compute-intensive processing. [71] highlights that current advances in computing, Big Data, and computational thinking can greatly influence real-time epidemiological studies. It also points out that network-based computations are compute-intensive and data-intensive. Computational models provide great insight into the space-time dynamics of real-time epidemics. Constructing a realistic synthetic population from varied sources can pose another challenge in such models. There has been a considerable amount of development in tools and concepts in this area of research. In this chapter, we will discuss the state-of-the-art approaches along with some new concepts, tools and technologies.

RDBMS systems are known to be stable and structured, but expensive. Over the years, they have been our only alternative for data storage and processing. The advent of Big Data [68][70][72] posed a new challenge: handling and processing data of tremendous size. As the data size grew, RDBMS systems became more expensive. The sheer size of the data overwhelmed RDBMS systems, and they failed to handle the growing demands of Big Data. The size of the data was not the only feature of Big Data that posed challenges to RDBMS systems; the nature and structure of the data also provided enough fuel to look for efficient and inexpensive alternatives. Thus began the era of MapReduce, distributed databases and other such concepts and technologies.

Storing, managing and processing large volumes of data poses significant challenges. Thus, we see an opportunity for developing a distributed system which has all the desired features for handling such large volumes of data and provides scalable data processing. These desired features have been identified to a great extent in [102]. Such systems should be able to provide distributed data storage, efficient query processing, and support for approximate solutions with provable error bounds. The primary focus of our study is spatio-temporal analytics in epidemiological studies; however, we will also explore several other tools and concepts across domains which can be applied to our scenario.


3.1 MapReduce and Hadoop

MapReduce-based systems such as Hadoop have been explored and used to a great extent in bioinformatics [91]. They have become an obvious choice when dealing with large datasets. Hence, we explore and discuss the various tools and concepts related to Hadoop. [36] defines MapReduce as a programming model for generating and processing large datasets, where users define the computation in terms of map and reduce functions. It proposes a new abstraction that scales to large clusters of machines, parallelizes computation and is capable of distributed computations with a high degree of fault tolerance. It is also built to handle other issues and challenges associated with processing and generating real-world large datasets. [42] provides a very interesting review of the state of the art in improving the performance of parallel query processing using MapReduce. It explores the different aspects of query processing related to data analytics over massive data sets in MapReduce and provides a detailed classification of existing approaches based on optimization objectives. [67] characterizes the MapReduce framework and discusses its inherent pros and cons. It also highlights the issues and challenges related to parallel data analysis with MapReduce.

Hadoop-GIS [18] is a scalable spatial data warehousing system that provides spatial querying based on Hadoop MapReduce through spatial partitioning. Fastmod [92] provides a framework for efficient query processing and real-time analysis of mobility data using Hadoop [7][66][83]. SpatialHadoop [44] is also based on Hadoop MapReduce, with high-level language support for spatial data. Witayangkurn et al. [97] designed and developed a large-scale data management system for handling massive spatio-temporal datasets using Hadoop. The system has three main features - big data, spatial support and data mining - which make it suitable for spatial analysis on mobile phone data. There are many other MR- and Hadoop-related tools developed for query processing and optimizations, like HadoopToSQL [61], HadoopDB [17], Hadoop++ [40], HAIL [41] and others.

[88][89] highlight the existence of a gap between Big Data analytics and spatio-temporal data storage. They point out that the underlying storage systems in Hadoop (i.e., HBase [46] and HDFS [27][82]) do not support multi-dimensional range scans. CloST [89] is a scalable big spatio-temporal data storage system based on Hadoop. It uses Hadoop instead of RDBMS systems because it can exploit low-level APIs for handling complex data mining algorithms. It uses three levels of data partitioning (Level-1 - bucket, Level-2 - region and Level-3 - block). Level-2 and Level-3 correspond to spatial and temporal partitioning. The Level-3 partitioning corresponds to the Block File, which is a space-efficient file format on HDFS that compresses data at the column level as well as the section level. OceanST [102] is an in-memory MapReduce solution which uses a random index sampling (RIS) algorithm to approximate solutions with a guaranteed error bound. [53] points out that the default configuration of Hadoop is slow, and that it can be optimized to run faster. It provides a performance comparison of the default Hadoop configuration with a configuration employing the Cloudera Configurator [30]. The experiments were conducted on a 10-node cluster for both of these configurations. The results showed that for I/O-bound applications, the configuration employing the Cloudera Configurator was 1.5 times better than the default one. However, for CPU-bound applications, the performance was only marginally better. They also pointed out the various factors that might have influenced these performance results, such as the maximum number of mappers and reducers, block size, threads, schedulers and others.

Hive [8] is an open-source data warehousing solution built on top of Hadoop. It supports SQL-like queries which are compiled into MapReduce jobs that are executed using Hadoop [93]. [28] explores the best structure and physical modeling techniques for storing data in a Hadoop cluster using Apache Hive to enable efficient data access. [51] introduces an extension to the Hive Metastore by extracting column-level statistics in Hive, thereby improving the overall performance of HiveQL query execution. Murthy et al. [76] address the issue of high latency for interactive analysis in Hive and propose Peregrine, a distributed query engine built on top of the Hadoop Distributed File System (HDFS) and the Hive Metastore. Peregrine is a combination of an in-memory, serving-tree based computation framework with approximate query processing. [62] analyzes the impact of cluster characteristics on HiveQL query optimization and also discusses Sqoop [94], a tool for transferring data from relational databases to non-relational databases.

Hive has limited options for query optimization compared to existing parallel databases, like Postgres-XC and others. MapReduce-based tools do not inherently support schemas and indexing, which are important features to speed up access to large volumes of data. Hive enables the use of schemas and indexing on MapReduce, but it comes with its own set of limitations. Unlike the automatic index maintenance in Postgres-XC, index handling and creation is complex in Hive. Hive indexes are implemented as tables, and the query has to be rewritten to use the newly created “index tables”. In the case of really large tables, index building can get very costly. Similarly, there are various other areas where Hive-based tools lag behind Postgres-XC. Query planning and optimization play a crucial role in query processing. AQUA [99] is a tool based on Hive that embeds a query optimizer into Hive to generate efficient query plans for MapReduce-based processing systems. For a given SQL-like query, AQUA generates a sequence of MapReduce jobs which minimizes the cost of query processing. The Map and Reduce operations need to be customized to handle the query optimizer. It provides an optimized query plan by reducing the number of MapReduce jobs, reducing the size of intermediate results and adjusting join sequences using a cost-based optimizer. However, handling multiple-join queries is more complex in AQUA than in Postgres-XC, as it leads to more shuffling of data during the reduce phase, thereby increasing I/O cost. It also does not address bottlenecks related to index handling and maintenance. Postgres-XC provides various query processing and optimization techniques that could be adopted by AQUA to enhance its query optimizer.
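To make the index-table point concrete, the following is a minimal HiveQL sketch using the Hive 0.x-era indexing syntax; the table and column names are hypothetical. The index data lives in a separate index table that must be rebuilt explicitly, which is itself a MapReduce job.

```sql
-- Create a compact index on a hypothetical infections table (Hive 0.7+ syntax).
-- Hive materializes the index in its own "index table"; it is not maintained
-- automatically as the base table changes.
CREATE INDEX idx_county
ON TABLE infections (county)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- Populate (and later refresh) the index table; for very large base tables this
-- rebuild can be an expensive MapReduce job in its own right.
ALTER INDEX idx_county ON infections REBUILD;
```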


3.2 Distributed and Parallel Databases

Due to recent trends and faith in MapReduce approaches, we have been overwhelmed with MapReduce- and Hadoop-based tools. However, we would like to shift our focus to the existing state-of-the-art tools in distributed and parallel databases. A parallel database system [84] can be defined as a database management system (DBMS) implemented on a multiprocessor system with high-degree connectivity. [79] provides a comprehensive study of distributed and parallel database systems. [20] discusses design principles and core features of systems for massively parallel computation and storage techniques on large clusters of nodes for large volumes of data. [63] provides great insight into the issues and challenges of handling large volumes of data in parallel systems.

RDBMSs were efficient when data sizes were not as huge as they are in recent times. We believe that there is a strong need for relational queries in spatio-temporal data mining. DeWitt et al. [38] point out the fact that relational queries are well suited for parallel processing. The relational operators in parallel databases support pipelined parallelism¹ as well as partitioned parallelism². Bubba [26] is a scalable shared-nothing parallel database for data-intensive operations. Postgres-XC [10][21][64] is another open-source shared-nothing cluster system that is built as an extension of Postgres [75][86]. [78] provides the detailed architectural design and implementation of the Postgres-XC database. It is an integral part of our framework and will be discussed in detail in the following sections. Tandem [90], Gamma [39], Volcano [47] and XPRS [85] are some of the works published by the research community on parallel and distributed database systems. Teradata [13] and Tandem [12] (now part of Hewlett-Packard) provide enterprise solutions for parallel processing databases.

¹Streaming the output of one operator into the input of another, making these operators work in series. Thus, uniform operations are applied to uniform data, resulting in pipelined parallelism [38].
²The input data is partitioned between multiple processes and memories; each operator works on an independent part of the data [38].

3.3 NoSQL Databases

In recent times, there have been many advancements in NoSQL databases based on distributed database architectures. They are increasingly explored and used in Big Data and real-time applications. NoSQL does not refer to a single class of database systems, but to a collection of different approaches. They can be of various types, like document stores, graph stores, row stores, column stores and others. Each of these systems aims to provide efficient and scalable data handling capabilities. Interestingly, NoSQL databases are well suited for hierarchical or connect queries.

Cassandra [57][1] is an open-source row-store distributed DBMS that handles large datasets distributed across multiple commodity servers. However, it is not suited for spatio-temporal queries with joins, as it does not support joins or subqueries [1]. ClusterPoint DBMS [2] is a scalable NoSQL database server that manages distributed document-oriented (XML/JSON) data stores. It is based on the concept of a customizable ranking index and provides a full-text search option in a distributed cluster database. NuoDB [9] and Riak [11] are cloud-based distributed data stores.

Column-store databases [16][81] are another class of NoSQL databases, characterized by partitioning and storing relations vertically. A column store accesses only the required columns during query processing. They provide efficient data compression, thereby reducing storage cost and making them I/O-efficient. Idreos et al. [59] provide a detailed study on the design and implementation of column-store database systems. C-Store [87], MonetDB [60] and Vectorwise [104] are some of the open-source column-store database systems. Druid (column-oriented) [3] and FoundationDB [5] are some of the other distributed NoSQL database systems.

3.4 Other Approaches

There are other, non-MapReduce solutions for processing and managing large datasets. Mamoulis et al. [69] highlight the importance of efficient management of spatio-temporal data for faster access to range queries with temporal predicates. They use a set of index structures that exploit discovered patterns for efficient data management and processing of spatio-temporal queries. [43] proposes a parallel analytics server designed to provide a high-performance OLAP query engine. Another alternative to Hadoop or distributed databases is massively parallel processing relational database systems (MPPDBs) for Big Data analytics. Thrifty [98] is a prototype implementation of MPPDB-as-a-service which provides features like a SQL interface, fast data loading, easy scale-out, fast recovery, and short analytical query processing times.

Pavlo et al. [80] studied the prospects of MapReduce for large-scale data analytics and compared it with existing parallel database systems. These two classes of systems are dissimilar in programming models, optimizations, data distribution and approaches. The study brings out the pros and cons of each of these systems by executing and evaluating benchmarks. It also provides a detailed performance study and comparison of Hadoop (MR), Vertica (a columnar database) [14] and DBMS-X (a parallel database system). From the performance study, it is evident that Vertica performs better for selection and aggregation operations compared to parallel databases and MapReduce. This is because it accesses only the required columns and does not have to read the entire relations. However, the parallel database marginally outperforms Vertica in join tasks, due to an optimizer bug in the system for larger numbers of nodes in the cluster.


Chapter 4

Postgres-XC

Postgres-XC is an open source parallel database which provides a write-scalable, synchronous, symmetric, and transparent PostgreSQL cluster solution [10]. It reuses most of the existing modules of the standalone Postgres database server. All its components are tightly coupled and can be deployed across multiple machines. The following features make it different from the Postgres-standalone version:

– Write-scalable : compared to a single database server, users can set up the Postgres-XC server with multiple database servers to scale write operations.

– Synchronous : all the database servers in Postgres-XC are highly synchronized, and any update on any of these servers is immediately visible to any other transaction running on any of these database servers. In a Synchronous Multi-Master setup, any database update from any database server is immediately visible to any other transaction running on different masters.

– Symmetric : it provides a consistent database view to all transactions in the cluster.

– Transparent : a user need not worry about how data is stored internally.

4.1 Architecture

Figure 4.1 [78] gives an overview of the Postgres-XC architecture. Each individual component is discussed in the following subsections.


Figure 4.1: PGXC Architecture.

4.1.1 Global Transaction Manager or GTM

The GTM provides transaction management and visibility checking using the following features:

– GXID : globally consistent transaction identifiers assigned to transactions running across the cluster [2].

– Snapshots : provides global snapshots by collecting the status of all the transactions on the cluster.

4.1.2 Coordinator

It is the most important unit of the Postgres-XC server and serves as the master. It does not store any data; however, it keeps track of each table's distribution strategy, catalog data (to decompose the statements), datanode locations and other relevant information. It provides an interface for applications and handles the parsing, planning and execution of the queries. Whenever an application sends a query statement (Q), the coordinator identifies the datanodes and decomposes the query into local SQL statements, referred to as Remote Queries (RQ), one for each datanode. These Remote Queries are executed at the datanodes, and the results are used by the coordinator to generate the output of the actual query, Q. This is illustrated in Figure 4.2 [78].

4.1.3 Datanode

They are the workers of the Postgres-XC server, where the data is stored. Each datanode executes the Remote Queries on its local data and sends the output to the coordinator.


Figure 4.2: Query Handling in Postgres-XC. The query Q is broken down into coordinator operations and Remote Queries, RQ. Each datanode executes its remote query and sends the result to the coordinator. The coordinator consolidates the results from all the datanodes and generates the output of the query.

4.1.4 Single-Compute node and Multiple-Compute node setup

Postgres-XC can be set up on a single compute node or on multiple compute nodes. In the single compute node setup, all the components of Postgres-XC (i.e., coordinator, gtm, gtm proxy and datanodes) are set up on the same compute node. In the multiple compute node setup, the components are located on different compute nodes.

4.2 Query Planner

The query planner is an integral part of any database. It provides a set of plans for a given query and helps estimate the cost of executing each of them. It plays a vital role in the Query Life Cycle [32], see Figure 4.3. The planner utilizes the following operations, which contribute towards the cost:


– Scans : accessing a table is one of the most costly operations and hence the Query Planner needs to find an optimized strategy. The following access methods are available for the Query Planner to choose from:

– Sequential scan : each row of the table is read.

– Index scan : only a fraction of the entire table is read, using an index.

– Bitmap scan : the matching index rows are scanned first, populating the bitmap. Once the bitmap is populated, the table is accessed sequentially.

Figure 4.3: Query Life Cycle.

– Joins : finalizing the join strategy and join order is an important activity for the Query Planner. The Query Planner can use the following strategies:

– Nested loops : used to join two tables, where one table is scanned once for every row found in the other table.

– Sort-Merge join : sorts the tables and scans both tables in parallel for equal values.

– Hash join : one table is scanned and put into a hash table. Each row in the other table is then hashed on the join key to locate the matching rows in the hash table.

– Aggregates :


– Grouping via sorting : it sorts the data or uses pre-sorted data for aggregation. All new values are aggregated into the prior group.

– Grouping via hashing : each row of the input table is inserted into the hash table based on the grouping columns; finally, all these groups are aggregated.

4.3 Performance Tuning

Query performance in Postgres-XC can be affected by various parameters, which can be useful while building the query plan. Some of these performance parameters can be tuned, whereas others cannot be altered, as they are fundamental to the design of the database system. Some of the parameters at different database server configuration levels are discussed below:

• Connections and Settings :

The parameters for this server configuration are mostly related to settings – ports, sockets, TCP and others. Most of these parameters are not directly related to performance tuning. However, max_connections is an important parameter, especially for distributed database systems, as it determines the maximum number of concurrent connections along with their lock space requirements. A higher value for this parameter will demand more shared system memory or semaphores. The number of datanodes in Postgres-XC is directly related to this parameter; hence, it needs to be considered before configuring the datanodes on Postgres-XC.

• Resource Usage : It takes into account the different resources, such as memory, disk, kernel, background writer and other related resources, each having its own set of linked parameters. The parameter shared_buffers is a measure of how much memory is dedicated to the database for caching data. It sets the amount of memory used by the database server for shared memory buffers. A higher value of shared_buffers provides better performance (it can be over and above 50% of the system memory).

maintenance_work_mem is the amount of memory allocated for maintenance operations, i.e., vacuum, index creation, altering tables, etc. A higher value is recommended, as it leads to improved performance while loading large amounts of data. Another important parameter to consider is work_mem. It specifies the memory used by internal sort operations and hash tables before switching to temporary disk files. For complex queries, multiple sort and hash operations might run in parallel, and each of these operations will use up to the memory specified by this parameter before moving data into temporary files. Hence, it is advisable to set a higher value for this parameter.

• Write Ahead Logs (WAL) : There are various parameters to be considered at this level of configuration. The parameter fsync makes sure that all updates on the server


are physically written to the disk. Turning it off improves query processing performance, at the cost of durability guarantees. There are other parameters related to settings, checkpoints and archiving in the Write Ahead Logs that can be of interest.

• Query Tuning : This level is critical for query planning activities. It allows the user to tune parameters at different levels – Planner Method Configuration, Planner Cost Constants, the Genetic Query Optimizer and other planner options.

The parameters in the Planner Method Configuration enable us to handle joins, scans, sorts and aggregate operations. Sorting and sequential scans are expensive steps. The enable_sort parameter can be used to enable or disable the query planner's sort steps in the query plan; turning it off will make the query planner consider other alternatives as well. The enable_seqscan parameter is used to handle sequential scans. By default it is kept ON; disabling it will lead the query planner to prefer index scans over sequential scans.

Planner Cost Constants also play an important role in selecting the best query plan and help in analyzing the cost of different operations. The effective_cache_size parameter approximates the memory left over for disk caching after taking into account the memory used by the operating system, dedicated Postgres memory, and other applications. It assists the query planner by providing the memory statistics for caching data. It can be set to a value between 50 – 75% of the total memory. It is important to note that if its value is too low, indexes might not be used as expected. The cpu_index_tuple_cost parameter sets the planner's estimate of the cost of processing each index row during an index scan; it can be lowered substantially below its default value (0.001). Some of the other cost constants of importance include cpu_tuple_cost, random_page_cost and cpu_operator_cost. An illustrative configuration sketch is given at the end of this section.

It is interesting to note that by changing these parameters we only share our preferences with the query planner. The final decision lies with the planner: it evaluates every possible query plan using the parameters discussed above and chooses the best one.
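As an illustration only, a few of these parameters could be set at the session level as follows. The values are assumptions for a large-memory compute node, not recommendations from the Postgres-XC documentation, and shared_buffers (which requires a server restart) is omitted.

-- Hypothetical session-level tuning values (illustrative only).
SET work_mem = '256MB';               -- memory per internal sort/hash operation
SET maintenance_work_mem = '2GB';     -- bulk loads, index builds, VACUUM
SET effective_cache_size = '48GB';    -- planner's estimate of available OS cache
SET cpu_index_tuple_cost = 0.0005;    -- bias the planner towards index scans
SET enable_seqscan = off;             -- ask the planner to avoid sequential scans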


Chapter 5

Count-min sketch or CM sketch

The Count-Min or CM sketch was proposed by Cormode et al. [34][35][33] as a sublinear-space data structure for summarizing data streams. It provides approximate answers to different types of queries over a data stream – point, range and inner product queries. Spatio-temporal queries are typically point and range queries. Efficient aggregation is an important aspect of spatio-temporal analytics, and the CM sketch provides another approach for handling COUNT aggregate (GROUP BY) queries.

5.1 Data structure and Operations

The CM sketch data structure is defined with parameters (ε, δ) [34]. It is simply represented by a two-dimensional array count[w, d], where w and d represent the width and depth of the array. Initially, all the elements of count are set to zero. The width and depth of the array are set to w = e/ε and d = ln(1/δ) respectively. The CM sketch data structure also has d hash functions, which are chosen randomly from a pairwise-independent family and are denoted as:

h1, . . . , hd : {1, . . . , n} → {1, . . . , w}

Figure 5.1: CM Sketch with Update operations.


Every item is mapped to one cell in each row of the array using the hash functions hj, as shown in Figure 5.1. When a new item, say x, arrives, the cell corresponding to this item in each row of the array needs to be updated. The update operation can be written as follows:

∀1 ≤ j ≤ d : count[j, hj(x)]← count[j, hj(x)] + 1

The estimate operation is similar to the update operation. For a point query, we need to find the minimum count across all the rows of the array corresponding to the queried element. This is again done using the hash functions, which map the element to its corresponding cell in each row of the array. For a given query point i, the estimate can be written as follows:

âi = minj count[j, hj(i)]

where ai is the true count of element i in the frequency vector a. This estimate has the guarantees that âi ≥ ai and that âi ≤ ai + ε‖a‖1 with probability at least 1 − δ.

For constant ε and δ, the space used by the CM sketch data structure is polylogarithmic, O((1/ε) · ln(1/δ)). The time complexity of operations like update and estimate is of the order of O(ln(1/δ)).

5.2 MADlib implementation using UDA and UDFs

The paper [31] emphasizes tightly integrating statistical computation with parallel databases. This work was further extended into an open source library, MADlib [54], a suite of SQL-based algorithms for analytical methods which run within a database engine. Hellerstein et al. [54] provide a wide range of data-parallel algorithms for sophisticated statistical techniques.

MADlib provides an implementation of the CM sketch for the Postgres standalone version that complies with the approximation guarantees in [34]. The building blocks of the CM sketch module in the MADlib Project [54] are UDAs (User-Defined Aggregates) [48] and UDFs (User-Defined Functions) [49]. The UDA cmsketch(int col) takes a column (of type integer) of a relation as input and produces a string, which is a large array of counters. There are multiple UDFs for different operations in the MADlib Project, such as cmsketch_count, cmsketch_rangecount, cmsketch_percentile and others. For example, consider the UDF cmsketch_count(sketches text, int val). It takes the string from cmsketch as input and computes the approximate frequency of the desired item (val) in the sketch, thus giving us the approximate count of the item val in the column. This is illustrated in Figure 5.2.
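As an illustration of this interface, the sketch UDA and the count UDF can be composed as below; the table sim_output and its column infected are hypothetical and only stand in for an integer column of interest.

-- Build a CM sketch over the integer column and query the approximate
-- frequency of the value 5, using the UDA/UDF pair described above.
SELECT cmsketch_count(cmsketch(infected), 5) AS approx_count
FROM sim_output;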


Figure 5.2: cmsketch_count is a scalar User-Defined Function (UDF) in the MADlib Library that computes the approximate number of occurrences of a value in a column summarized by a cmsketch.

5.3 CM Sketch example

Let us consider an input data stream consisting of the following elements:

12, 13, 13, 34, 56, 12, 25

The hash functions for CM sketch are given below:

H = {h1, h2, h3, h4}

The count[w, d] array is given in Figure 5.3a, and all elements are initially set to 0. For each element in the data stream, the hash functions are applied and the array is updated.

Figure 5.3: Array count[w, d] for CM Sketch: (a) the initial array, (b) the array after the first update, and (c) the final array after all updates.

For example, let us consider the first element of the input data stream, 12. The hash functions generate the following output for the first element:


h1(12) = 1, h2(12) = 1, h3(12) = 1, h4(12) = 2

∀1 ≤ p ≤ d, hp(x) = k corresponds to the cell (p, k) in the count[w, d] array. For example, h1(12) = 1 will increment the counter in cell (1, 1) of the array. All the values generated by the hash functions for 12 are updated in count[w, d] as given in Figure 5.3b.

Similarly, hash functions on the other elements generate the values as given below:

h1(13) = 2, h2(13) = 2, h3(13) = 2, h4(13) = 4

h1(13) = 2, h2(13) = 2, h3(13) = 2, h4(13) = 4

h1(34) = 1, h2(34) = 2, h3(34) = 3, h4(34) = 5

h1(56) = 1, h2(56) = 3, h3(56) = 1, h4(56) = 7

h1(12) = 1, h2(12) = 1, h3(12) = 1, h4(12) = 2

As these values are generated, the counters in the count[w, d] array are updated corresponding to the hash values of each element. Figure 5.3c gives the final values in the count[w, d] array.

The count of any element, say x, in the data stream can be found by estimate(x) = minj count[j, hj(x)]. The estimate for the element x = 13 in the input data stream is given by:

estimate(13) = min(count[1, h1(13)], count[2, h2(13)], count[3, h3(13)], count[4, h4(13)])

estimate(13) = min(2, 2, 3, 2)

estimate(13) = 2

Thus, we can see that the output of the estimate operation for x = 13 is in accordance with the frequency of the given element in the input data stream.


Chapter 6

Hive

Hive is a distributed relational engine designed for data- and compute-intensive workloads. It is based on MapReduce (a data-parallel system), which, over the years, has gained ubiquitous presence and has become synonymous with Big Data computing. Due to the use of the MapReduce engine, relational queries in Hive are processed in a bulk-synchronous, data-parallel style of computation. This also gives Hive the ability to scale out to a massively large number of compute nodes, along with the inherent fault tolerance built into the MapReduce engine.

6.1 Architecture and Setup

We have set up the myHadoop [65] implementation of Hadoop on the cluster, which uses traditional HPC resources. Resource allocation for the setup is done using TORQUE (PBS). One of the advantages of this flavor of Hadoop is that we do not need root-level access to set it up. It allows multiple users to run their jobs simultaneously without any bottlenecks. Once the myHadoop setup has been configured on the allocated compute nodes in the cluster, we set up Hive on top of it.

Figure 6.1 [93] illustrates the architecture of Hive. Hive has two main components – the External Interfaces and the Driver. HiveQL is a SQL-like declarative language supported by Hive. The External Interfaces receive the query and the Driver sends it to Hadoop for processing. The key components are briefly discussed below (refer to [93] for more details):

• External Interfaces : It connects Hive with external applications or tools using thefollowing channels:

– Thrift Server : It integrates applications or tools with Hive using APIs such as JDBC, ODBC and others. It receives the queries and returns the output to the


Figure 6.1: Hive Architecture.

application.

– Command Line Interface (CLI) : Applications or tools can send their queries using the command line option.

– Web Interface : It is a Graphical User Interface (GUI) and provides an alternative to the CLI.

• Driver : It manages the entire query life cycle of the HiveQL statements. The entire processing and query handling is done by the following components:

– Compiler : It generates the query execution plan from the HiveQL statements and sends it to the Optimizer. Once the Optimizer returns the optimized plan, it compiles it and generates a directed acyclic graph of map-reduce jobs. It is important to note that the query planning activities in HiveQL are very similar to those in Postgres-XC.


– Optimizer : Evaluates the execution plan and finalizes the optimized executionplan for the given query.

– Executor : It interacts with Hadoop and executes the jobs in the proper dependency order, as specified by the execution plan generated by the Compiler.

• Metastore : Stores the system catalog, which contains the schemas and statistics.

It is important to know the fate of the jobs submitted to Hadoop, since the output of their execution constitutes the query result. Hadoop has two main components – the Hadoop Distributed File System (HDFS) and MapReduce. HDFS runs on top of the file system of each node in the cluster. The MapReduce component handles the Map and Reduce operations and is responsible for parallelism in Hadoop. The nodes in Hadoop are of two types – MapReduce nodes and HDFS nodes. We can see from Figure 6.1 that the Job Tracker and the Task Trackers are MapReduce nodes. There is one Job Tracker and multiple Task Tracker nodes in Hadoop. The NameNode and DataNodes are HDFS nodes. HDFS has a Master-Slave architecture and consists of one NameNode and multiple DataNodes in a Hadoop cluster. The NameNode serves as the master node and manages the filesystem namespace and metadata. Each DataNode is responsible for managing and storing its allocated blocks of data.

The Job Tracker receives the map-reduce jobs from the Executor. It schedules and monitors these jobs on the Task Tracker nodes. Each Task Tracker runs the Map and Reduce tasks on the different datanodes. Once these tasks are completed, the Job Tracker updates the Executor. Finally, the Executor fetches the results and sends them to the application or tool through the External Interfaces.

6.2 Query Processing

Hive employs mappers and reducers for query processing. Figure 6.2 [67] illustrates the Map and Reduce phases of query processing.

The tables stored in HDFS are partitioned into small chunks of data, or splits. This allows the Map and Reduce tasks to work in parallel on independent data. The Mapper loads the required partitions or splits from HDFS and transforms them into maps. These maps are sorted and stored as intermediate files on the local disk. These files are used by the Reducers for further processing. The Reducers merge these files from the different mappers and generate the output file, which is stored in HDFS. The Map Operator Tree and Reduce Operator Tree describe the operations in the individual Map and Reduce jobs, and are discussed in the following section.
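For concreteness, the aggregate queries processed in this style might look like the following HiveQL sketch; the table and column names (infections, county, infection_day) are hypothetical and only illustrate the shape of such queries.

-- Hypothetical HiveQL aggregate over a spatio-temporal simulation table.
-- Hive compiles this into Map tasks (scan and partial aggregation on the
-- splits) and Reduce tasks (final aggregation), as described above.
SELECT county, COUNT(*) AS num_infected
FROM infections
WHERE infection_day BETWEEN 10 AND 16
GROUP BY county;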


Figure 6.2: Map and Reduce in Hadoop.

6.3 Execution Plan

We established in the previous section that Hive's contribution to query processing is limited to the generation of the map-reduce jobs for Hadoop to process. Thus, it becomes imperative to study the execution plan to understand the true nature of the query processing.

Figure 6.3: Stage Plan for query Q6 on Hive. Each individual stage consists of Map-Reduce jobs to be executed on the Hadoop cluster.

The Hive Compiler parses the query and translates it into an Abstract Syntax Tree (AST). The Semantic Analyzer in the Compiler translates the AST into a Directed Acyclic Graph (DAG)


of MapReduce tasks. Figure 6.3 [93] depicts the Stage Plan for query Q6 (we have translated the actual query to HiveQL). The different stages and their dependencies are given below:

STAGE DEPENDENCIES:
  Stage-1
  Stage-2 depends on stages: Stage-1
  Stage-3 depends on stages: Stage-2
  Stage-0 is a root stage

We can see from the Stage Plan that the entire task is divided into four stages. Stage-0 is the root stage and is a file-system-related stage whose task is to move the results from a temporary directory to the destination path. Stage-2 depends on the output of Stage-1, while Stage-3 depends on Stage-2. From Figure 6.3, we can see that there are three Map-Reduce stages – Stage-1, Stage-2 and Stage-3. Each of these stages comprises a Map Operator Tree (Map) and a Reduce Operator Tree (Reduce). The Map Operator Tree corresponds to the Map tasks and tells the mapper which operator tree to call in order to process the rows from a particular table or from the result of the previous stage. The output of the Map task is a Reduce Output Operator, which also serves as the input to the Reduce task of the same stage. Similarly, the Reduce Operator Tree corresponds to the Reduce tasks and processes all the rows on the Reducer of the Map-Reduce job. The output of the Reduce task is a File Output Operator, which is fed as the input to the next stage.

The Map operation in Stage-1 scans the participating tables and generates a Reduce Output Operator. The Reduce Operator Tree carries out the hash aggregation. The Map tasks in Stage-2 and Stage-3 are similar. The Reduce task in Stage-2 merges the partial aggregates computed in Stage-1. Stage-3 also performs its Map and Reduce tasks before the results are fetched in the root stage, Stage-0.

Hadoop returns the output of the job execution to the Driver in Hive. Hive then uses the External Interfaces to display the result to the user.


Chapter 7

DiceX Framework

7.1 Introduction

The DiceX framework provides a one-stop solution for database management and processing. It is designed to effectively store and manage large volumes of spatio-temporal data and to provide data- and compute-intensive processing. It provides an alternative to MapReduce-based tools, saving us the labor of writing Map and Reduce programs. The tool exploits existing relational database technologies and concepts, as they are more suited for epidemiological analytics [25]. It can be easily deployed on clusters, making it scalable and capable of parallel processing. In the following sections we provide an architectural overview of DiceX and discuss data management and processing, along with some issues and challenges.

7.2 Architectural Overview

The architecture of DiceX is simple in setup and design, which is why it is easily configurable yet capable of handling compute- and data-intensive operations. Postgres-XC is the key component of the DiceX framework and is tightly integrated with the other components. As discussed in Section 4.1.4, there are two possible configurations for setting up the Postgres-XC clustered database – (a) the single compute node and (b) the multiple compute node setup. Both configurations are identical in all aspects, except that the multi-compute node setup is more suited for large data storage and processing, as it allows users to scale hardware resources. The datanodes in Postgres-XC provide the data storage and processing engine. Another important component of DiceX is the vanilla version of Postgres, Postgres-standalone. Foreign data wrappers are not supported on Postgres-XC; however, they can be utilized via Postgres-standalone, which serves as a conduit during data migration from external databases. Its use will become more evident when we explain the data migration process.


Figure 7.1 gives a detailed overview of the multi-compute node DiceX setup.

Figure 7.1: DiceX Framework.

In this setup, the components of DiceX are distributed across multiple compute nodes in the cluster. The setup configuration (i.e., base port, number of compute nodes, number of datanodes per compute node and others) is provided by the Configuration file, which can also be used to set DB-level parameters for Postgres-XC. The Driver scripts launch DiceX on the compute nodes using the details from the Configuration file. In this setup, the Postgres-XC coordinator and the Postgres-standalone are set up on the same compute node but on different ports. The datanodes are set up on each of the compute nodes. The entire DiceX framework is based on the Master-Slave concept, where the Postgres-XC coordinator acts as the master and the datanodes act as slaves.


7.3 Coupling Distributed Databases on Supercomputers

The DiceX framework launches Postgres-XC on the sfx cluster (for specifications, refer to Section 8.1.1) using Job Arrays. Job Arrays are helpful in rationing the compute nodes in the cluster for setting up the Postgres-XC database. The user can request the number of compute nodes and the number of processors per compute node. Once the nodes are allocated, DiceX is set up on those compute nodes.

We present some of the issues we faced when using supercomputers to address both compute- and data-intensive requirements. As mentioned earlier, friction occurs primarily due to the resource management and design of the supercomputer.

7.3.1 Resources, Quota, and Allocation Management of Jobs

Almost all modern supercomputers allocate much more of these resources (computing resources like CPUs/cores, memory, disk, bandwidth, etc.) than the task requires. Hence, the allocation system only actively manages the wall time and the number of compute nodes requested by the user.

Data-intensive systems, however, need and request many other resources that a job allocation system needs to track and maintain. These include:

• File descriptors to maintain state of open files

• Semaphores, inter-process pipes, shared memory

• Main memory buffers

• TCP/IP ports

• Processes

By default, the current OS and runtime either do not keep track of these resources or set their limits too low to be useful for data-intensive computing. The following scenario illustrates why the current settings turn out to be an impedance mismatch with the application requirements: one of the standard techniques to address the latency problem when reading very large datasets is to launch multiple threads which independently handle a portion of the read workload, thereby increasing the overall read throughput. It turns out that in order to maximize throughput in current systems, a large number of threads is required. For example, in a setup having SSDs with 50K IOPS and a 3.00 GHz CPU, the number of threads for optimal throughput is close to 4096. Current OSes do not support such a large number of file descriptors by default.


Similarly, unlike compute applications, data-intensive processing is distributed across multiple processes, each of which handles a different job. For example, each datanode consists of a logger, an autovacuum worker, and multiple query processes. These processes use memory-mapped buffers and semaphores for sharing and synchronization. Current systems allocate a very small fraction of the resources for this. Without a significant number of semaphores and shared buffers, data-intensive processing performs poorly or may run into frequent deadlocks.

Another aspect different from distributed computing is the communication layer. Traditional computing uses MPI, while current data-intensive systems typically use TCP/IP ports. A single system may require a pool of ports so that its various components can communicate with each other. For example, in Postgres-XC, communication happens between the coordinator and the datanodes, between the datanodes and the global transaction manager, and between the connection pooler and the coordinator. The ports, much like disk space on a compute node, become a resource that needs to be managed and allocated to prevent any interference during communication between different components of Postgres-XC.

7.3.2 Remote vs. Local Persistent Storage

In most supercomputing systems, storage of datasets (input, configuration, output, etc.) is made available via GPFS, a network-based file system, i.e., it provides a file system "interface" to the processes on the compute nodes. The file system, visible to processes on the compute node, physically resides on storage media attached to another machine (called the storage server). This machine is connected to all the compute nodes via a TCP/IP network. GPFS performs the task of migrating data to and from the compute node and the storage server in order to service a process's read and write requests to the files.

The GPFS-based approach to handling storage works well for computation in the physical sciences, where the inputs and outputs are fairly small (typically 1 – 10 GB) and only involve bulk reads (as input at the start of the computation) and bulk writes (as output at the end of the computation). On the contrary, data-intensive systems have large input and output sizes. Moreover, unlike computation in the physical sciences, access to data is interleaved with computation. In extreme scenarios, such as BFS over a large graph, the processing is a stream of small reads with almost negligible computation interleaved between them. In such scenarios, GPFS and other similar approaches turn out to be a significant bottleneck that severely restricts either the scalability or the type of data-intensive computation that can be carried out on supercomputers.

7.3.3 Managing Port Conflicts

If multiple sessions of database instances are running, then it is quite possible that the port of one datanode or coordinator may conflict with another one. We have resolved this issue by


deploying a port server. The port server maintains a pool of inactive ports. A datanode or coordinator, before initializing, contacts the port server, which assigns it a unique port from the pool of inactive ports. Once the database instance has finished its computation, the ports are returned to the server and added back to the pool of inactive ports.

7.4 Data Management

DiceX provides an efficient data storage and management platform for large-volume datasets. We discuss some of the relevant concepts and strategies in this section.

7.4.1 Data Storage and Distribution

The data storage and distribution strategy is governed by user input. DiceX offers both replication and distribution strategies for storing data:

[ DISTRIBUTE BY { REPLICATION | ROUND ROBIN | { [ HASH | MODULO ] ( column_name ) } } ]

Choosing the distribution strategy is an important task before setting up the tables on the Postgres-XC server, as it has direct implications for query performance. The strategies can be briefly described as given below; an illustrative CREATE TABLE example follows the list:

– Replication - this strategy replicates each row of the table to all the datanodes on the Postgres-XC server. It results in costly write operations but faster read operations. Hence, it serves well for relatively static tables and/or a high read load.

– Distribution - in this approach, the rows of the table are distributed among the datanodes. It allows parallel writes, but can lead to costly reads. The distribution of tables is governed by the approaches given below:

– Round Robin: the rows of the table are placed on the datanodes in a round-robin fashion. It does not require any distribution key and ensures an even distribution of rows across the datanodes.

– Hash: a hash value is calculated for each row on the specified column, which decides which datanode the row goes to.

– Modulo: each row of the table is placed on a datanode in the Postgres-XC server based on the modulo of the specified column.

– There are some other approaches as well, such as Range and User-Defined Functions.
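The following is a minimal sketch of how these strategies appear at table-creation time; the tables and columns (dendrogram, county_lookup, pid, etc.) are hypothetical and only illustrate the DISTRIBUTE BY clause.

-- Hypothetical simulation-output table, hash-distributed on pid so that
-- rows for the same person land on the same datanode.
CREATE TABLE dendrogram (
    pid          BIGINT,
    infector_pid BIGINT,
    day          INT,
    county       VARCHAR(5)
) DISTRIBUTE BY HASH (pid);

-- A small, mostly static lookup table is replicated to every datanode,
-- trading costly writes for fast local reads.
CREATE TABLE county_lookup (
    county VARCHAR(5),
    name   TEXT
) DISTRIBUTE BY REPLICATION;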


7.4.2 Data Ingestion

Once the database instance is up, the tool proceeds to populate the database with the tables (or files) mentioned in the datasource. Typically, data ingestion in a database is performed by executing an insert command for each row of the tables. If the inserts are performed sequentially, the overall time for populating the database would be quite high. This would defeat the purpose of having the database within the supercomputer, since all the efficiency gains would be lost to high insertion costs.

Data loading using Foreign Data Wrappers

The DiceX framework has been specially built to migrate data from existing databases to Postgres-XC. Postgres-standalone, along with its foreign data wrappers (FDWs) [55], plays a vital role in data migration. Foreign data wrappers provide a mechanism to connect to external datasources. They are not supported on Postgres-XC; however, Postgres-standalone supports them, which also justifies why Postgres-standalone is a part of this framework. The data loading is illustrated in Figure 7.2:

Figure 7.2: Data Migration - Standalone (Oracle) to Parallel (Postgres-XC) databases.

In the example above, we migrate data from external datasources (Oracle, files) to Postgres-XC via Postgres-standalone. For simplicity, let us consider data migration from Oracle to DiceX (movement from files follows the same procedure). The entire process can be described as follows:

– Let Ts and Td be the source and destination tables respectively.

– We set up foreign server objects pgxc_server and oracle_server on Postgres-standalone.


– We create a foreign table FT on the oracle_server server in Postgres-standalone for the table Ts in Oracle.

– We create a foreign table FT' on the pgxc_server server in Postgres-standalone for the table Td in Postgres-XC.

– Once the foreign data wrappers are set up, the three components work in tandem. Inserting data from FT into FT' populates the table Td in Postgres-XC.

Postgres-standalone plays no role in query processing and is used only during data migration; the Postgres-XC server handles all query processing.
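A minimal sketch of this setup on Postgres-standalone is shown below, assuming the oracle_fdw and postgres_fdw extensions are available. The server names follow the text above, while the connection options, credentials and table definitions are illustrative placeholders, not the exact DiceX configuration.

-- Foreign server for the external Oracle database (source).
CREATE EXTENSION oracle_fdw;
CREATE SERVER oracle_server FOREIGN DATA WRAPPER oracle_fdw
    OPTIONS (dbserver '//oracle-host:1521/EPIDB');
CREATE USER MAPPING FOR CURRENT_USER SERVER oracle_server
    OPTIONS (user 'epi_user', password 'secret');

-- Foreign server for the Postgres-XC coordinator (destination).
CREATE EXTENSION postgres_fdw;
CREATE SERVER pgxc_server FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'coordinator-host', port '5432', dbname 'epi');
CREATE USER MAPPING FOR CURRENT_USER SERVER pgxc_server
    OPTIONS (user 'epi_user', password 'secret');

-- Foreign tables FT (over Ts in Oracle) and FT' (over Td in Postgres-XC).
CREATE FOREIGN TABLE ft_source (pid BIGINT, day INT, county VARCHAR(5))
    SERVER oracle_server OPTIONS (table 'T_S');
CREATE FOREIGN TABLE ft_dest (pid BIGINT, day INT, county VARCHAR(5))
    SERVER pgxc_server OPTIONS (table_name 't_d');

-- Migration: reading from FT and inserting into FT' populates Td.
INSERT INTO ft_dest SELECT * FROM ft_source;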

Parallel Ingestion

Large volumes of data might lead to extremely high loading times. We optimize data ingestion costs by parallelizing the process of data insertion. This is achieved using our Multiple-View approach. To parallelize data insertion, we partition the tables and perform insertions on these partitions concurrently, i.e., we launch several processes (one per partition), where each process executes the insert command for the rows in the corresponding partition. Distributed databases, by design, allow simultaneous execution of commands and queries.

Figure 7.3: Parallel Ingestion using multiple views.


The Multi-View approach is illustrated in Figure 7.3. In this approach, we divide the data into smaller chunks of approximately similar size. We then create views in the Oracle database for each of these data chunks. Once the views are created, we launch multiple concurrent processes (one per view) to insert the data into the tables on the Postgres-XC server.
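As a rough sketch of the partitioning step, the source table can be sliced into views on the Oracle side by taking a modulo on a key column, with one concurrent loader process per view; the table name t_s, the column pid and the partition count of four are assumptions for illustration.

-- Four hypothetical views over the source table t_s, each covering a
-- disjoint slice of the rows, so that four processes can insert into
-- Postgres-XC concurrently (one process per view).
CREATE VIEW t_s_part0 AS SELECT * FROM t_s WHERE MOD(pid, 4) = 0;
CREATE VIEW t_s_part1 AS SELECT * FROM t_s WHERE MOD(pid, 4) = 1;
CREATE VIEW t_s_part2 AS SELECT * FROM t_s WHERE MOD(pid, 4) = 2;
CREATE VIEW t_s_part3 AS SELECT * FROM t_s WHERE MOD(pid, 4) = 3;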

Figure 7.4 shows the time to insert a dataset (when stored in a file or in a remote database) while varying the number of partitions. The table is a simulation output with rows occupying 5 GB of space.

Figure 7.4: DiceX allows optimized data ingestion into a table from a file or from a table in a remote database by partitioning the data source and ingesting each partition concurrently. The plot shows performance varying with the number of partitions when the remote data source is a file vs. a table in a database. Although the ingestion scales with the number of partitions, it is interesting to observe that both the file and the remote database have similar costs, even though ingestion from the remote database involves several additional steps and the data travels a larger distance. This is possibly because of the high insertion cost in Postgres-XC itself, which dominates the cost of read and transfer. The Y-axis is in log scale.

Data loading using COPY command

Another method for ingesting data is the copy command. Unlike inserts, which are applicable to both remote tables and files, copy is only applicable to data stored in files. There is another difference between copy and insert, with respect to triggers, which turns out to be crucial for our problem of efficient ingestion. Triggers are predefined rules that are fired if certain conditions are met. Databases, upon modification to their tables or


relations (updates or inserts), perform various checks and fire the appropriate triggers. The copy command, however, is an exception: even though it modifies the data, it does not check for triggers. This makes ingestion considerably faster, especially for the distributed database.
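A minimal sketch of file-based ingestion with copy is shown below; the file path and table (the hypothetical dendrogram table from earlier) are illustrative.

-- Bulk-load a CSV file directly into a distributed table on Postgres-XC.
-- COPY avoids per-row INSERT overhead, which is what makes it roughly an
-- order of magnitude faster in the experiments reported below.
COPY dendrogram (pid, infector_pid, day, county)
FROM '/scratch/data/contact_network_tx.csv'
WITH (FORMAT csv, HEADER true);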

Figure 7.5: Performance of parallel ingestion using the copy command, (a) without index and (b) with index. The Y-axis is in log scale.


Figure 7.5 shows the performance of ingesting the social contact networks for various states using the copy command. Two setups are considered – one in which the tables do not have indexes and one in which they do. It is important to note that, in the latter case, the indexes are built before the data ingestion. The three datasets are social contact networks over synthetic populations, namely CA, TX and MO, with sizes of 26 GB, 17 GB and 4.2 GB respectively. We see that using multiple inserts on indexed tables takes roughly the same time as on non-indexed tables. Moreover, this approach also brings the setup time down to less than 100 seconds for large datasets, allowing analytics to be answered in almost real time (see Figure 7.4). The copy performance scales linearly with the dataset size and is an order of magnitude faster than the insert command. Hence, ingestion from a file using the copy command is preferred over ingestion from a remote database.

7.5 Query Processing

We have developed DiceX, which sets up a distributed relational data engine within a supercomputer over given data sources. It enables scalable data-intensive processing by utilizing supercomputing hardware. Currently, DiceX supports two different kinds of data engines: (multiple) standalone Postgres and distributed Postgres. DiceX has several features to help efficiently set up the computing environment. It allows for ingestion of data from multiple data sources; a data source can either be a collection of files or tables in an existing external database. It is also capable of ingesting data from the data source in parallel to populate the database. Once the ingestion is completed, the system is ready for data-intensive computing.

The tool has several parameters to configure it for various workload types:

• -fs [ram | disk | network] : specifies the storage media type for the datanodes. The "ram" option creates the storage in main memory, "disk" creates it on the local disk, and "network" creates it on the GPFS network shared file system.

• -num_dn k : the number of datanodes per compute node

• -num_cn c : the number of compute nodes involved in processing. The tool will request c + 1 compute nodes, since the coordinator and gtm are on a separate compute node.

• -coord [fat | regular] : specifies whether the coordinator is launched on the fat node or on a regular compute node. The "fat" option enables computing distributed across a set of heterogeneous hardware.

• -fs [file | db | sim] : specifies the type of the data source. Option "file" specifies that the data source is a file, while "db" specifies an external database. Option "sim" specifies that the data is the output of a simulation. For each type there are additional options that specify the location of the data source.


• -port p : the port that the client will use to connect to the database

Query processing in the pgxc system typically involves the processing engines at both the datanodes and the coordinator. The coordinator builds the distributed query plan and orchestrates the query processing: it finds the datanodes required for answering the query based upon the distribution of the tables involved in the query and the semantics of the query. It then sends the query fragments to be executed at the datanodes and waits for their results. Finally, it performs post-processing, i.e., it executes a query fragment on the results from the datanodes. Complex queries may involve several rounds of processing back and forth between the datanodes and the coordinator. We say that the fragment of the query being executed at the datanodes is pushed down. In order to achieve scalability, it is necessary that a significant fraction of the workload is pushed down to the datanodes.

In addition to parallelizing computation over a single datasource, the DiceX framework allows the same computation to be processed on multiple data sources. In order to do so, we specify a data source generator which lists all the datasources over which the computation needs to be performed. The DiceX scripts, using the job management system, launch a separate job for each datasource. This approach allows for massively parallel data-intensive processing without having to worry about low-level data management and computation. For example, using this approach, testing and validation of the generated synthetic data for all 50 states can be issued concurrently and independently. This leads to at least an order of magnitude improvement in the completion of the analysis [52].

7.6 Indexing and Clustering

Indexing and clustering are important query optimization techniques that can be employed for spatio-temporal data analytics. DiceX uses the Postgres-XC database engine, which supports both approaches. Creating effective indexes may result in better performance by making fetch (read) operations faster. The different types of indexes supported by Postgres-XC are B-tree, Hash, GiST and GIN. GIN and GiST are used for full-text searches and are not suited for spatio-temporal queries. GiST can be used on either a tsvector 1 or a tsquery 2 column, whereas GIN can only be applied to a tsvector column. Hash indexes can be employed for simple equality operations. However, B-tree indexes are best suited for optimizing query performance on spatio-temporal data, as they provide efficient indexing for point and range queries compared to the other indexing options.

Clustering is used to cluster the relations (or tables) based on the specified index (which

1 A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to make different variants of the same word look alike. Sorting and duplicate-elimination are done automatically during input [78].

2 A tsquery value stores lexemes that are to be searched for, and combines them using boolean operators [78].


should already have been defined) and is done only once. It re-sorts the table using the index, resulting in an ordering of the physical data. The purpose of clustering is to store data that is used together on the same disk space. If the index identifies the first matching row on a table page, all other rows that match are probably already on the same table page [78]. This saves the cost of disk accesses and enhances query performance. The impact of indexing and clustering is further illustrated in Section 8.3.2.
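A minimal sketch of the indexing and clustering steps, using the hypothetical dendrogram table from earlier, might look as follows; the index name and column are assumptions.

-- B-tree index on the temporal attribute, to speed up point and range queries.
CREATE INDEX idx_dendrogram_day ON dendrogram (day);

-- One-time physical re-ordering of the table along the index, so that rows
-- accessed together (the same day) tend to reside on the same disk pages.
CLUSTER dendrogram USING idx_dendrogram_day;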

7.7 CM Sketch

The parallel processing power of distributed or parallel databases can be useful in scaling statistical and analytical processing of large-scale data. The CM sketch can be a useful tool for large-scale data processing if integrated with parallel databases, and the CM sketch data structure is well suited to multi-threaded operation, as its inherent operations are adding and counting. A single-threaded implementation of cmsketch for Postgres is available in the MADlib Project [54]. We implemented and extended the cmsketch module from the MADlib Project for Postgres-XC.

7.7.1 Implementation on Postgres-XC

The CM sketch implementation in Postgres-XC uses state values (STYPE) and two or three UDFs – a state transition function (SFUNC), an optional state final function (FINALFUNC) and a state collection function (CFUNC). They can be created by executing the following psql statement at the coordinator:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
    SFUNC = sfunc,
    STYPE = state_data_type
    [ , FINALFUNC = ffunc ]
    [ , CFUNC = cfunc ]
    [ , INITCOND = initial_condition ]
    [ , SORTOP = sort_operator ]
)

Some of the important parameters for creating aggregates are discussed below; an illustrative declaration follows the list:

• input_data_type : the input data type on which the aggregate function operates.

• state_data_type : the data type for the aggregate's current state.

• sfunc : the state transition function that is called for each input row. It takes the current state value (of type state_data_type) and the current input data value(s) (of type input_data_type), and returns the next state value.

sfunc( internal-state, next-data-values ) → next-internal-state


• cfunc : the state collection function that is called for each partial (transition) result received from the datanodes. It takes the current collection state value and a transition value as arguments and returns the next collection state value. All the input arguments and the return type are of type state_data_type. This function also determines the aggregation mode in Postgres-XC (refer to Section 7.7.2).

cfunc( internal-state, internal-state ) → next-internal-state

• ffunc : the final function that computes the aggregate's result after all the input rows have been traversed. It takes the current ending state value (collection or transition) and returns the aggregate's result. If it is not specified, the ending state value is used as the aggregate's result. The input argument is of type state_data_type.

ffunc( internal-state ) → aggregate-value
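Putting these pieces together, the distributed cmsketch aggregate could be declared roughly as follows. The function names cmsketch_trans, cmsketch_merge and cmsketch_final are hypothetical stand-ins for the transition, collection and final functions of the module, not the exact names used in MADlib or in our Postgres-XC port.

-- Hypothetical declaration of the distributed CM sketch aggregate.
-- SFUNC folds one input row into a partial sketch (at the datanodes in the
-- three-phased mode), CFUNC merges two partial sketches at the coordinator,
-- and FINALFUNC serializes the combined sketch.
CREATE AGGREGATE cmsketch (int8) (
    SFUNC = cmsketch_trans,
    STYPE = bytea,
    CFUNC = cmsketch_merge,
    FINALFUNC = cmsketch_final
);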

7.7.2 Aggregation Modes

The data in Postgres-XC is distributed across the datanodes. The cmsketch can be created either at the coordinator or at the datanodes. These two modes for processing the cmsketch aggregation in Postgres-XC are discussed below:

Figure 7.6: Two-phased (a) and Three-phased (b) aggregation modes for CM sketch on Postgres-XC.

• Two-phased Aggregation: This is similar to the MADlib Project implementation of cmsketch, where the coordinator collects the data from the datanodes and creates the sketch (refer to Figure 7.6a). In this mode, the entire aggregation takes place at the coordinator. It is divided into two phases:


– Phase 1 - This phase is called the transition phase and is carried out at the coordinator. In this phase, Postgres-XC creates a temporary variable (of type state_data_type) to hold the current internal state of the aggregate. The aggregate argument values are calculated for each input row, and sfunc is invoked to calculate the new internal state value.

– Phase 2 - Also called the finalization phase, this phase invokes ffunc to calculate the aggregate's results. If ffunc is not specified, the ending state value is returned as-is.

• Three-phased Aggregation: We implemented another flavor of cmsketch where each datanode builds its own sketch on its local data. Once all the sketches from the individual datanodes are built, they are sent to the coordinator. The coordinator aggregates these individual sketches to generate the final sketch of the data (refer to Figure 7.6b). This mode of processing the cmsketch aggregation can also be referred to as cmsketch pushdown. It can be divided into three phases:

– Phase 1 - The transition phase, similar to Phase 1 of the Two-phased Aggregation, but carried out by the datanodes on their local data. The results of this phase are transferred to the coordinator for further processing.

– Phase 2 - This phase is also called the collection phase and takes place at the coordinator. Postgres-XC creates a temporary variable (of type state_data_type) to hold the current internal state of the collection phase. The cfunc is invoked to calculate the new internal collection state value using the current collection state value and the new transition value obtained from a datanode. All the transition values from the datanodes are processed in this phase.

– Phase 3 - The finalization phase, which invokes ffunc to calculate the aggregate's results. If ffunc is not specified, the ending state value is returned as-is.

7.7.3 Query Plan

Query plans provide great insight into the structure of a query. We have generated the cmsketch query plans for the point and range queries Q1 and Q4 respectively. The query plans for both these classes of queries are the same whether the cmsketch is created at the coordinator or at the datanodes. In this section, we evaluate the query plans for both scenarios.

Figure 7.7a illustrates the query plan when the cmsketch creation takes place at the coordinator. In this case, each datanode is responsible for scanning its local data. The coordinator accumulates the data from each datanode and creates the final cmsketch. One important thing to note is that the data fetched from the datanodes is not sorted; aggregation and sorting of unsorted data at the coordinator adds to the cost of the cmsketch creation.


Figure 7.7: Comparison of query plans for the Two-phased and Three-phased aggregation modes for cmsketch (without indexing and clustering): (a) cmsketch at the coordinator, (b) cmsketch at the datanodes.

The query plan in Figure 7.7b depicts the cmsketch creation at the datanodes. The cmsketch creation at the datanodes uses a Hash Aggregate, and the result is sorted before being sent to the coordinator. Once these local cmsketches are created, they are sent to the coordinator, which creates the final cmsketch by combining them. The cmsketch creation is a costly operation, especially on large volumes of data; in order to make it effective, the cmsketch creation has to be distributed between the datanodes.

7.7.4 Cost Analysis

A cost analysis of the CM sketch in a distributed environment is useful for implementing it on DiceX. We have already discussed the two aggregation modes of the CM sketch. In this section, we provide a cost analysis of the CM sketch for both aggregation modes. Table 7.1 lists the symbols and notation used in this study.


Symbol   Definition
T        Relation or Table
α        Attribute(s) of T
n        Number of datanodes
R        Results from the datanodes as tuples
ρ        Results from the datanodes as local cmsketches

Table 7.1: Symbols and Definitions

The relation T is hash distributed on α, with indexing and clustering on the same attribute. The basic operations taken into consideration for the cost analysis are given in Table 7.2. These operations govern the cost of cmsketch creation and handling.

Operation            Notation   Description
Predicate            θ          x = α, where x is the filter
Selection            σ          Select x from T
Index Only Scan      γ          Index Only Scan on T(α)
Data Transfer        ψ          Transfer of results from datanodes to the coordinator
Parallel Operation   ∏(O)       Parallel processing of an operation O at the datanodes
Aggregation          ∑          Aggregation of the results at the coordinator

Table 7.2: Basic operations for CM Sketch cost analysis

In Two-phased Aggregation mode, the datanodes transfer the local data to the coordinator. The coordinator merges the local data and creates the cmsketch. The data transfer and merge operations form the major portion of the cmsketch creation time. Let ψ_t represent the cost of transferring tuples from the datanodes to the coordinator. The cost of cmsketch creation in this case can be given as:

C1 = Parallel query processing on datanodes + Data Transfer + Result Aggregation

C_1 = \prod_{i=1}^{n} \left( \sigma_{\theta} T_i \cdot \gamma \right) + \psi_t + \sum_{i=1}^{n} R_i

where T_i is the fraction of the table T on the i-th datanode.

Similarly, we evaluate the cost of cmsketch creation using Three-phased Aggregation mode. In this mode, the cmsketches are created at the datanodes on the local data. These individual cmsketches are then transferred to the coordinator to generate the final cmsketch. The cost of the final cmsketch creation can be broken down into the following operations:

C2 = cmsketch creation on datanodes + cmsketch Transfer + Result Aggregation

C_2 = \prod_{i=1}^{n} \left( \sum \sigma_{\theta} T_i \cdot \gamma \right) + \psi_{cm} + \sum_{i=1}^{n} \rho_i

where ρ_i denotes the cmsketch on table T_i for the i-th datanode.

7.8 Data Intensive Computing using DiceX

DiceX is designed to carry out a spectrum of data intensive computations in computational epidemiology. These include database driven epidemiological simulations, testing and validation of models, and performing analytics on the output of population generation and epidemiological simulations. DiceX can provide support for Forecasting, Visualization, and Analysis on spatio-temporal data.

We have enhanced parallel databases so as to perform analytics on datasets derived from epidemiological simulations. Typically these are aggregate and filter queries over various demographic (age, gender, income, etc.), spatial (state, block, tract, etc.), or temporal (exposed time) attributes. An example of a typical query is "number of children infected in the past one week per county in the state of Texas". Such queries are both compute and data intensive, and they do not scale well on a standalone database. We are using parallel databases to drive such analytics, and we have built novel enhancements and various optimizations to scale processing. The current system has brought query answering time down by an order of magnitude compared with a standalone database (see Figure 8.8).
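The example query above could be written roughly as follows against the tables listed in Appendix B. The join key, the age cut-off for "children", the one-week window expressed in simulation time steps, and the assumption that Texas counties can be selected through a state FIPS prefix ('48') in COUNTYID are all illustrative choices rather than the exact production query.

-- Approximate form of "number of children infected in the past one week
-- per county in the state of Texas" over the Appendix B schema.
SELECT d.county, count(*) AS infected_children
FROM   ses_xx s
JOIN   xx_demography_info d ON d.pid = s.pid
WHERE  d.age < 18                          -- assumed cut-off for "children"
  AND  d.countyid LIKE '48%'               -- assumed encoding: Texas FIPS prefix
  AND  s.exposed_time BETWEEN 294 AND 300  -- illustrative 7-day window of simulation ticks
GROUP  BY d.county
ORDER  BY infected_children DESC;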


Chapter 8

Experiments and Analysis

8.1 Experimental Setup

8.1.1 Hardware Configurations

• Shadowfax Cluster at VBI-NDSSL: 60 compute nodes, each a 12-core Westmere-EP X5670 2.93 GHz dual-processor node with 48 GB of memory.

• Oracle (noldor-db): 4 Quad-Core Xeon processors @ 2.40 GHz with 64 GB RAM.

• DLRL Cluster: 11 compute nodes, managed with Cloudera Manager.

8.1.2 Test Data Setup

We have used two different datasets to evaluate the performance of the DiceX framework - Flucaster and Social Contact Network data. In order to perform various experiments, both of these datasets are available in external datasources - as files and as relational data in standalone databases. We also perform and evaluate a comparative study of DiceX with an RDBMS (Oracle) and a MapReduce based tool (Hive). The test data setup activities involve populating DiceX and Hive with both these datasets.

The data ingestion in DiceX is carried out using the tools and concepts discussed in Section 7.4.2. In the case of Hive, the traditional COPY and INSERT commands can be used when data is populated from files. However, for loading data from existing standalone datasources, we have used a Hadoop-based tool called Sqoop [94]. Sqoop connects the source and destination databases and allows easy and efficient data loading from one datasource to another.


8.2 Query Plan

It is important to discuss the query plans because when we evaluate a query's performance, we are actually evaluating its query plan. Selection of the query plan is vital for a query's performance. The query performance is affected by various database configuration parameters. An efficient query plan is one which provides parallel processing of the costly operations. In a distributed database setup, this can be achieved by pushing down the costly operations to the datanodes. Indexing and clustering are also very effective query optimization techniques. We can turn database configuration parameters on or off when setting up the database; however, the query planner may disregard these options if they do not contribute towards an optimized query plan. The structure of the queries can be taken into consideration when deciding the data storage strategies (refer Section 7.4.1).
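For reference, the indexing and clustering evaluated in this chapter are set up with standard PostgreSQL DDL, which Postgres-XC supports; the index name and the choice of pid as the indexed attribute are illustrative.

-- B-tree is the default index method; CLUSTER then physically reorders the
-- table according to that index so that correlated lookups touch fewer pages.
CREATE INDEX idx_demography_pid ON xx_demography_info (pid);
CLUSTER xx_demography_info USING idx_demography_pid;

-- Refresh planner statistics so the new index is actually considered.
ANALYZE xx_demography_info;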

In this section we evaluate the query plans of three different spatio-temporal queries from Flucaster - Q6, Q8 and Q9. These queries represent three different classes of queries with different spatio-levels (see Table 2.2).

(a) with index (b) without index

Figure 8.1: Query Plan for Q6.


The spatio-level of query Q6 is State. In Figure 8.1, we evaluate the query plan of Q6 with and without indexing. In both cases, the coordinator performs the scanning, sorting and aggregation on the Remote Group Query output (which is the result of the remote queries on the different datanodes). The query plan for the remote query differs between the two scenarios.

In Figure 8.1a, the query plan with indexing inherently uses the index for scanning the tables. Once both tables are scanned, they are joined using a Nested Loop operation, which is costly; however, when used in tandem with an Index Scan, it gives better performance. The query plan at the datanodes without indexing and clustering on the relations is slightly different, refer Figure 8.1b. It uses Sequential Scan and Hash Join. Hash Join is faster compared to Nested Loop; however, the Sequential Scan adds cost to the query plan.
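When comparing such plans, individual join and scan methods can also be discouraged per session through PostgreSQL's planner settings, which Postgres-XC inherits; the query below is illustrative, not Q6 itself, and the planner remains free to ignore settings that would not yield a valid plan.

-- Discourage sequential scans and hash joins for this session, then inspect
-- the alternative plan the planner chooses.
SET enable_seqscan  = off;
SET enable_hashjoin = off;

EXPLAIN
SELECT count(*)
FROM   xx_demography_info d
JOIN   ses_xx s ON s.pid = d.pid;

-- Restore the default behaviour.
RESET enable_seqscan;
RESET enable_hashjoin;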

(a) with index

(b) without index

Figure 8.2: Query Plan for Q8 and Q9.

We further investigate and evaluate the query plans for query Q8 and query Q9, which have spatio-levels of county and block, respectively (see Table 2.2). The query plans for Q8 and Q9 are similar in structure and operations. Figure 8.2 illustrates the query plan of Q8 and Q9 with and without indexing and clustering on the participating relations or tables.


The operations at the coordinator are the same as in query Q6. The structure and operations of the query plan for the remote query at the datanodes differ between the two cases (with and without indexing).

The query plan for the remote query in Figure 8.2a uses Index Scan for one relation and Index Only Scan for the other. Index Only Scan is similar to Index Scan, but can answer from the index alone and skip accessing the underlying relation. Once the relations are scanned, Merge Join is used to join them.

The query plan for the remote query at the datanodes (without indexing and clustering on the relations) uses Sequential Scan on both relations, refer Figure 8.2b. The Merge Join operation combines two sorted inputs. The Sequential Scan operations do not produce sorted output, and hence Sort operations are required before the Merge Join can be performed. The Sequential Scan followed by sorting adds additional cost to the query plan.

8.3 Query Performance

In this section, we evaluate the performance of the different classes of spatio-temporal queries. Various factors impact query performance, such as setup configuration, hardware, database level parameters and others.

8.3.1 Single-compute node versus multi-compute nodes

The Postgres-XC setup in DiceX can be configured on a single compute node or on multiple compute nodes (refer Section 4.1.4). It is important to evaluate and study the query performance on both these configurations in order to choose which is suitable for our setup.

We set up two instances of DiceX - (a) a single compute node setup, and (b) a multiple compute node setup. DiceX on a single compute node was configured with the coordinator and 12 datanodes on a single node of the cluster. Similarly, the multiple compute node setup was configured with 12 datanodes spread across different nodes (2 datanodes per compute node). In this setup, the coordinator was placed on a separate compute node.
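For illustration, a multi-node topology of this kind is wired together on the Postgres-XC coordinator with CREATE NODE statements; the node names, hosts and ports below are placeholders, and the exact set of accepted options can vary between Postgres-XC releases [10].

-- Register two of the datanodes hosted on compute node cn01 (placeholder names).
CREATE NODE dn1 WITH (TYPE = 'datanode', HOST = 'cn01', PORT = 15432);
CREATE NODE dn2 WITH (TYPE = 'datanode', HOST = 'cn01', PORT = 15433);
-- The remaining datanodes are registered in the same way.

-- Ask the coordinator's connection pooler to pick up the updated node list.
SELECT pgxc_pool_reload();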

In order to compare the performance of the different classes of spatio-temporal queries (see Table 2.2), we executed these queries on both configurations of DiceX. Figure 8.3 illustrates the performance of the queries on both setups. It is evident that the multi-node setup of DiceX performs comparatively better in most cases, although for a few queries the single-node setup marginally outperforms it. In addition, the multi-compute node configuration offers more flexibility in terms of scalability, considering the size of the epidemiological data it is expected to manage, store and process.


Figure 8.3: Single Compute node setup Vs Multiple Compute node setup.

8.3.2 Indexing and Clustering

Figure 8.4: Query execution with and without index for different Spatio-Level of State (Q6).

In Section 8.2, we evaluated the query plans for queries at different spatio-levels. We now further explore the impact of indexing on query performance. The default indexing method used is B-tree. In this section, we compare the query performance of these spatio-temporal queries on relations with and without indexing and clustering. The experiments were performed on the multi-node DiceX setup with a varying number of datanodes (2 datanodes per compute node).

(a) Q8

(b) Q9

Figure 8.5: Query execution with and without index for Spatio-Level - county (Q8) and block (Q9).


Figure 8.4 illustrates the query performance of query Q6. We can see from the graph that the query performance is significantly better when indexing and clustering are used. Similarly, we evaluate the query performance of queries Q8 and Q9.

We can see from Figure 8.5 that the query performance for Q8 and Q9 is very similar. In both cases, the performance of the spatio-temporal queries is comparatively better when indexing and clustering are used.

Figure 8.6: Speedup for different Spatio-levels.

Figure 8.6 illustrates the speedup for queries Q6, Q8 and Q9 (with and without index). We can see that the speedup achieved for Q6 is higher than that for Q8 and Q9 across the different numbers of datanodes. The speedup for Q6 ranges between 2 and 5.5, which is a very wide range, whereas the speedup for Q8 and Q9 lies between 2 and 3.

Another important observation in this study is the effect of data skew on the speedup. Data skew in parallel databases can have serious consequences for query performance: it leads to an uneven distribution of data and, hence, uneven workloads on the datanodes. The presence of negative skew in the participating relations explains the speedup observed with 16 datanodes in Figure 8.6.
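One way to quantify such skew, assuming the xc_node_id system column exposed by Postgres-XC style clusters is available in the build being used, is to count rows per datanode; the table name is illustrative.

-- Per-datanode row counts for a hash-distributed relation; a large imbalance
-- here is exactly the kind of skew discussed above.
SELECT xc_node_id, count(*) AS rows_on_node
FROM   ses_xx
GROUP  BY xc_node_id
ORDER  BY rows_on_node DESC;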

This study gives us good insight into the performance of our data distribution and query optimization strategies. Our DiceX setup performs noticeably better at the state level than at the county and block levels. This gives us more opportunities and avenues to explore.


8.3.3 Evaluation of cmsketch on DiceX

The cmsketch is one of the features that makes the DiceX framework unique. It enables our framework to support approximation algorithms.

(a) Q1

(b) Q4

Figure 8.7: CM Sketch at coordinator and datanodes for (a) Point Query (b) Range Query.

We have used a Point and a Range query from Table 2.2, Q1 and Q4 respectively, to evaluate the performance of cmsketch on DiceX. The cmsketch operation can be executed at the coordinator or pushed down to the datanodes (Two-phased or Three-phased Aggregation mode). In this study, we have set up both of these configurations on the same instance of DiceX and evaluated the performance of Q1 and Q4.
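For context, the point and range estimates in this experiment are issued through the sketch aggregates ported from MADlib; the calls below are only a sketch of that usage, with an illustrative column and with function names and signatures that may not match the exact interface of a given MADlib or DiceX build [54].

-- Approximate point query (Q1-style): how many rows have exposed_time = 300?
SELECT cmsketch_count(exposed_time, 300) FROM ses_xx;

-- Approximate range query (Q4-style): how many rows fall inside a 7-day window?
SELECT cmsketch_rangecount(exposed_time, 294, 300) FROM ses_xx;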

In Three-phased Aggregation mode, the scanning and aggregation operations are performed at the datanode level. The cost of scanning and aggregation is high, but these operations are executed in parallel. Also, the final aggregation at the coordinator is cheaper, as it aggregates the cmsketches returned from the datanodes. The Two-phased Aggregation mode uses only scanning operations at the datanodes, which return the relevant tuples to the coordinator. The aggregation of tuples from the datanodes at the coordinator degrades the overall performance of the Two-phased Aggregation mode.

Figure 8.7 illustrates the query performance of Q1 and Q4 with cmsketch on DiceX. It clearly shows that cmsketch creation at the datanodes (Three-phased Aggregation mode) is more effective than creation at the coordinator (Two-phased Aggregation mode). The cmsketch creation in Three-phased Aggregation mode can be further optimized by evaluating the dispensability of the aggregation operation at the datanodes.

Another important observation from these experiments is that this class of queries (Q1 and Q4) is significantly more costly with cmsketch than the respective non-cmsketch versions. More research and experiments are needed to optimize their performance.

8.3.4 Performance comparison with RDBMS and MapReduce based tools

One of the primary goals of our experiments was to compare DiceX with existing state-of-the-art data management and processing engines. We have compared the performance of the spatio-temporal queries on epidemiological datasets on DiceX (comprising distributed database engines), Oracle (a standalone RDBMS) and Hive (a MapReduce based tool built on Hadoop). Figure 8.8 illustrates the comparison of the aforementioned tools and systems.

From Figure 8.8, we can see that the DiceX framework is able to effectively exploit supercomputing resources to scale query performance by one to two orders of magnitude over a standalone enterprise database as well as MapReduce based tools. The performance of the standalone database was on par with that of the Hadoop-based tool. Standalone databases are not equipped for parallel processing, which is reason enough for their poor performance compared to DiceX. Hive, however, does provide parallel processing using Map and Reduce operators; its query processing is based on converting the query into a sequence of Map and Reduce jobs executed using Hadoop. Hadoop uses iterator based models for processing the data, compared to the accumulator-based approach used in parallel databases. Yu et al. [101] point out that accumulator-based models perform better than iterator-based models and hence have been adopted by databases.

Figure 8.8: Performance of Spatio-temporal queries using the DiceX framework compared to Oracle (standalone database) and Hive (MapReduce). The Y-Axis is in Log-Scale.

Our goal in designing DiceX was to provide real-time spatio-temporal analytics. The response time of the different classes of queries from Flucaster was less than 10 seconds, which is near real-time interaction (considering that real-time interaction requires a response time of less than 3 seconds). The performance of DiceX looks promising, and the framework appears capable of scaling further and providing real-time query processing.


Chapter 9

Future Works and Conclusion

Given the trends in computational science, the resource requirements for data-intensive computing will, in the near future, only be met by supercomputing resources. This thesis lists various forms of data and computation that surface in the computational sciences and shows how they stress various aspects of current supercomputing systems. We see that the current design of supercomputers and the supporting toolset are not conducive to data-intensive sciences, like computational epidemiology.

Central to our approach is the use of (distributed) databases which allow us to effectively exploit supercomputing resources (compute nodes, multi-cores, large main memory) to scale data-intensive computation by a factor of 10–100. Some scientific computations are only feasible using distributed processing. The framework itself is designed to be generic, i.e., it can handle several classes of computations such as domain specific scientific analysis (age distribution tests), relational analytic processing (OLAP), interventions, and search and navigation over several classes of data sources including files, databases, and output from computations. The framework is user friendly and handles instantiation, ingestion, and persistence of data and other low level bookkeeping and organization details. The end user only needs to specify the computation (in terms of SQL queries) and the input dataset. The framework builds up a data intensive run-time (on the supercomputer) to perform the computation and retrieve the results. DiceX also comes with some deficiencies associated with Postgres-XC. For example, the coordinator interacts with the datanodes one at a time. However, it still gives better performance compared to MapReduce for warehouse-style analytics as long as each individual record size is small. It is significantly better compared to standalone databases and can provide good support for visualization.

DiceX provides more opportunities and scope for improvements and enhancements. There is a high likelihood that the size of the epidemiological data is not fixed if it is generated using simulations, which makes it difficult to manage and allocate resources. Dynamic addition and removal of resources (in terms of nodes) on a running instance of DiceX would therefore be an interesting and useful feature to incorporate. We have implemented the cmsketch module from the MADlib project on DiceX. The MADlib project has a rich set of analytical tools and algorithms which can be useful for spatio-temporal studies and can be implemented on DiceX.


Bibliography

[1] Apache Cassandra. http://en.wikipedia.org/wiki/Apache_Cassandra. Accessed: 2014-08-23.

[2] Clusterpoint. http://en.wikipedia.org/wiki/Clusterpoint. Accessed: 2014-08-23.

[3] Druid. http://en.wikipedia.org/wiki/Druid_(open-source_data_store). Accessed: 2014-08-23.

[4] Ebola virus disease. http://www.who.int/mediacentre/factsheets/fs103/en/. Accessed: 2014-12-02.

[5] Foundationdb. http://en.wikipedia.org/wiki/FoundationDB. Accessed: 2014-08-23.

[6] H1n1 flu. http://www.cdc.gov/h1n1flu/cdcresponse.htm. Accessed: 2014-12-02.

[7] Hadoop. http://hadoop.apache.org. Accessed: 2014-08-23.

[8] Hive. https://cwiki.apache.org/confluence/display/Hive/Home. Accessed:2014-08-23.

[9] Nuodb. http://en.wikipedia.org/wiki/NuoDB. Accessed: 2014-08-23.

[10] Postgres-xc. https://wiki.postgresql.org/wiki/Postgres-XC. Accessed: 2014-08-23.

[11] Riak. http://en.wikipedia.org/wiki/Riak. Accessed: 2014-08-23.

[12] Tandem computers. http://en.wikipedia.org/wiki/Tandem_Computers. Accessed:2014-08-23.

[13] Teradata. www.teradata.com. Accessed: 2014-08-23.

[14] Vertica. http://en.wikipedia.org/wiki/Vertica. Accessed: 2014-08-23.


[15] World population. http://www.worldometers.info/world-population/. Accessed:2014-11-12.

[16] Daniel J Abadi, Peter A Boncz, and Stavros Harizopoulos. Column-oriented database systems. Proceedings of the VLDB Endowment, 2(2):1664–1665, 2009.

[17] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin. Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for analytical workloads. Proceedings of the VLDB Endowment, 2(1):922–933, 2009.

[18] Ablimit Aji, Fusheng Wang, Hoang Vo, Rubao Lee, Qiaoling Liu, Xiaodong Zhang, and Joel Saltz. Hadoop gis: a high performance spatial data warehousing system over mapreduce. Proceedings of the VLDB Endowment, 6(11):1009–1020, 2013.

[19] Andrea Apolloni, VS Anil Kumar, Madhav V Marathe, and Samarth Swarup. Computational epidemiology in a connected world. Computer, 42(12):83–86, 2009.

[20] Shivnath Babu and Herodotos Herodotou. Massively parallel databases and mapreduce systems. Foundations and Trends in Databases, 5(1):1–104, 2013.

[21] Ashutosh Bapat, Koichi Suzuki, and Michael Paquier. Configuring write scalable postgresql cluster: Postgres-xc primer and more. PGCon 2012 - PostgreSQL Conference for Users and Developers, University of Ottawa, Ottawa, 2012.

[22] Christopher L Barrett, Keith R Bisset, Stephen G Eubank, Xizhou Feng, and Madhav V Marathe. Episimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, page 37. IEEE Press, 2008.

[23] Rachel E Behrman, Joshua S Benner, Jeffrey S Brown, Mark McClellan, Janet Woodcock, and Richard Platt. Developing the sentinel system: a national resource for evidence development. New England Journal of Medicine, 364(6):498–499, 2011.

[24] Keith R Bisset, Jiangzhuo Chen, Xizhou Feng, VS Kumar, and Madhav V Marathe. Epifast: a fast algorithm for large scale realistic epidemic simulations on distributed memory systems. In Proceedings of the 23rd international conference on Supercomputing, pages 430–439. ACM, 2009.

[25] Keith R Bisset, Jiangzhuo Chen, Xizhou Feng, Yifei Ma, and Madhav V Marathe. Indemics: an interactive data intensive framework for high performance epidemic simulation. In Proceedings of the 24th ACM International Conference on Supercomputing, pages 233–242. ACM, 2010.


[26] Haran Boral, William Alexander, Larry Clay, George Copeland, Scott Danforth, Michael Franklin, Brian Hart, Marc Smith, and Patrick Valduriez. Prototyping bubba, a highly parallel database system. Knowledge and Data Engineering, IEEE Transactions on, 2(1):4–24, 1990.

[27] Dhruba Borthakur. Hdfs architecture guide. HADOOP APACHE PROJECT, http://hadoop.apache.org/common/docs/current/hdfs_design.pdf, 2008.

[28] Clark Bradley, Ralph Hollinshead, Scott Kraus, Jason Lefler, and Roshan Taheri. Data modeling considerations in hadoop and hive. 2013.

[29] Arnaud Chiolero. Big data in epidemiology: too big to fail? Epidemiology, 24(6):938–939, 2013.

[30] Cloudera, Inc. Cloudera manager. http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise/cloudera-manager.html. Accessed: 2014-08-23.

[31] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M Hellerstein, and Caleb Welton. Mad skills: new analysis practices for big data. Proceedings of the VLDB Endowment, 2(2):1481–1492, 2009.

[32] Neil Conway. Query execution techniques in postgresql. 2007.

[33] Graham Cormode and Marios Hadjieleftheriou. Finding the frequent items in streams of data. Communications of the ACM, 52(10):97–105, 2009.

[34] Graham Cormode and S Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

[35] Graham Cormode and S Muthukrishnan. Approximating data with the count-min sketch. IEEE Software, 29(1):64–69, 2012.

[36] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[37] Ugur Demiryurek, Farnoush Banaei-Kashani, and Cyrus Shahabi. Transdec: A spatiotemporal query processing framework for transportation systems. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pages 1197–1200. IEEE, 2010.

[38] David DeWitt and Jim Gray. Parallel database systems: the future of high performance database systems. Communications of the ACM, 35(6):85–98, 1992.

[39] David J DeWitt, Shahram Ghandeharizadeh, Donovan A. Schneider, Allan Bricker, Hui-I Hsiao, and Rick Rasmussen. The gamma database machine project. Knowledge and Data Engineering, IEEE Transactions on, 2(1):44–62, 1990.


[40] Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, and Jorg Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proceedings of the VLDB Endowment, 3(1-2):515–529, 2010.

[41] Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, and Jorg Schad. Only aggressive elephants are fast elephants. Proceedings of the VLDB Endowment, 5(11):1591–1602, 2012.

[42] Christos Doulkeridis and Kjetil Nørvag. A survey of large-scale analytical query processing in mapreduce. The VLDB Journal, 23(3):355–380, 2014.

[43] Todd Eavis and Ahmad Taleb. Query optimization and execution in a parallel analytics dbms. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 897–908. IEEE, 2012.

[44] Ahmed Eldawy and Mohamed F Mokbel. A demonstration of spatialhadoop: an efficient mapreduce framework for spatial data. Proceedings of the VLDB Endowment, 6(12):1230–1233, 2013.

[45] Jacques Esteve, Ellen Benhamou, and Luc Raymond. Statistical methods in cancer research. volume iv. descriptive epidemiology. IARC Sci Publ, 128:13, 1994.

[46] Lars George. HBase: the definitive guide. O'Reilly Media, Inc., 2011.

[47] Goetz Graefe. Volcano - an extensible and parallel query evaluation system. Knowledge and Data Engineering, IEEE Transactions on, 6(1):120–135, 1994.

[48] Postgres-XC Development Group. User defined aggregates. http://postgres-xc.sourceforge.net/docs/1_0/xaggr.html. Accessed: 2014-08-23.

[49] Postgres-XC Development Group. User defined functions. http://postgres-xc.sourceforge.net/docs/1_0/xfunc.html. Accessed: 2014-08-23.

[50] Tony H Grubesic and Elizabeth A Mack. Spatio-temporal interaction of urban crime. Journal of Quantitative Criminology, 24(3):285–306, 2008.

[51] Anja Gruenheid, Edward Omiecinski, and Leo Mark. Query optimization using column statistics in hive. In Proceedings of the 15th Symposium on International Database Engineering & Applications, pages 97–105. ACM, 2011.

[52] Ragini Gupta. A Framework for Data Quality for Synthetic Information. PhD thesis, Virginia Polytechnic Institute and State University, 2014.

[53] Christer A Hansen. Optimizing hadoop for the cluster. Institute for Computer Science, University of Tromsø, Norway, http://oss.csie.fju.edu.tw/~tzu98/Optimizing%20Hadoop%20for%20the%20cluster.pdf. Retrieved online October, 2012.


[54] Joseph M Hellerstein, Christopher Re, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, et al. The madlib analytics library: or mad skills, the sql. Proceedings of the VLDB Endowment, 5(12):1700–1711, 2012.

[55] Bernd Helmle. Writing a foreign data wrapper. PGConf.EU 2013 - PostgreSQL Conference Europe 2013, Credativ GmbH, Dublin (Ireland), 2013.

[56] Miguel A Hernan and David A Savitz. From "big epidemiology" to "colossal epidemiology": when all eggs are in one basket. Epidemiology, 24(3):344–345, 2013.

[57] Eben Hewitt. Cassandra: the definitive guide. O'Reilly Media, Inc., 2010.

[58] Kirak Hong, Beate Ottenwalder, and Umakishore Ramachandran. Scalable spatio-temporal analysis on distributed camera networks. In Intelligent Distributed Computing VII, pages 131–140. Springer, 2014.

[59] S Idreos and S Madden. The design and implementation of modern column-oriented database systems.

[60] Stratos Idreos, Fabian Groffen, Niels Nes, Stefan Manegold, Sjoerd Mullender, and Martin Kersten. Monetdb: Two decades of research in column-oriented database architectures. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 35(1):40–45, 2012.

[61] Ming-Yee Iu and Willy Zwaenepoel. Hadooptosql: a mapreduce query optimizer. In Proceedings of the 5th European conference on Computer systems, pages 251–264. ACM, 2010.

[62] Ognjen V Joldzic and Dijana R Vukovic. The impact of cluster characteristics on hiveql query optimization. In Telecommunications Forum (TELFOR), 2013 21st, pages 837–840. IEEE, 2013.

[63] M Farrukh Khan, Ray Paul, Ishfaq Ahmed, and Arif Ghafoor. Intensive data management in parallel systems: A survey. Distributed and Parallel Databases, 7(4):383–414, 1999.

[64] Amit Khandekar. A shared-nothing cluster system: Postgres-xc. PGOpen 2012 - PostgreSQL Conference for Users and Developers, University of Ottawa, Chicago, 2012.

[65] Sriram Krishnan, Mahidhar Tatineni, and Chaitanya Baru. myhadoop - hadoop-on-demand on traditional hpc resources. San Diego Supercomputer Center Technical Report TR-2011-2, University of California, San Diego, 2011.

[66] Chuck Lam. Hadoop in action. Manning Publications Co., 2010.


[67] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon. Parallel data processing with mapreduce: a survey. ACM SIGMOD Record, 40(4):11–20, 2012.

[68] Steve Lohr. The age of big data. New York Times, 11, 2012.

[69] Nikos Mamoulis, Huiping Cao, George Kollios, Marios Hadjieleftheriou, Yufei Tao, and David W Cheung. Mining, indexing, and querying historical spatiotemporal data. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 236–245. ACM, 2004.

[70] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela H Byers. Big data: The next frontier for innovation, competition, and productivity. 2011.

[71] Madhav Marathe and Anil Kumar S Vullikanti. Computational epidemiology. Communications of the ACM, 56(7):88–96, 2013.

[72] Andrew McAfee, Erik Brynjolfsson, Thomas H Davenport, DJ Patil, and Dominic Barton. Big data. The management revolution. Harvard Bus Rev, 90(10):61–67, 2012.

[73] Jaymie R Meliker and Chantel D Sloan. Spatio-temporal epidemiology: principles and opportunities. Spatial and spatio-temporal epidemiology, 2(1):1–9, 2011.

[74] Jens Meyer, Stefan Ostrzinski, Daniel Fredrich, Christoph Havemann, Janina Krafczyk, and Wolfgang Hoffmann. Efficient data management in a large-scale epidemiology research project. Computer methods and programs in biomedicine, 107(3):425–435, 2012.

[75] Bruce Momjian. PostgreSQL: introduction and concepts, volume 192. Addison-Wesley, New York, 2001.

[76] Raghotham Murthy and Rajat Goel. Peregrine: Low-latency queries on hive warehouse data. XRDS: Crossroads, The ACM Magazine for Students, 19(1):40–43, 2012.

[77] Virginia Bioinformatics Institute NDSSL. Flucaster. http://ndssl.vbi.vt.edu/appsv2.1/flucaster/index.html. Accessed: 2014-08-23.

[78] EnterpriseDB Corporation, NTT Open Source Software Center. Postgres-xc architecture, implementation and evaluation, version 0.900. http://postgres-xc.sourceforge.net/misc-docs/PG-XC_Architecture.pdf, 2010. Accessed: 2014-08-23.

[79] M Tamer Ozsu and Patrick Valduriez. Principles of distributed database systems. Springer, 2011.


[80] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J Abadi, David J DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 165–178. ACM, 2009.

[81] Liliya Rudko. Column-oriented database systems.

[82] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10. IEEE, 2010.

[83] Manoj Kumar Singh and Parveen Kumar. Hadoop: A big data management framework for storage, scalability, complexity, distributed files and processing of massive datasets.

[84] Leonid B Sokolinsky. Survey of architectures of parallel database systems. Programming and Computer Software, 30(6):337–346, 2004.

[85] Michael Stonebraker, Randy H Katz, David A Patterson, and John K Ousterhout. The design of xprs. In VLDB, pages 318–330, 1988.

[86] Michael Stonebraker and Greg Kemnitz. The postgres next generation database management system. Communications of the ACM, 34(10):78–92, 1991.

[87] Mike Stonebraker, Daniel J Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, et al. C-store: a column-oriented dbms. In Proceedings of the 31st international conference on Very large data bases, pages 553–564. VLDB Endowment, 2005.

[88] Haoyu Tan. A hadoop-based storage system for big spatio-temporal data analytics. 2012.

[89] Haoyu Tan, Wuman Luo, and Lionel M Ni. Clost: a hadoop-based storage system for big spatio-temporal data analytics. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2139–2143. ACM, 2012.

[90] CORPORATE Tandem Performance Group. A benchmark of non-stop sql on the debit credit transaction. In Proceedings of the International Conference on Management of Data, pages 337–341, 1988.

[91] Ronald C Taylor. An overview of the hadoop/mapreduce/hbase framework and its current applications in bioinformatics. BMC bioinformatics, 11(Suppl 12):S1, 2010.

[92] Gautam Thakur, Yibin Wang, and Kun Li. Fastmod: A framework for realtime spatio-temporal monitoring of mobility data. Department of Computer & Information Science & Engineering, University of Florida, Tech. Rep, 2011.


[93] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. Hive - a petabyte scale data warehouse using hadoop. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pages 996–1005. IEEE, 2010.

[94] Kathleen Ting and Jarek Jarcec Cecho. Apache Sqoop Cookbook. O'Reilly Media, Inc., 2013.

[95] Sengwee Toh and Richard Platt. Is size the next big thing in epidemiology? Epidemiology, 24(3):349–351, 2013.

[96] NDSSL, Virginia Bioinformatics Institute. User manual for didactic, version 1.3. http://ndssl.vbi.vt.edu/didactic/DidacticUserManual.pdf, 2009. Accessed: 2014-08-23.

[97] Apichon Witayangkurn, Teerayut Horanont, and Ryosuke Shibasaki. The design of large scale data management for spatial analysis on mobile phone dataset. Asian Journal of Geoinformatics, 13(3), 2013.

[98] Petrie Wong, Zhian He, and Eric Lo. Parallel analytics as a service. In Proceedings of the 2013 international conference on Management of data, pages 25–36. ACM, 2013.

[99] Sai Wu, Feng Li, Sharad Mehrotra, and Beng Chin Ooi. Query optimization for massively parallel data processing. In Proceedings of the 2nd ACM Symposium on Cloud Computing, page 12. ACM, 2011.

[100] Ying Yan, Liang Jeff Chen, and Zheng Zhang. Error-bounded sampling for analytics on big sparse data. Proceedings of the VLDB Endowment, 7(13), 2014.

[101] Yuan Yu, Pradeep Kumar Gunda, and Michael Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 247–260. ACM, 2009.

[102] Mingxuan Yuan, Ke Deng, Jia Zeng, Yanhua Li, Bing Ni, Xiuqiang He, Fei Wang, Wenyuan Dai, and Qiang Yang. Oceanst: a distributed analytic system for large-scale spatiotemporal mobile broadband data. Proceedings of the VLDB Endowment, 7(13), 2014.

[103] Yunqin Zhong, Xiaomin Zhu, and Jinyun Fang. Elastic and effective spatio-temporal query processing scheme on hadoop. In Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data, pages 33–42. ACM, 2012.

[104] Marcin Zukowski, Mark van de Wiel, and Peter Boncz. Vectorwise: A vectorized analytical dbms. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pages 1349–1350. IEEE, 2012.


Appendix A

Notations

• Table names in mathcal

– Demography Info : DI

– SES : SES

• column names as mathpzc fonts

– age : a

– gender : g

– exposed time : e

– county id : c

– pid : p

– blockgroupid : b

• Relational operators either as LaTeX symbols, Greek symbols, or mathtt fonts

– Remote Query rq

– Select σ

– filter F

– Project Π

– Join ./

– Backward Index Scan BIS

– Index Only Scan IoS

– Index Scan IS

– Sequential Scan SS


– Hash Join ./H

– Nested Loop join ./L

– Merge Join ./M

– Nested Loop NL

– Sort N

– Condition JK

– Order By O

– Group By Γ

– Group Aggregate ΓG

– Hash Aggregate ΓH

– Sorted Aggregate ΓS

– Composition ◦


Appendix B

Table Schema

– SES XX SESID

    PID int
    REP int
    EXPOSED_TIME int
    INFECTIOUS_TIME int
    RECOVERED_TIME int

– XX DEMOGRAPHY INFO

    PID int
    HID int
    AGE int
    GENDER int
    ZIPCODE varchar
    BLOCKGROUPID varchar
    LONGITUDE varchar
    LATITUDE varchar
    COUNTY varchar
    COUNTYID varchar
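For completeness, the two schemas could be declared in DiceX roughly as follows. DISTRIBUTE BY HASH is the Postgres-XC extension to CREATE TABLE used for hash distribution; the choice of PID as the distribution key mirrors the discussion in Chapter 7, and the type of SESID is an assumption since the listing above does not specify it.

-- Sketch of the two tables, hash distributed on pid across the datanodes.
CREATE TABLE ses_xx (
    sesid            int,   -- type assumed
    pid              int,
    rep              int,
    exposed_time     int,
    infectious_time  int,
    recovered_time   int
) DISTRIBUTE BY HASH (pid);

CREATE TABLE xx_demography_info (
    pid           int,
    hid           int,
    age           int,
    gender        int,
    zipcode       varchar,
    blockgroupid  varchar,
    longitude     varchar,
    latitude      varchar,
    county        varchar,
    countyid      varchar
) DISTRIBUTE BY HASH (pid);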
