
Spring 2015

CMPE 226 – Database Systems

Big Data in Health Care Project Report – Team 3

Anushree Sonni 009400534

[email protected]

Yeshwanth Ravindra 009318400

[email protected]

Karthik Kolichelimi Venkatrao 009299693

[email protected]

Department of Computer Engineering, San José State University

1 Washington Sq, San Jose, CA 95192

Abstract — In a medical emergency, people are often unsure which medical center to choose: they do not know the nearest center offering the required service, or which of the centers in their vicinity provides the best diagnostic service at a reasonable fee. Our application answers these questions and more, and proves to be a handy solution for people in need. In this project we focus mainly on how one large body of medical data can be analyzed into a variety of statistics by combining several frameworks and tools. This medical big data conceals a wealth of information; harnessed appropriately, it can provide meaningful, valuable, and helpful statistics.

I. INTRODUCTION

In today's ever-expanding world, every individual needs to be aware of emerging medical advancements and technologies and of the services offered by the medical centers around them. Although a few websites make medical data available for reference, there is a strong need for a dedicated application that mines the large data sets behind these sites. One could, of course, keep up with the latest news regularly, but it is better to have a tool that puts every possible analysis of, and piece of information about, the medical centers a button-click away.

The data provided here under the Data Input includes hospital-specific charges for more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare at a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG).

The project focuses mainly on three types of users: the Government, the Public, and the Hospital. The Government can see statistics on the charges and the tax reimbursement to be issued. The Public can choose a DRG based on cost, location, and the service provided. Hospital administrators can check the financial statements for their DRGs to monitor income and finances; the analysis also provides valuable information about competitors.

The intent of the project is to analyze this unstructured medical data, which can be broken down by the provider's name, address, city, and state, and referenced by the zip code of a particular area.

This analysis will be helpful in finding out:
1. The average cost incurred per hospital within a particular state or region, so that patients can look for an affordable hospital when needed.
2. The number of patients within a hospital being treated for a similar disease.
3. How widespread a disease is and in which area or region.

The outcome of the analysis is presented visually via a user-friendly, interactive dashboard, which also supports drill-down and drill-through exploration of the analysis.

Through this analysis a user can easily find the most affordable hospital for them, categorized by region or state, identify a widespread disease they should be alert to in their area, and check the popularity of a hospital based on its number of discharges.

II. TOOLS/SOFTWARE USED IN THE PROJECT

For the successful implementation of the project, a set of tools and approaches is needed. The project implementation phase involves:

1. Project Initialization:
This involves obtaining the raw file in CSV form from the Data.CMS.gov site. Once the data is received, a cleaning action is performed on the sheet.
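As a minimal sketch of this initialization step, the R snippet below fetches the IPPS CSV and takes a first look at it. The rows.csv export URL is an assumption based on the dataset id (97k6-zzx3) cited later in this report, and the local file name is illustrative.

# Minimal sketch: fetch the IPPS provider summary CSV and inspect it before cleaning.
# The export URL and local file name are assumptions for this sketch.
url  <- "https://data.cms.gov/api/views/97k6-zzx3/rows.csv?accessType=DOWNLOAD"
dest <- "ipps_provider_summary_fy2011.csv"

download.file(url, destfile = dest, mode = "wb")    # grab the raw file
ipps <- read.csv(dest, stringsAsFactors = FALSE)    # load it as a data frame

str(ipps)      # column names and types
head(ipps, 3)  # a quick peek at the first rows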

Page 2: project report.pdf

Spring 2015

Figure 1: End to End Flow

1.1 System Resources:

1.1.1 Minimum resources: While there are no guarantees on the minimum resources required by Hadoop daemons, the community attempts not to increase requirements within a minor release. Apache Hadoop is known to work reasonably well on operating systems such as GNU/Linux, Microsoft Windows, Apple Mac OS X, and Solaris.

1.1.2 Framework: Apache Hadoop
Apache Hadoop is an open-source software framework for distributed storage and processing of big data on clusters of commodity hardware. Its Hadoop Distributed File System (HDFS) splits files into large blocks (64 MB or 128 MB by default) and distributes the blocks among the nodes in the cluster. For processing, Hadoop MapReduce ships the computation to the nodes that already hold the required data, and those nodes then process the data in parallel; this approach leverages data locality. The cleansed DRG file is transferred to the Hadoop file system.
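As an illustration of that last step, the following sketch copies the cleansed file into HDFS from an R session by shelling out to the standard hadoop fs commands; the local and HDFS paths are assumptions.

# Minimal sketch: push the cleansed file into HDFS from R via the hadoop CLI.
local_file <- "ipps_cleaned.csv"     # produced by the cleansing step (assumed name)
hdfs_dir   <- "/user/team3/ipps"     # illustrative HDFS target directory

system(paste("hadoop fs -mkdir", hdfs_dir))             # create the target directory
system(paste("hadoop fs -put", local_file, hdfs_dir))   # upload the file
system(paste("hadoop fs -ls", hdfs_dir))                # verify the file landed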

1.1.3 Database: HIVE
The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

1.1.4 Technologies Used: Languages: R
R is a free software programming language widely used among statisticians and data miners. R is an implementation of the S programming language combined with lexical scoping semantics; S was created by John Chambers while at Bell Labs. In this project, R is used to categorize each DRG into three levels, HIGH, MEDIUM, and LOW, based on its expense, and aggregation features of R such as min, max, mean, and summary are exploited.
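A minimal sketch of this categorization step is shown below; the tertile cut points and the sample values are illustrative assumptions, and the project's actual function is shown in Figure 2.

# Bucket each DRG's average expense into LOW / MEDIUM / HIGH using tertiles.
categorize_expense <- function(x) {
  breaks <- quantile(x, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)
  cut(x, breaks = breaks, labels = c("LOW", "MEDIUM", "HIGH"), include.lowest = TRUE)
}

# Illustrative per-DRG average covered charges (in dollars).
avg_charges <- c(32000, 5400, 87500, 12900, 240000, 18750)

categorize_expense(avg_charges)   # e.g. MEDIUM LOW HIGH LOW HIGH MEDIUM

# Aggregation features mentioned above.
min(avg_charges); max(avg_charges); mean(avg_charges)
summary(avg_charges)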


Figure 2: R function

Tableau is business intelligence software that allows anyone to easily connect to data, then visualize and create interactive, sharable dashboards. It is easy enough that any Excel user can learn it, but powerful enough to satisfy even the most complex analytical problems. Securely sharing your findings with others takes only seconds. The result is BI software that you can trust to actually deliver answers to the people who need them.

Today's organizations need efficient, scalable, and easily deployable business intelligence tools in order to accomplish their goals. All too often, onboarding a new BI tool is an effort of weeks, months, or even years, and maintenance is punctuated by a never-ending stream of user requests and expensive consulting bills.

Tableau takes a different approach. Installing Tableau Desktop takes minutes. Once it is installed, anyone can connect to data with a click and create interactive, analytical dashboards. Sharing dashboards is just as easy: simply publish them to either Tableau Server (on-premise) or Tableau Online (Tableau Server in the cloud). Even large enterprise deployments can be achieved with ease using Tableau's Drive methodology.


The dashboards are published to the Tableau Public server, so that anyone in need of the information can easily access them.

1.2 Hardware
1.2.1 Architecture:

Figure 3: Hadoop Master Slave Architecture

In Hadoop, the machines are categorized mainly into Slave nodes, Master nodes, and Client machines.

Slave Node:
Every slave node consists of a Task Tracker and a Data Node. The job of the Task Tracker is to process the small pieces of work given to the node, while the Data Node manages the data stored on the node. As the data requirement increases, more systems with this pattern are added, thus forming a cluster.

Master Node:
The master server contains the Job Tracker and the Name Node. The task of the Job Tracker is to accept a job from the client, break it into smaller pieces, and assign those tasks to the Slave nodes. The Name Node keeps track of which Data Node the data is located on. Whenever the client wants to write or read a file, it talks to the Name Node to find out the location. The Job Tracker and the Name Node are responsible for detecting failures in the Task Trackers and Data Nodes, respectively.

The Task Tracker and Job Tracker form the MapReduce component, while the Name Node and Data Nodes form the Hadoop Distributed File System (HDFS) component of Hadoop.

Client Machine:
The task of the client machine is to describe how the data is to be processed (the MapReduce job), load the data into HDFS, and then fetch the results when the job is done.

Figure 4: HDFS Architecture

HDFS is designed to work on commodity hardware such as personal computers; however, Hadoop is mainly run on servers. HDFS works well with large data sets by offering fast access to application data, and it is very reliable because it is highly fault tolerant.

There is only one Namenode in an HDFS cluster. The Namenode is the master server in the cluster; it is in charge of the file system namespace and controls file access by clients. Closing, opening, or renaming a file is a duty executed by the Namenode, as is assigning blocks to Datanodes.

The number of Datanodes is defined by the number of nodes in the cluster. Datanodes manage the data of the node to which they are attached. User data is stored in files that, as arranged by HDFS, are fragmented into one or more blocks which are then stored on Datanodes. When a read or write request is received from a client, the Datanode takes charge of the operation. A Datanode can also create, delete, and replicate blocks when instructed by the Namenode.

The Namenode and Datanode are software designed to work on commodity machines. HDFS is written in Java, so any machine that can run Java can run the Namenode and Datanode software.

The file system is designed in a similar fashion to other existing file systems. The Namenode records any changes to the file system and keeps track of the number of times a file has been replicated. HDFS maintains the number of replicas of a file according to a number specified by the application, and it replicates files in order to maintain high fault tolerance. Files are stored as a sequence of blocks; the block size and the number of replicas are set by the application. Replication of blocks is managed by the Namenode.

MapReduce is a software framework that assists in writing programs that handle large amounts of data across thousands of nodes. It is divided into two parts: map and reduce. The map part distributes the work that needs to be processed across separate nodes. The reduce part takes the output of the map phase and produces a single output. Pairing MapReduce with HDFS works well because HDFS provides high bandwidth across a large cluster.
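To make the two phases concrete, here is a small local R illustration of the map/reduce pattern applied to this kind of data, counting discharges per state; the records shown are made up, and in the project such computations run as Hadoop MapReduce jobs rather than in a single R session.

# Local illustration of map and reduce (not a Hadoop job).
records <- data.frame(
  Provider.State   = c("CA", "CA", "TX", "UT"),
  Total.Discharges = c(25, 40, 31, 18),
  stringsAsFactors = FALSE
)

# Map phase: emit one (state, discharges) key/value pair per record.
pairs <- Map(function(state, discharges) list(key = state, value = discharges),
             records$Provider.State, records$Total.Discharges)

# Shuffle/reduce phase: group the pairs by key and sum the values.
keys    <- sapply(pairs, `[[`, "key")
values  <- sapply(pairs, `[[`, "value")
reduced <- tapply(values, keys, sum)

print(reduced)   # discharges per state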

SQL Server Reporting Services is a server-based reporting platform that provides comprehensive reporting functionality for a variety of data sources. Reporting Services includes a complete set of tools to create, manage, and deliver reports, and APIs that enable developers to integrate or extend data and report processing in custom applications. Reporting Services tools work within the Microsoft Visual Studio environment and are fully integrated with SQL Server tools and components.

HIVE:

The main components of Hive are:

a) External Interfaces - Hive provides both user interfaces, such as a command line (CLI) and a web UI, and application programming interfaces (APIs) such as JDBC and ODBC.

b) The Hive Thrift Server exposes a very simple client API to execute HiveQL statements. Thrift is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages. The Thrift Hive clients generated in different languages are used to build common drivers like JDBC (Java), ODBC (C++), and scripting drivers written in PHP, Perl, Python, etc.

c) The Metastore is the system catalog. All other components of Hive interact with the metastore.

d) The Driver manages the life cycle of a HiveQL statement during compilation, optimization, and execution. On receiving the HiveQL statement from the Thrift server or other interfaces, it creates a session handle which is later used to keep track of statistics like execution time, number of output rows, etc.

e) The Compiler is invoked by the driver upon receiving a HiveQL statement. The compiler translates this statement into a plan which consists of a DAG of MapReduce jobs.

f) The driver submits the individual MapReduce jobs from the DAG to the Execution Engine in topological order. Hive currently uses Hadoop as its execution engine.

g) Database - a namespace for tables. The database named default is used for tables with no user-supplied database name.

h) Table - metadata for a table contains the list of columns and their types, owner, storage, and SerDe information. It can also contain any user-supplied key and value data; this facility can be used to store table statistics in the future. Storage information includes the location of the table's data in the underlying file system, data formats, and bucketing information. SerDe metadata includes the implementation class of the serializer and deserializer methods and any supporting information required by that implementation. All this information can be provided during the creation of the table.

i) Partition - each partition can have its own columns and SerDe and storage information. This can be used in the future to support schema evolution in Hive.

Figure 5: HIVE Architecture Diagram
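To see the driver and compiler pipeline from items d) to f) in action, one can ask Hive to print the plan it builds for a query. Below is a minimal sketch, run from R by shelling out to the hive CLI; the table and column names are the assumptions used in the other sketches in this report.

# Ask Hive to show the DAG of MapReduce stages compiled for a simple query.
query <- "
  EXPLAIN
  SELECT provider_state, AVG(average_covered_charges)
  FROM ipps
  GROUP BY provider_state;
"
plan <- system(paste("hive -e", shQuote(query)), intern = TRUE)  # run via the hive CLI
cat(plan, sep = "\n")                                            # stage graph and operator tree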

1.3 System Configuration:

We need to configure the system properly to get the most optimal performance from our installation. The different system configurations are explained below.

1.3.1 Apache Hadoop

For the Apache Hadoop installation, we need a stable Ubuntu Linux release (14.04.1 LTS, the latest release as of 4 November 2014), the Hadoop tar file (version 1.2.1) to install Hadoop, the latest Java version, and SSH.

1.3.2 Tableau

• Microsoft Windows Server 2012, 2012 R2, 2008, 2008 R2, 2003 R2 SP2 or higher; Windows 8 or 7
• 32-bit or 64-bit versions of Windows
• Minimum of a Pentium 4 or AMD Opteron processor
• 32-bit color depth recommended

2. Project Operation:

This is the practical management of the project. Here, project inputs are transformed into outputs to achieve the immediate objectives.


Figure 6 below shows how the data set is input and analyzed: the output is obtained, categorized, and refined on user demand, for example for a particular region or state. The user can further search for average Medicare payments as well as total payments.

Figure 6: Data Flow

III. PROJECT DATA USED

The data input for our project is taken from Data.CMS.gov (https://data.cms.gov/Medicare/Inpatient-Prospective-Payment-System-IPPS-Provider/97k6-zzx3) and contains unstructured data (medical records) and structured data (location-specific data). The data set is given by CMS and was updated on 06/02/14; the original FY2011 data file has been updated to include a new column, "Average Medicare Payment." The data provided here include hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare at a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG) for Fiscal Year (FY) 2011. These DRGs represent more than 7 million discharges, or 60 percent of total Medicare IPPS discharges. Hospitals determine what they will charge for items and services provided to patients, and these charges are the amounts the hospital bills for an item or service. The Total Payment amount includes the MS-DRG amount, bill total per diem, beneficiary primary payer claim payment amount, beneficiary Part A coinsurance amount, beneficiary deductible amount, beneficiary blood deductible amount, and DRG outlier amount.

For these DRGs, average charges, average total payments, and average Medicare payments are calculated at the individual hospital level. Users are able to compare the amounts charged by individual hospitals within local markets, and nationwide, for services that might be furnished in connection with a particular inpatient stay.

The definitions of the terms used in the data set are as follows:

DRG: Code and description identifying the DRG. DRGs are a classification system that groups similar clinical conditions (diagnoses) and the procedures furnished by the hospital during the stay.

Provider ID: Provider identifier billing for inpatient hospital services.

Provider Name: Name of the provider.

Provider Street Address: Street address at which the provider is physically located.

Provider City: City in which the provider is physically located.

Provider State: State in which the provider is physically located.

Provider Zip Code: Zip code in which the provider is physically located.

Hospital Referral Region Description: HRR in which the provider is physically located.

Total Discharges: The number of discharges billed by the provider for inpatient hospital services.

Average Covered Charges: The provider's average charge for services covered by Medicare for all discharges in the DRG. These will vary from hospital to hospital because of differences in hospital charge structures.

Average Total Payments: The average of Medicare payments to the provider for the DRG, including the DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Also included are the co-payment and deductible amounts that the patient is responsible for.

Average Medicare Payments: The average amount Medicare pays to the provider for its share of the DRG; unlike Average Total Payments, this does not include the beneficiary co-payment and deductible amounts.

Data Set Snapshots:


Figure 7: Inpatient Prospective Payment – Part 1

Figure 8: Inpatient Prospective Payment - Part 2

The raw file had a few hiccups, such as commas and dollar signs in a few columns, which had to be removed in order to transfer the data into their intended columns in HIVE. Thus we had to cleanse the data before importing it into the Hadoop file structure.
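A minimal sketch of that cleansing step in R is shown below; the file and column names are assumptions, and the real data set may need additional fixes.

# Strip '$' and ',' from the money columns so Hive can parse them as numbers.
raw <- read.csv("ipps_provider_summary_fy2011.csv", stringsAsFactors = FALSE)

money_cols <- c("Average.Covered.Charges",
                "Average.Total.Payments",
                "Average.Medicare.Payments")

for (col in money_cols) {
  raw[[col]] <- as.numeric(gsub("[$,]", "", raw[[col]]))   # "$12,345.67" -> 12345.67
}

# Write the cleaned file without quotes or row names, ready to push into HDFS.
write.csv(raw, "ipps_cleaned.csv", row.names = FALSE, quote = FALSE)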

Figure 9: Cleaned Data

IV. BIG DATA RELATED IMPLEMENTATION PROCESS

Hadoop MapReduce ships the computation to the nodes that hold the required data, and those nodes then process the data in parallel; this approach leverages data locality. The cleansed DRG file is transferred to the Hadoop file system.

Figure 10: DRG Data on Hadoop

The reduce part takes the output of the map phase and produces a single output. Pairing MapReduce with HDFS works well because HDFS provides high bandwidth across a large cluster.


Figure 11: Map Reduce on Hadoop

A table is created in Hive with a structure similar to the columns given in the raw file. The data is then loaded from the Hadoop file system into the table.
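A minimal sketch of that table creation and load, issued from R through the hive CLI, is given below; the table layout, types, and HDFS path are assumptions rather than the project's exact schema.

# Create a Hive table matching the raw file's columns and load the cleansed data.
ddl <- "
  CREATE TABLE IF NOT EXISTS ipps (
    drg_definition            STRING,
    provider_id               INT,
    provider_name             STRING,
    provider_street_address   STRING,
    provider_city             STRING,
    provider_state            STRING,
    provider_zip_code         STRING,
    hospital_referral_region  STRING,
    total_discharges          INT,
    average_covered_charges   DOUBLE,
    average_total_payments    DOUBLE,
    average_medicare_payments DOUBLE
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

  LOAD DATA INPATH '/user/team3/ipps/ipps_cleaned.csv' INTO TABLE ipps;
"
system(paste("hive -e", shQuote(ddl)))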

Figure 12: DRG Data on HIVE

We have used Flume to capture data log files regarding Ebola on Twitter. Based on this data, we can analyze various factors such as the top 5 countries tweeting about it and the places that have been affected, and we can analyze the sentiment of the tweeters, whether they are positive, negative, or neutral about Ebola. So far, the tweets regarding Ebola have been captured by Flume and transferred to the Hadoop File System.

Figure 13: Ebola Tweets on HDFS

V. PROJECT INPUT AND OUTPUT

Traditional IT infrastructure has not been able to satisfy people in this new era of "Big Analytics". As a result, many enterprises are turning to open source projects such as the "R" statistical programming language and Hadoop as a better response to this unmet commercial need.

Hadoop is an Apache product that performs parallel processing of data across multiple systems using a programmable model. It mainly consists of HDFS and HBASE for storage and MapReduce for distributed computing. R, in turn, is a free software environment for statistical computing as well as visual representation of data. It is applied in a vast range of fields, including classification, scoring, finding relationships, characterization, ranking, and clustering.

HDFS Overview:

Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. A Hadoop cluster interconnected on a network scales out while withstanding failures without data loss. HDFS is the means of storage in Hadoop. R objects as well as other models can be stored in HDFS and later retrieved by MapReduce jobs, and a MapReduce job writes its results back to HDFS once execution is done. These results are later inspected and analyzed with R, making HDFS an essential functional unit in the process.

In order to provide a friendlier environment while working with HDFS, several layers sit on top of it; one of them is HBASE, which essentially provides table structures similar to databases. HBASE helps open up the Hadoop framework to the R programmer.


Figure 14: HBASE Overview

MapReduce - Data Reduction: The MapReduce framework is the processing pillar of the Hadoop environment. The framework applies a few specific procedures to an enormous data set, fragments the problem and the data, and runs them in parallel. The outcomes of these operations are written to HDFS/HBASE and can later be analyzed using R.

R code can be integrated with MapReduce jobs. This kind of implementation expands the kinds and sizes of analytics that can be applied to very large datasets. In this process the model is pushed to the task nodes of the Hadoop cluster; the MapReduce job then loads the model into R on each task node. Data can be either aggregated or processed row by row as required, and the results are then stored back on HDFS.

Visual representation of the datasets assists in understanding the data. Thus a binning algorithm in R is executed as a MapReduce job, and the output of this process is used as input to an R client to render the representation.
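A minimal local sketch of such a binning step is shown below; the column name and bin count are illustrative assumptions, and in the project this computation runs as a MapReduce job whose bin counts are written back to HDFS for the R client to plot.

# Bucket average covered charges into bins and count records per bin.
cleaned <- read.csv("ipps_cleaned.csv", stringsAsFactors = FALSE)
charges <- cleaned$Average.Covered.Charges     # column name is an assumption

bins       <- cut(charges, breaks = 10)   # 10 equal-width bins over the range
bin_counts <- table(bins)                 # per-bin record counts (the reduced output)

print(bin_counts)
barplot(bin_counts, las = 2, main = "Distribution of average covered charges")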

Figure 15: MapReduce – R

VI. PROJECT EXPERIMENTATION RESULTS GAINED

Data Representation:

The information is presented using a simple yet very effective, interactive, and user-friendly dashboard. Business dashboards are presented in sundry forms and in various dimensions. In this project there are 15 reports integrated into 5 dashboards. Key performance indicators (KPIs) are used in the dashboards to indicate the performance of the Medicare system, and strategy can be refined based on these indicators.

The key elements that play a crucial role in designing the dashboard are:

• It is simple and communicates easily.
• It has the least possible distractions, leading to less confusion.
• It supports the organization with handy as well as meaningful data.

The dashboard is integrated into a simple HTML page while adhering to the elements above and using the selected KPIs. Drill-down and drill-through features are integrated so that users can get detailed and very accurate information on the DRGs.

Steps followed while designing the dashboard:

1. Defining the KPIs to observe:
A lot of information is available in the Inpatient Prospective Payment System data. The following KPIs are used to display the data (a sketch of how they can be computed per state follows the list):

• Average Covered Charges
• Average Total Payments
• Average Medicare Payments
• Total Discharges
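A minimal sketch of how these KPIs can be computed per state in R, using the file and column names assumed in the earlier sketches:

# Compute the four KPIs above, grouped by provider state.
cleaned <- read.csv("ipps_cleaned.csv", stringsAsFactors = FALSE)

kpi_means <- aggregate(
  cbind(Average.Covered.Charges, Average.Total.Payments, Average.Medicare.Payments)
    ~ Provider.State,
  data = cleaned, FUN = mean)

kpi_discharges <- aggregate(Total.Discharges ~ Provider.State,
                            data = cleaned, FUN = sum)

kpis <- merge(kpi_means, kpi_discharges, by = "Provider.State")
head(kpis[order(-kpis$Average.Total.Payments), ])   # costliest states first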

2. Visualizing the data:
After defining the KPIs, the next step is to represent them using charts. Diverse charts such as pie charts, bar charts, maps, bubble charts, and tables are used.

Figure 16: Summary across DRG – MIN, MAX, MEAN


Figure 17: Details of DRG selected – DRILL THROUGH

Figure 18: Details of DRG selected – DRILL DOWN (State, City, Street, Zip code)

3. Auto-update of the dashboard: This is where the database comes into play, so that each time the dashboard is used it gives effective, up-to-date information. The frequency at which new data becomes available, and how important that information is, are the key factors considered while designing this section.

4. Export features: In order to preserve the information, an export feature is integrated into the dashboard. With its aid, the dashboard can be exported to Excel, an image, a text file (data), or a PDF.


Figure 19: Export Features

VII. PROJECT RESULT ANALYSIS

The chart below represents the data visualization done with respect to each state in the country; it categorizes the DRGs across the USA. In the following example we see the DRGs for Utah: a pie chart of the DRG categories in that state and a bubble chart for the cities within the state.

Figure 20: Categorize the DRG – STATE wise (UTAH)

Figure 21: Details of DRG city wise (Salt Lake City) – DRILL THROUGH


The above chart shows the drill-through of the DRG state-wise. We can see the bar graph plotted for DRG vs. Average Total Payments.

The bar chart below shows Average Total Payments vs. DRG. The red line in the middle indicates the target to be achieved for the hospital to earn a certain revenue. This helps the hospital administration set targets and see where they need to improve and where they are doing well.

Figure 22: Target vs Actual – across hospital

The bar chart below represents the safe areas to live in based on discharge rates (cities within a region vs. discharge rates). The user can see which hospital to go to based on these statistics and make an informed decision, choosing a hospital by area based on the number of discharges in the locality. When a particular area is clicked, the values change dynamically, giving the user the best experience to make the right decision.

Figure 23: Safe areas to live in

The bar chart below shows Total Covered Charges vs. the cities in the region. We can evaluate the insurance coverage with respect to each city in the region. This can help find a place close to the user where the charges levied by the hospital for the respective DRG are covered. Higher values indicate higher insurance amounts paid as coverage charges. The user can make a decision based on these inputs to choose the right hospital for the DRG.

Figure 24: Total Covered Charges vs. cities in the region

VIII. FUTURE SCOPE

Ebola, an infectious and generally fatal disease, has infected more than 6,500 people and claimed more than 3,000 lives in the world so far, according to the latest numbers from the WHO. That puts the fatality rate at around 47 percent. Emerging technologies are becoming extremely important in the fight against Ebola and strive to stop the further spread of the disease. Big Data technology paves the way for vast amounts of information to be combined and refined from a variety of sources while eliminating extraneous information along the way.

The future scope of our tool is to analyze data on the deadly Ebola disease, which requires being able to gather unstructured data as soon as it is generated, by any number of organizations from across the globe. Using information gathered from a wide range of sources, such as social media, updates from hospitals, and flight records and information, authorities can develop novel insights into where and how to respond. This not only helps save lives, it also makes sure that resources are allotted according to priorities.

We have used Flume to capture data log files regarding Ebola on Twitter. Based on this data, we can analyze various factors such as the top 5 countries tweeting about it and the places that have been affected, and we can analyze the sentiment of the tweeters, whether they are positive, negative, or neutral about Ebola. So far, the tweets regarding Ebola have been captured by Flume and transferred to the Hadoop File System.


Figure 25: Ebola Tweets on HDFS

IX. PROJECT SUMMARY

The Inpatient Prospective Payment System (IPPS) Provider Summary was analyzed to identify various statistics based on city, state, and other criteria. This information was presented on the website created to enable users to easily access the medical data for their area.

After studying the project requirements and the given data set, all the column fields were defined and analyzed. The given data input consists of structured and unstructured data: the medical data is unstructured, and the hospital location (street, state, zip code) is structured.

Since we had both kinds of data to handle, we used Hadoop to store both structured and unstructured data. Hadoop also enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and it can scale without limits.

We used the R statistical tool to analyze the data since it is effective and open source, so inspecting the code was easy. R is an interactive language, and it promotes experimentation and exploration. This project was a good platform to demonstrate the use of R and its benefits. The minimum system requirements were used to obtain optimal performance.

The data representation was done using a simple yet very effective, interactive, and user-friendly dashboard built in Tableau. We chose Tableau dashboards as our data representation tool because they are helpful not only in summarizing the details but also in covering the key features needed for the project. The reports are deployed to the cloud so that anyone in need of the information can easily access them.

In short, these tools helped us make smart and faster decisions within the time constraints. After the successful implementation of our project, users were able to easily select the type of DRG they desire for a preferred state, region, city, zip code, and many more search criteria, which makes it very simple and cost-effective to look up the expenses that will be incurred based on the selection. Moreover, users are able to look up the number of patients being treated in a hospital for a similar disease, so they can easily check how widespread the disease is and how effective the treatment in a hospital is. The disease-prevalent areas were also identified, which helps to alert the people nearby.

REFERENCES

[1] David, S., "The Marriage of Hadoop and R: Revolution Analytics at Hadoop World", http://www.r-bloggers.com/the-marriage-of-hadoop-and-r-revolution-analytics-at-hadoop-world/, November 11, 2011.
[2] Revolution Analytics, "Advanced 'Big Data' Analytics with R and Hadoop", 2011.
[3] "Inpatient Prospective Payment System (IPPS) Provider Summary for the Top 100 Diagnosis-Related Groups (DRG)", http://www.revolutionanalytics.com/sites/default/files/r-and-hadoop-big-data-analytics.pdf, 2011.
[4] Bart, "Creating a Business Dashboard in R", http://www.r-bloggers.com/creating-a-business-dashboard-in-r/, March 28, 2013.
[5] Katie, "The Importance of Dashboards", http://www.thetingleyadvantage.com/2013/06/the-importance-of-dashboards.html, June 20, 2013.
[6] Dhruba Borthakur, "The Hadoop Distributed File System: Architecture and Design", The Apache Software Foundation, http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf, 2007.
[7] Amirtha, T., "Why the R Programming Language is Good for Business", http://www.fastcolabs.com/3030063/why-the-r-programming-language-is-good-for-business/, May 5, 2014.
[8] Anderson, T., "Implementing Process", UTS: Project Management, http://www.projects.uts.edu.au/stepbystep/implementing.html, April 6, 2006.
[9] "Welcome to Apache™ Hadoop", Apache Hadoop Software Foundation, http://hadoop.apache.org/, April 10, 2014.
[10] "Hadoop-common", Apache Hadoop Common GitHub page, https://github.com/apache/Hadoop-common.
[11] "Managing Hadoop Projects: What You Need to Know to Succeed", TechTarget, http://searchcloudcomputing.techtarget.com/definition/MapReduce, Feb 8, 2010.
[12] Murthy, A., "Apache Hadoop YARN - Background and an Overview", http://hortonworks.com/blog/apache-Hadoop-yarn-background-and-an-overview/, August 7, 2012.
[13] Loughran, S., "PoweredBy", https://wiki.apache.org/Hadoop/PoweredBy, Feb 16, 2014.
[14] "Hadoop Tutorial 1 - what is Hadoop?", http://zerotoprotraining.com/index.php?mode=video&id=323.
[15] "Hadoop", http://www.edevzone.com/hdfs/.
[16] "How to crunch your data stored in HDFS?", http://blog.octo.com/en/how-to-crunch-your-data-stored-in-hdfs/.
[17] "Get started on Hadoop", http://hortonworks.com/tutorials/.
[18] "Hadoop Distributed File System (HDFS) Introduction", http://hortonworks.com/Hadoop/hdfs/.
[19] "Adventures in Data", http://bigdata.wordpress.com/2010/03/22/security-in-Hadoop-part-1/.
[20] "Understanding Hadoop Clusters and the Network", http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/.
[21] "HDFS Architecture", http://hadoop.apache.org/docs/r0.19.0/hdfs_design.html, April 21,