Preparing a Dataset for Processing

Contents

Introduction

Obtaining data from an original source

ETL process (Extract, Transform and Load)

Using the SAS University Edition Virtual Machine

Using the dataset for production

Author : Manish Chopra

Date : 15th April 2017

Introduction

This tutorial shows how to prepare a new dataset for processing by joining multiple Excel files into a single large CSV (Comma-Separated Values) file. The resulting dataset can be used with RDBMS systems and with Big Data based NoSQL systems. When scaling up towards a larger production system, it is desirable, and often essential, to have many such datasets with prospective relationships among them.

In this document, we will see how a dataset can be prepared from data files downloaded from the Indian government's open data portal. The following two examples are taken from the site:

Primary Census Abstract 2011 - India and States

Company Master Data up to 31st March 2015

The portal offers multiple datasets and categories, each with its own description. This makes it easy to design the schema of the database that you will create for your project.

Obtaining data from an original source

The Indian government data site (https://data.gov.in) publishes several datasets. The following figure shows the datasets and categories available on the site, which we can navigate to select the files to download. An API is also provided for accessing the datasets over the internet.

Figure 1 : Indian government data portal's primary page

The Indian government data site provides a rich set of data that can be put to production use. Our approach here is to start with a few categorized datasets, such as the country's population statistics and company data, and consolidate each of them into a single large dataset.

For each of the examples taken here, that is, Primary Census Data and Company Master Data, there are 35 files available on the site, each representing one state of India. These files are downloaded from the URL given above and placed in one directory. Following is an extract from the directory listing of these files.

PCA0000_2011_MDDS.xls
PCA0100_2011_MDDS.xls
.........
.........
PCA3400_2011_MDDS.xls
PCA3500_2011_MDDS.xls

ETL Process (Extract, Transform and Load)

We can either open these files manually and build a consolidated file by combining them one by one, or process them all at once in a batch. We take the latter approach, as it automates most of the manual work. The XLS files are first converted into CSV (Comma-Separated Values) format with a tool known as "XLS to CSV Converter", which batch converts all the files in a single operation.

We now have 35 CSV files as the output of the XLS to CSV conversion. Next, copy the folder containing the CSV files to a Linux machine and open a terminal. There, use the following command to generate a single file:

[linux@localhost CSVs]$ cat PCA* >> /tmp/Consolidated-Population-Dataset.csv

The command above reads all 35 files whose names start with PCA and appends their contents to /tmp/Consolidated-Population-Dataset.csv. If the header rows of these files were not removed before this operation, each header will reappear as a data row inside the new CSV file. Remove the duplicate headers from the file, and your dataset is ready. The new CSV file can be opened in MS Excel or used by other applications that support the format.
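The clean-up of duplicate headers described above could be done along the following lines. This is only a minimal sketch, not the method used to produce the dataset in this tutorial: it assumes that every source file starts with an identical header row and that the file is UTF-8 encoded, and the function names are hypothetical.

```python
from typing import Iterable, Iterator


def drop_repeated_headers(lines: Iterable[str]) -> Iterator[str]:
    """Yield the first line (the header) once, then every later line that differs from it."""
    header = None
    for line in lines:
        if header is None:
            header = line
            yield line
        elif line.rstrip("\r\n") != header.rstrip("\r\n"):
            # Assumption: a line identical to the header is a repeated header, not data.
            yield line


def clean_consolidated(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path without repeated header rows; return the lines written."""
    written = 0
    # newline="" keeps the original line endings untouched.
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        for line in drop_repeated_headers(src):
            dst.write(line)
            written += 1
    return written
```

For example, `clean_consolidated("/tmp/Consolidated-Population-Dataset.csv", "/tmp/Population-Clean.csv")` would write a cleaned copy; the output file name here is made up for illustration.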

The same process was performed for the "Company Master Data" Excel files available on the Indian government data portal.

The resulting file is approximately 433 MB in size and contains around 1.45 million rows. It could not be opened in full in MS Excel, which can open a maximum of about 1 million rows (1,048,576) per worksheet.
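Before trying to open a large CSV file in a spreadsheet, its row count can be checked against Excel's worksheet limit of 1,048,576 rows. The following is a minimal sketch with hypothetical function names. It counts physical lines, so a quoted field containing a line break would be counted as more than one row.

```python
EXCEL_MAX_ROWS = 1_048_576  # Excel's per-worksheet row limit


def count_lines(path: str, chunk_size: int = 1 << 20) -> int:
    """Count the lines of a file by reading it in binary chunks."""
    count = 0
    last_byte = b"\n"
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            count += chunk.count(b"\n")
            last_byte = chunk[-1:]
    if last_byte != b"\n":
        count += 1  # final line without a trailing newline
    return count


def fits_in_excel(path: str) -> bool:
    """Return True if every line of the file fits in one Excel worksheet."""
    return count_lines(path) <= EXCEL_MAX_ROWS
```

Reading in fixed-size chunks keeps memory use small even for a file of several hundred megabytes.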

To overcome this MS Excel limitation, we shall use SAS software, which can handle a dataset of this size. Databases such as Oracle, MySQL, SQL Server and NoSQL databases can also handle large datasets.

MS Excel displayed the following warning message when the 433 MB file was opened:

Figure 2 : Microsoft Excel restricting maximum number of rows to 1 million

Following is the complete text of the MS Excel warning message boxes shown above.

This message can appear due to one of the following:

The file contains more than 1,048,576 rows or 16,384 columns. To fix this problem, open the source file in a text editor such as Microsoft Office Word. Save the source file as several smaller files that conform to this row and column limit, and then open the smaller files in Microsoft Office Excel. If the source data cannot be opened in a text editor, try importing the data into Microsoft Office Access, and then exporting subsets of the data from Access to Excel.

The area that you are trying to paste the tab-delineated data into is too small. To fix this problem, select an area in the worksheet large enough to accommodate every delimited item.

Notes

Excel cannot exceed the limit of 1,048,576 rows and 16,384 columns.

By default, Excel places three worksheets in a workbook file. Each worksheet can contain 1,048,576 rows and 16,384 columns of data, and workbooks can contain more than three worksheets if your computer has enough memory to support the additional data.

In such a scenario, we can edit the CSV files in Linux and remove the header from each of the 35 files, either in the vi editor or with a small script that eliminates the first row of each file. This is a data cleansing step: we do not want the header rows to appear among the data rows of the consolidated dataset.
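A small script of the kind mentioned above might look like the following sketch. It is an illustration under stated assumptions rather than the script used for this tutorial: it skips the first row of every file while merging and keeps the header of the first file only, so the result still has one header row. The function name, the `PCA*.csv` pattern and the UTF-8 encoding are assumptions.

```python
import glob
import os


def _terminated(line: str) -> str:
    """Make sure a line ends with a newline so the next file starts on a fresh row."""
    return line if line.endswith("\n") else line + "\n"


def merge_without_repeated_headers(src_dir: str, dst_path: str,
                                   pattern: str = "PCA*.csv") -> int:
    """Concatenate the matching CSV files, keeping only the first file's header.

    Returns the number of files merged. dst_path should not match the
    pattern inside src_dir, or the output would be picked up as an input.
    """
    paths = sorted(glob.glob(os.path.join(src_dir, pattern)))
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        for index, path in enumerate(paths):
            with open(path, newline="", encoding="utf-8") as src:
                header = src.readline()  # first row of every file
                if index == 0 and header:
                    dst.write(_terminated(header))
                for line in src:  # remaining rows are data
                    dst.write(_terminated(line))
    return len(paths)
```

Compared with removing the headers by hand in vi, this keeps the original 35 files unchanged and can be rerun safely, because the output file is rewritten on every run instead of being appended to.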

Using the SAS University Edition Virtual Machine

SAS is a collection of many software tools: a data analysis tool, a programming language, a statistical package, a business intelligence tool, and more.

SAS University Edition runs in a virtual environment on any computer that can run VMware Player, VMware Fusion, or Oracle VirtualBox. SAS University Edition is meant for non-commercial use, and its system requirements are displayed when you download it.

The SAS University Edition uses SAS Studio as the interface. SAS Studio provides an environment that

includes a point-and-click facility for performing many common tasks, such as producing reports, graphs,

data summaries, and statistical tests. For those who either enjoy programming or have more

complicated tasks, SAS Studio also allows you to write and run your own programs.

According to the SAS website, the following are some benefits of using SAS University Edition:

Statistics and quantitative methods in a variety of areas : economics, psychology and other

social sciences, computer science, business, medical/health sciences, engineering, etc.

Introductory to advanced-level statistics and quantitative methods

SAS programming and statistical analysis

A consistent user experience across all applications

Figure 3 : Features of SAS University Edition

Features of SAS University Edition

SAS Studio - An intuitive interface lets you interact with the software from a Windows, Mac or Linux workstation.

Base SAS - A powerful programming language that is easy to learn and easy to use.

SAS/STAT - Comprehensive, reliable tools include state-of-the-art statistical methods.

SAS/IML - A robust, yet flexible matrix programming language enables more in-depth, specialized analysis and exploration.

SAS/ETS - Several time series forecasting procedures (TIMEDATA, TIMESERIES, ARIMA, ESM, UCM and TIMEID) are included.

SAS/ACCESS - Out-of-the-box access to PC file formats provides a simplified approach to accessing data.

Powerful statistical software

With SAS University Edition, you get SAS Studio, Base SAS, SAS/STAT, SAS/IML, SAS/ACCESS and several

time series forecasting procedures from SAS/ETS. It's the same world-class analytics software used by

more than 80,000 business, government and university sites around the world, including 93 of the top

100 companies on the Fortune Global 500 list. So you'll be using the most up-to-date statistical and

quantitative methods.

Fill the skills gap

By 2018, demand for workers skilled in analytics could outpace supply by 60 percent, or 1.5 million jobs, according to a McKinsey Global Institute study.

The SAS University Edition Virtual Machine can be downloaded from the SAS website at the link given below:

https://www.sas.com/en_us/software/university-edition.html

Further, you can follow the book "An Introduction to SAS University Edition" by Ron Cody to become well versed in SAS analytics. The book comes with example code and datasets for working through its exercises. To learn more, ample documentation is available on the SAS website.

Once you have set up your virtualization environment, such as VMware or VirtualBox, import the downloaded OVA file and start the SAS University Edition Virtual Machine. The VM startup looks as follows:

Figure 4 : Starting SAS University Edition Virtual Machine

After the VM loads completely, it displays a screen like the one in the image below, along with a URL for connecting to it through a web browser.

Figure 5: Terminal Screen of the SAS Virtual Machine

Now open this URL in a web browser of your choice, such as Chrome or Firefox.

Figure 6: Web GUI Screen connected to the SAS Virtual Machine

Working with the SAS VM

As previously mentioned, MS Excel has a limitation on the maximum number of rows and columns. Here

is a screenshot displaying the maximum number of rows (1048576) that MS Excel could display when

the 433 MB file was opened.

Figure 7 : Maximum rows in Microsoft Excel - 1048576

The same 433 MB CSV file was successfully imported into the SAS virtual machine, as shown in the figure

below.

Figure 8 : SAS web GUI displaying the starting row of Companies Dataset

These images are screenshots of web browser interfaces connected with SAS University Edition VM. The

previous image shows the starting range of the dataset, and below you will find the last dataset row.

Figure 9 : A total of 1459085 rows in Companies Dataset

In the image above, we see that 1459085 records were imported into the SAS virtual machine successfully.

Using the Dataset for Production

So far, we have generated a single file of 433 MB. It can either be used as a single file or be placed in a relational database or a NoSQL database, so that inferences can be drawn from it through SQL statements.
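As a rough illustration of placing the consolidated file in a relational database, the sketch below loads a CSV file into SQLite using only Python's standard library. SQLite is a stand-in chosen for this example, not one of the databases named in this tutorial, and the function name and default table name are hypothetical. The sketch assumes that the first row is a header with unique column names and that the file is UTF-8 encoded. Every column is stored as text.

```python
import csv
import sqlite3


def load_csv_into_sqlite(csv_path: str, db_path: str, table: str = "companies") -> int:
    """Create `table` from the CSV header, insert the data rows, and return the row count."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # Quote column names so that spaces or keywords in the header are accepted.
        columns = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn = sqlite3.connect(db_path)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute('DROP TABLE IF EXISTS "{}"'.format(table))
                conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
                conn.executemany(
                    'INSERT INTO "{}" VALUES ({})'.format(table, placeholders),
                    # Rows whose field count differs from the header are skipped.
                    (row for row in reader if len(row) == len(header)),
                )
            (count,) = conn.execute('SELECT COUNT(*) FROM "{}"'.format(table)).fetchone()
        finally:
            conn.close()
    return count
```

Once loaded, the table can be queried with ordinary SQL, for instance a `SELECT COUNT(*) ... GROUP BY` over one of the columns. The actual column names depend on the header of the downloaded files.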

A highly complex database schema can span thousands of tables with numerous relationships among them, much like how our brain works, or how network switches connected in a mesh work, where many-to-many transactions take place continuously.

There are several ways to prepare datasets, and this tutorial presented one of them. We also saw that certain applications are not well suited to processing a large dataset. In another tutorial, we shall see how these two datasets are put to use in applications.