VOL. 7, ISSUE 3, JULY - SEPT 2014
DATA ANALYTICS
Data Analytics Toolbox
Role of Mathematics in Analytics
Let's Make Data Talk
Extracting Big Money from Small Money
Data Analysis for Warranty Claims
Analytics Applied to Real World
Colophon
TechTalk@KPIT is a quarterly journal of Science and Technology published by KPIT Technologies Limited, Pune, India.
Guest Editorial
Dr. Anjali Kshirsagar, Director, Centre for Modeling and Simulation, University of Pune, Pune, India
Chief Editor
Dr. Vinay G. Vaidya, CTO, KPIT Technologies Limited, Pune, India
Editorial and Review Committee
Aditi Sahasrabudhe, Chaitanya Rajguru, Pranjali Modak, Priti Ranadive, Shiva Ghose
Designed and Published by
Mind's Eye Communication, Pune, India. Contact: 9673005089
Suggestions and Feedback
[email protected]
Disclaimer
The individual authors are solely responsible for infringement, if any. All views expressed in the articles are those of the individual authors and neither the company nor the editorial board either agrees or disagrees. The information presented here is only for giving an overview of the topic.
For Private Circulation Only
Contents
Editorials
Guest Editorial, Dr. Anjali Kshirsagar  2
Editorial, Dr. Vinay Vaidya  3
Articles
Let's Make Data Talk, Mayurika Chatterjee  4
Data Analytics Toolbox, Ramraju Indukuri  10
Role of Mathematics in Analytics, Sushant Hingane  16
Extracting Big Money from Small Money, Shiva Ghose  22
Data Analysis for Warranty Claims, Vaishali Patil  28
Analytics Applied to Real World, Abhinav Khare  36
Book Review
Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie and Die (Eric Siegel), Aditi Sahasrabudhe  35
Scientist Profile
Thomas H. Davenport, Smitha K. P.  41
Research Publications  42
Guest Editorial
Data analytics is the science of examining and analyzing raw data for the purpose of drawing conclusions and possible predictions about the underlying information. Although "big data" is not new to the scientific community, looking at its applications in weather forecasting or the astronomical and astrophysical sciences, "big data" now seems to be a reality because of its presence in the everyday life of the common man as well. The Internet and its usage have opened up new avenues for big data analytics. The share market and oil exploration are two areas which extensively require big data analytics and are close to the heart of the common man due to their influence on everyday life. The financial sector depends heavily on data analytics tools to predict market trends, to optimize warranty and insurance claims in the real world, and to mobilize customers. As a common everyday example of the application of data analysis, I wish to quote that in some countries insurance companies levy higher insurance premiums for red colored cars, since data predictions have indicated that those who buy these cars are aggressive drivers and are more prone to accidents. One of the articles in this issue quotes the example of getting reliable information from data on the fly to keep up with customers no matter how often they change, to predict customer response to various attractive offers of the dealers, and to strike the right balance between analytics and expert judgment. Business houses rely on striking this balance for diversification of their business. Energy companies dealing with oil and gas are increasingly diverting part of their resources towards non-conventional energy, depending to a large extent on their business analytics and intelligence teams.
Analysis of data is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, drawing conclusions and hence suggesting predictions to support decision making. It converts both structured and unstructured data into useful information and finally enables one to gain knowledge. It also involves finding patterns in heterogeneous sources of data. There is a dire need for better handling of available data and for good modeling tools with a wide choice of analysis algorithms for both structured and unstructured data types. Data analysis has multiple facets; it encompasses diverse techniques and is useful in different business, science, and social science domains. It is said that big data is the fuel, while predictive analysis is the engine driven by this fuel to drive the life of society.
The volume, variety and velocity of data coming into an organization continue to reach unprecedented levels. Advanced statistical, data mining and machine learning algorithms do exist; however, their usefulness has increased manifold due to the availability of higher computational power, cheaper memory at both the processing and storage levels, and the awareness to discover and deploy the knowledge thus gained for the profit of the organization, be it at the business or individual level. Cloud computing is a boon to organizations using big data analytics. In addition, modern telecommunication tools have become cheaper, so that data analytics does not add additional cost for customers and clients.
Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Business analytics and intelligence covers data analysis that relies heavily on aggregation, focusing on business information for commercial applications. Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The term data analysis is sometimes used as a synonym for data modeling.
Theoretical models used in scientific research are often difficult to compare directly with the results of experiments, so they are used instead as input for Monte Carlo simulation to predict the response of the experimental set-up to a given theoretical event, thus producing simulated events which are then compared to experimental data. Calibration and validation of theoretical models form an important part of computational research in the physical, chemical and biological sciences. Stability of results, cross-validation and sensitivity analysis are very important for any statistical modeling tool. A deterministic approach, on the other hand, depends on setting up a model in terms of differential equations and the initial or boundary conditions. Non-linear problems involve an additional, crucial degree of dependence on the initial conditions.
Organizations are increasingly becoming analytics-driven to achieve breakthrough results and outperform their peers. However, it is important (i) to optimize all types of decisions, whether taken by individuals or embedded in automated systems, using insights based on analytics, (ii) to provide insights from all perspectives and time horizons, from historic reporting to real-time analysis with predictive modeling, and (iii) to improve business outcomes and manage risk by empowering people in all roles to explore and interact with information and deliver insights to others.
This special issue of TechTalk on Data Analytics presents articles on mathematical and statistical techniques used to deal with big data and is a timely publication from various perspectives.
Professor of Physics, Director of Centre for Modeling and Simulation, Director of Interdisciplinary School of Scientific Computing, University of Pune, Pune, India
Dr. Anjali Kshirsagar
Dr. Anjali Kshirsagar is one of the pioneers to start research in Computational Sciences in the western part of the country in the late seventies and is responsible for developing the culture of high performance computing in scientific research.
Dr. Vinay G. Vaidya, CTO, KPIT Technologies Limited, Pune, India
Editorial
Please send your feedback to: [email protected]
One Exabyte or one million terabytes is the amount of data that is generated in the world every year. This
includes data from social sites, blogs, online transactions, news, pictures, mails, commercial websites, etc.
Then comes scientific data, scientific journals, financial data, and so on. Today, we have the ability to gather as
much data as we wish in every field. Our understanding has greatly improved in all scientific fields due to the analysis of collected data, which has helped scientists interpret observations, formulate mathematical models, make predictions, and validate them for making further interpretations and predictions.
Data analytics has benefitted us in multiple ways. Today our ability to predict earthquakes and volcanic eruptions has drastically improved, making it unlikely that we will see a repetition of unpredicted eruptions such as the one that took place at Pompeii. Our understanding of cosmology has vastly improved in the past 100 years, thanks to the data we get from radio telescopes as well as the Hubble telescope. The Square Kilometer
Array (SKA), the largest radio telescope under construction, will search the sky with the ability to detect airport
radars in solar systems that are 50 light years away. The SKA will generate the same amount of data that we
generate every year in the world. The difference is the SKA will generate one Exabyte of data in just one day!
We certainly need a much better ability to process such data.
In the field of transportation we are beginning to collect more data. This data would enable us to diagnose
automotive systems better thereby helping us build robust systems. Models developed using this data would
enable us in building better tools for prognosis. Overall, data will change user experience in the transportation
industry. However, before we get to that level of accurate prognosis, we need to improve our understanding of
mathematics.
Data analytics is all about mathematics. There are a number of mathematical methods available to extract knowledge out of data. However, it is an art to use the right set of equations to get what one is looking for. The process of converting data into an appropriate mathematical model is a black art, and I do not see this process ever getting automated. Today, there are many teams working all over the world trying to come up with a model for their data sets. Not everyone succeeds. Besides, not all teams arrive at the same set of equations in the end. There are multiple paths to reach the goal of prediction. Success or failure is measured by the amount of error in prediction. There are also common pitfalls in the methodology deployed to tackle the problem.
Often I come across people who confuse deterministic and non-deterministic data sets. A deterministic phenomenon can be defined by specific equations, and one would never observe any deviation from those equations. Ohm's law is deterministic while Brownian motion is non-deterministic. Mathematical methods for deterministic phenomena are different from those for non-deterministic ones. Once an engineer told me that he was using the Monte Carlo method for determining the state of charge of a battery. This is a classic example of using the wrong method to try to find answers. Obviously the end result would be garbage in, garbage out.
In the endeavor of better understanding, we should remember that data is meaningless without correct interpretation. Interpretation is short lived without the formulation of a mathematical model. A mathematical model is mere toying with equations unless it has good predictive ability. Prediction is gambling without validation. Validation is endorsement and the final stage in the pursuit of knowledge.
Mayurika Chatterjee
Areas of Interest: Mechatronics and Control Systems
About the Author
Let's Make Data Talk
I. Introduction
“Information is the oil of the 21st century, and analytics is the combustion engine”
- Peter Sondergaard of the Gartner Group
One may wonder, why data analytics! Well, over recent years it has become easier to collect and store huge amounts of data. Data is perceived differently by different people: data for a computer engineer will be a set of numbers, but data for a medical practitioner will be some kind of medical history. The aim is to convert this huge data into information and then convert this information into knowledge. Such knowledge gives an organization the power to make educated decisions about its future. Data analytics can be applied to a wide range of domains; it can help business analysts make informed business decisions or aid medical researchers. It can also help political analysts to describe a political phenomenon or even to test theories and hypotheses about political scenarios.
Let us start with what we mean by data. According to [1], “Data are values of qualitative or quantitative variables, belonging to a set of items”. Variables are measurements or characteristics of an item.
When we obtain data, we tend to look out for
these variables for the purpose of further
analysis. Mostly, the data we collect looks like
the one shown in Fig 1.
Figure 1 : Raw Data [Ref.2]
Raw data is not pretty! It needs processing and thorough understanding in order to, first, make sense of it, and second, be able to do something with it, e.g. predict future trends.
Next in line is what this article is about: 'Data Analytics'. In the simplest language, it is about understanding what is going on. There can be two possibilities: one, that we are in a situation where we do not have enough data and we have to go through different sources such as books and the web in order to find information; and two, that we are overwhelmed with a huge amount of data and need to figure out a way to extract meaningful information from it. Note that in both cases one thing is common: we are trying to find the answer to a question. This is one of the important aspects of data analytics, which will become clear as you go through the article.
This article introduces the term data analytics, its various categories and the basic steps involved in the process.
II. Types of Analytics
Before coming to the types of analytics, let us first learn a few related terms. The most commonly used term in the business domain is business intelligence. It comprises the different tools and techniques that can be applied to analyze data and help organizations make better business decisions [3]. Business intelligence can be further subcategorized into data analytics and data mining. Data mining is about looking for patterns in data stored in databases in order to discover knowledge. Data analytics does not care about databases, but concentrates on the specific patterns in the data as part of enhancing knowledge [4].
In this article, we are concentrating on the analytics part of it. There are some basic categories of data analytics; let us take an example and try to understand the differences amongst them.
Let us say that my job is to find out the demand for vehicle 'X' among Indian people. I need to analyze the data gathered across the country to find out the sales of vehicle 'X'. After going through some reports, I find out that Pune city recorded the maximum sales of vehicle 'X' in the year 2013. This type of analysis, where we examine past occasions, is known as 'Descriptive analytics'.
Figure 2: LED screen with dots representing percentage sale
Figure 3: LED screen showing delivery time at respective cities
The next step is what many people in industry term as data analytics: 'Predictive analysis'. It predicts probable future outcomes. We can create a model involving all the parameters that influence the sales of a car, apply the data to the model, and determine various relationships among them. We can also use the LED screen to play around with some parameters and check the influence of each on the predicted outcome. For example, the per capita income of people in city Y will influence the sale of vehicle X, depending on the number of people who can afford it. All such predictions are probabilistic in nature and cannot tell for sure what will happen, but will only tell what might happen based on the data gathered.
This takes us to the last stage, where we integrate our predictive model with real-time data to get the desired results. This is known as 'Prescriptive analysis'. It is basically one step ahead of predictive analysis where, as the name suggests, one or more actions are prescribed to the decision makers, showing them the likely outcome of each decision. It also includes a feedback mechanism which tracks the output of the action taken and re-prescribes an improved decision. In the example above, the output of prescriptive analytics would suggest opening more service stations and improving infrastructure, which will in turn help increase sales of the vehicle.
III. Steps in data analytics
1. Define objective
The first and foremost step is to define the objective, or pose the question that we would like to get an answer to. For most of us, when we talk about data analysis, we think that the data is most important, but the truth is, it is secondary. The data might influence the question, but if we are not clear on the objective, even a vast amount of data won't be able to help us.
2. Define the ideal data set
This step gives clarity on the variables to be measured. For example, if we want to determine whether there is a relationship between the height and weight of the students in a class, it is sensible to measure the quantities using a scale (a weighing scale or a length scale). However, if you want to find some correlation between qualitative variables, for example relating the confidence of students with their academic success, what scale would you use? Compare marks vs. IQ scores? For this particular variable, you might need to conduct a survey among various teachers and find out each student's confidence level subjectively. As Albert Einstein famously said, “Not everything that can be counted counts and not everything that counts can be counted”. Thus, we should know beforehand what kind of data would be necessary for our purpose.
3. Obtain data
Once it is decided what types of data you want for your analysis, the next step is to acquire the data from the available sources. The data gathering can be done in various manners. It could be through quantitative sensors such as accelerometers, pressure sensors, etc., subjective analysis such as the ride comfort of a vehicle, data from online transactions, data in government records, and so on. In many cases, the data obtained could be incomplete, raw and unstructured. Examples of such data are missing records, incomplete forms, data captured using various sensors at different locations, etc. We need to organize it to make sense of it. This constitutes the next step.
Now suppose I have an LED screen in my
office and I have red dots to represent the
percentage sale of vehicle 'X' across the
country. Also, I can see the delivery time of the
vehicle X at respective cities. If I compare
these two, in an instant I can find out
correlation between the percentage of sale
and delivery time of the vehicle X. This
category of analytics is termed 'Exploratory analysis'. It mostly consists of graphical representation of the data and involves finding correlations between various parameters for decision making. Figures 2 and 3 help in visualizing the above example of the LED screen scenario.
Figure 4: Steps involved in Cleaning of data Ref. [5]
5. Analyze
The next step is to explore the data: create graphs to understand it, create clusters of similar data, and so on. The economist Ronald Coase said, “If you torture the data long enough, it will confess”. Analyzing data is of course a science, but it is an art as well. There can be loopholes in your data, and you might get distracted by other interesting insights. It is essential that one understands the purpose of the analytics in order to get the answers. You
Figure 5: Some analysis techniques Ref. [6]
6. Mathematical Modeling
This is the main step wherein the domain
expertise comes into play. A model of the
system is created which depicts the
relationship amongst variables. This model
can then be used to predict the future
variables, which will be useful to make
recommendations. This is the part where
techniques such as machine learning,
clustering of data etc. come into the picture. Mostly, this part comes under data analytics;
the previous steps can be regarded as part of
data analysis. Let us see an example of a
model that will predict the sale of a vehicle
based on different parameters across
geography.
Figure 6: Decision making model
4. Cleaning of data
Tidying up the data is a major task in order to
successfully and correctly analyze the data.
We might have multiple data sources and we
might need to augment the data collected
based on these sources. Even with the best analysis techniques, erroneous or junk data can pull the analysis in the wrong direction and subsequently produce misleading results. In order to yield value out of the data, it is essential that the data that matters is retained. The various steps involved in data cleaning are shown in figure 4. If the data is gathered from multiple types of resources, they need to be merged. If some of the data is missing, it has to be built up by mathematical techniques like interpolation, extrapolation, etc. The data also needs to be normalized and duplications need to be removed. Once the data is ready, it can be used further for analytics purposes.
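As a small illustration of these cleaning operations, here is a minimal R sketch; the data frames source_a and source_b and the column 'units' are hypothetical names used only for illustration.
sales <- rbind(source_a, source_b)               # merge data gathered from two sources
sales <- sales[!duplicated(sales), ]             # remove duplicate records
gaps  <- is.na(sales$units)                      # locate the missing values
sales$units[gaps] <- approx(which(!gaps), sales$units[!gaps], which(gaps))$y   # fill them by linear interpolation
sales$units <- scale(sales$units)                # normalize the cleaned column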
need to be able to realize when you might
require more data points to support or discard
your analysis. In fact, you may never know
when you may discover some very
unexpected and useful insights during your
course of analysis. Various tools and techniques can be used to analyze the data at hand. Initial analysis can be performed using techniques such as multivariate analysis, linear regression, etc. Some analysis techniques are shown in figure 5. We will discuss some of these techniques in the next couple of articles.
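As a small taste of such initial analysis, a few lines of R already go a long way; the data frame 'vehicle_data' and its columns are hypothetical names assumed for illustration.
summary(vehicle_data)                                  # basic descriptive statistics per column
pairs(vehicle_data)                                    # scatter plots of every pair of variables
cor(vehicle_data$delivery_time, vehicle_data$sales)    # correlation between two parameters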
The parameters that influence vehicle X sales can be current trends, the infrastructure available and the economic situation of the region. Another factor that might influence the decision of the customer is government regulation; for example, if the government decides that electric or hybrid car buyers will get some tax benefits, people might opt for those. Thus, by taking these factors into consideration, we can generate a model of the system based on the gathered knowledge using techniques such as linear regression, MANOVA, etc. Then we try to get answers to our questions, in this case, figuring out how each parameter will affect the sales of vehicle X.
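A minimal sketch of how such a model could be fitted in R with linear regression; the data frame 'regions' and its columns (per_capita_income, infrastructure_index, tax_benefit, sales) are hypothetical names, not from the article.
model <- lm(sales ~ per_capita_income + infrastructure_index + tax_benefit, data = regions)
summary(model)                            # how strongly each parameter affects sales
predict(model, newdata = new_region)      # expected sales for a new, hypothetical region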
7. Interpret and optimize results recursively
Once you get the results, you must be able to find the difference between what you expected from the data and what you see in the results. The results might be true for one set of
V. References
1. “Data”, Wikipedia, available at http://en.wikipedia.org/wiki/Data
2. Image available at http://www.ncon.com/images/rawdata.gif
3. Fari Payandeh, “BI vs. Big Data vs. Data Analytics By Example”, article in Foreground Analytics—Big Data Studio
4. Coursera course on “Core Concepts in Data Analysis”
5. Image available at http://www.datacleansing.net.au/Images/Data_Cleansing_Cycle_350px.jpg
6. Image available at http://www.rnbresearch.com/gifs/data-analysis-services.gif
data but they might vary in the other; it is essential that the variables are optimized regularly based on the most recent data.
IV. Conclusion
Data analytics is about being able to push through the many difficulties we face when dealing with large or messy data: collecting the data, cleaning it up and coming up with various analysis techniques that extract new information from the data. It is a quest to discover new insights from past data. It is a combination of functional expertise, traditional research, knowledge of mathematics and statistics, and machine learning. It also requires a knack to explore different sources and figure out the answer by any means possible. In conclusion, data analytics acts like the eyes and ears of an organization that wishes to support and enhance its decision-making ability using data.
Humor and Statistics
If you live to be one hundred, you've done a wonderful job. Very few people die past that age.
Two statisticians were traveling in an airplane from New York to LA. After an hour, the pilot announced that they had lost an engine, but there were three left; however, instead of 5 hours it would take 7 hours to get there. A little later, he announced that a second engine had failed and they still had two left, but it would take 10 hours to get there. Somewhat later, the pilot again came on the intercom and announced that a third engine had died. Never fear, he announced, because the plane could fly on a single engine. However, it would now take 18 hours to get there. At this point, one statistician turned to the other and said, “Gee, I hope we don't lose that last engine, or we'll be up here forever!”
Statistics plays an important role in genetics. For instance, statisticians can prove that the number of offspring is an inherited trait. If your parents didn't have any kids, odds are that you won't have any either.
It is proven that the celebration of birthdays is healthy. Statistics shows that those people who celebrate the most
birthdays become the oldest.
One day there was a fire in a wastebasket in the office of the Director of Sciences. A chemist, a physicist and a
statistician rushed in. The chemist works on which chemical agent would have to be added to the fire to prevent
oxidation. The physicist also starts to work on how much energy would have to be removed from the fire to stop the
combustion. While they are doing this, the statistician starts setting all other wastebaskets on fire. "What are you
doing?" others ask. The statistician replies, "Well, you definitely need a larger sample size to solve the problem."
Ramaraju Indukuri
Areas of Interest: Programming Languages and Data Mining
About the Author
Data Analytics Toolbox
I. Introduction
Quick googling will give you a whole lot of information on different analytics platforms. Hence, instead of going into specific tools and comparisons, let us discuss what analytics is and use the most popular open source analytics tool, R, to demonstrate some of the key elements of analytics.
To briefly mention, the most popular analytics tools are MATLAB, SAS, SPSS, and R. These are mostly used as desktop packages by customers, with their capacity limited by RAM, though the vendors have been coming up with server versions with features that make them linearly scalable (add more computers to improve performance). These packages are essentially mathematical or statistical packages and are mostly used by engineers or statisticians.
However, lately, with the explosion of big data, platforms like Mahout on Hadoop have been increasing in popularity. These require a fair amount of programming skill and an elaborate setup, which is a big barrier for the traditional engineering and statistical community. The term data scientist, in my view, came up because these packages require not just statistics and machine learning skills, but also an understanding of big data platforms and domain knowledge.
A quick note on big data platforms: they free companies from having to sample data and depend heavily on wide confidence intervals. One customer from a large bank in the US, who manages its target marketing team, mentioned that their accuracy improved by 25% when they moved from standard statistical analysis to machine learning on big data platforms. It is noteworthy that she also said that, as a business user, she was highly skeptical about big data platforms in the beginning. However, once she saw the results, she became a big supporter of their Hadoop project.
As mentioned earlier, the skills and effort required to do analytics using big data are significant. The current Hadoop platform requires elaborate programming to accomplish the same thing that an R package can do in a few lines of code. It is expected that over the next few years there will be significant improvement in programming frameworks on big data that will enable customers to perform analytics much more quickly. One noteworthy mention in this direction is "Spark", which uses the "Scala" programming language to simplify MapReduce code (a programming model that parallelizes computational tasks over several servers), though it introduces a new programming language which traditional customers may not want to adopt given the scarcity of people.
R is an open source package that has been taking over the field of data analytics. Primary reasons are its ease of use, rich packages that abstract users away from underlying mathematical complexity and its zero cost. The key downside for R is its dependence on RAM of the machine, which restricts the size of data set it can handle. However, there are several packages developed to parallelize R and leverage linearly scalable platforms like Hadoop.
Before we jump into the problems customers solve using analytics, let us understand the meaning of analytics. There are three levels in data analytics: Data Visualization, Statistical Analysis and Machine Learning.
II. Data Visualization
Using charts and smart graphics, visually present the data so that a human eye can recognize patterns in it. Most of the time, visualization will guide us to conduct deeper analysis. Most companies have heavily invested in data visualization using their Business Intelligence systems and use these systems to gain understanding of their corporate data. As you all know, bar graphs in Excel are the most widely used 'analytics', done by most of us on a daily basis.
However, with the advent of various tools and technologies, visualization has advanced a lot since the bar chart. Word clouds (see figure 1), interactive network diagrams (A diagram with network of nodes and edges which depict relationships, for example a social network), heat maps (A diagram that shows area density depicted in shades of colors) have become more popular and are getting into mainstream. There is more and more emphasis by tool vendors in incorporating infographics into their packages.
Figure 1: Data cloud
Figure 1 illustrates a word cloud of President
Obama's 2009 speech. It quickly shows the
words he used frequently in his speech. This
can be accomplished using following script in
R.
library(wordcloud)   # the wordcloud package provides wordcloud()
wordcloud(INPUTTEXT, min.freq = WORD_FREQUENCY, random.order = FALSE)
As you can see, building a word cloud is fairly
simple, since R has a pre-built package. This
is true for most algorithms.
The key challenge that you will quickly notice
when you try to analyze data is the data
preparation stage. It is a well-known fact that
80% of the effort in analytics is to massage the
data in a format acceptable to whichever tool
you use. Vendors like Oracle have integrated
their data mining tools with the Extract,
Transform, Load (ETL) tools, to simplify the
process of preparing the data. However, the
challenge is that the data scientists (or R
programmers) typically are not interested in
deploying another tool set and tend to use the
packages they are already conversant with.
This is a large opportunity for IT companies in
the coming years to help customers prepare
the data that data scientists can easily
process.
There are several other data visualization platforms. Within R, the ggplot package is the most popular for its versatility and ease of use in generating all standard statistical plots. The beauty of ggplot is its ability to generate a graph over several layers, which significantly de-clutters the process of developing charts.
Figure 2: Example of graph generated by ggplot
The graph shown in figure 2 can be produced by a simple layered statement, where 'stateincomes' is a table of states and their incomes along with the latitude and longitude values of each state.
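The original statement did not survive reproduction here; a minimal ggplot2 sketch of what such a layered statement might look like, assuming 'stateincomes' has columns longitude, latitude and income, is:
library(ggplot2)
ggplot(stateincomes, aes(x = longitude, y = latitude)) +   # first layer: the coordinates
  geom_point(aes(size = income, colour = income)) +        # second layer: points sized and coloured by income
  labs(title = "State incomes")                            # third layer: labels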
Figure 3. Example of Document Object Model
Visualization will not be complete unless we touch upon d3.js, the latest toolkit in the area of data visualization. Companies like Datameer have created stunning visualizations on top of big data platforms using d3.js. d3's power lies in a combination of simplified use of the DOM (Document Object Model), Scalable Vector Graphics and CSS (Cascading Style Sheets).
The Document Object Model is a model that represents structured documents like HTML and XML (as shown in figure 3). It lets developers access the elements within a document such as an HTML page. For example, in the JavaScript included in the page, to change the color of the text of an element of class var1, you can use the DOM in the following way:
document.getElementsByClassName('var1')[0].style.color = "#0000ff";
Whereas if you include the d3.js library in the head, you can do the same as follows:
d3.select('.var1').style('color', 'blue');
As you can see, d3 makes navigating the HTML document much more elegant and simple. Now, SVGs, or Scalable Vector Graphics, are diagrams that can be drawn with JavaScript and do not require GIF or bitmap files. This allows developers to create nice graphics without having to tax the network with picture downloads. For example, to create a circle, one can use the following:
d3.select('.var1').append('svg').append('circle').attr({cx: 10, cy: 10, r: 10, fill: 'green'});
D3 has functionality that enables reading data
(JSONs or arrays) and using it to generate
charts and graphs that can be impactful for the
user, without compromising on system
performance.
III. Statistical Analysis
Companies have been using statistics to sample data and derive meaningful conclusions for ages. Packages like SAS and SPSS have functionality that simplifies statistical analysis for business users and engineers. While the most popular statistical analysis tool is Excel, there is a dizzying array of tools available for industry to use.
To illustrate what kind of analysis can be done using statistics, take the example of a car failing due to a brake pad problem. The graph in figure 4 calculates and illustrates the survival probability over time for three different models of engines. (Note that the curve stabilizes at about a 65% survival rate, not because there are no more failures after 720 days, but because the warranty has expired after two years, after which the car manufacturer will not entertain any claims, or because that is the current date.)
Figure 4: Graph generated in R using Shiny
(A Web framework on R).
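A hedged sketch of how such a survival curve could be produced in R with the survival package; the data frame 'claims' and its columns (days_in_service, failed, engine_model) are hypothetical names for illustration.
library(survival)
fit <- survfit(Surv(days_in_service, failed) ~ engine_model, data = claims)   # one survival curve per engine model
plot(fit, xlab = "Days in service", ylab = "Survival probability")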
Statistical analysis is used by different users in
different ways. For example, Oracle Demantra
has several statistical algorithms built into its
demand forecasting functionality. SAS has
Survival (Weibull models) analysis available
for warranty and reliability groups.
Statistical analysis tools can be classified as
shown in figure 5:
Figure 5: Classification of Statistical Tools
As you can see, the options for statistical
analysis and programming are huge. It is very
common that companies invest in more than
one tool to do statistical analysis, since
different tools fit different applications. For
example, SAS promises special features like
Weibull analysis and survival curves in the
area of warranty and is widely used for it.
Whereas SPSS is focused on simpler, user-friendly analysis that typically sits on top of existing application databases.
A notable mention is R. Companies can benefit from the significant number of R-trained programmers and statisticians graduating from universities, and from its collaborative developer community that keeps adding newer packages.
IV. Machine Learning
As per Wikipedia, machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data. In simple terms, based on the available historic data, machine learning algorithms can build models which can be used to predict an outcome.
There are two kinds of machine learning, supervised and unsupervised. For example, if you provide a set of historic customer financial data and 'tag' those who defaulted, a machine learning model can be built which can be applied to a new set of customers to determine whether they are going to default. This is supervised learning. In unsupervised learning, you do not tag, but the algorithm automatically learns and classifies. For example, if I provide customer data, the algorithm can classify it into a set of clusters, say urban young males vs. rural baby boomers etc., which you can interpret after seeing the results.
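A minimal unsupervised-learning sketch in R; the data frame 'customers' and its age and income columns are hypothetical names assumed for illustration.
clusters <- kmeans(scale(customers[, c("age", "income")]), centers = 3)   # find three customer segments
table(clusters$cluster)                                                   # how many customers fall in each segment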
Most of the statistical analysis packages come with machine learning features as well. But there are a few pure-play packages, a notable one being RapidMiner. The advantage of specialist machine learning tools is that they provide an easy way to cleanse and prepare the data so that the machine learning algorithms can readily accept it. The specialist tools can also help determine accuracy with ease and thus help users improve the framing of the problem.
The other notable platform for machine learning is Rattle, a visual GUI that runs on top of R and is written in R. It provides a basic but quick way to explore smaller datasets and build models.
However, these tools usually do a decent job until the data reaches a certain size, beyond which they are not scalable. Apache Mahout, an open source big data machine learning module, can scale the complex algorithms that cover the most commonly used situations, both supervised and unsupervised, with linear scalability, i.e. practically unlimited data capacity. Companies can use tools like R and RapidMiner to prove the algorithms and then deploy them onto big data platforms as an ongoing standard solution. Companies have already started using PMML (Predictive Model Markup Language) to create redeployable models, and these are gaining in popularity. (Visit http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language for more details.)
V. Other Notables to Watch
“Spark” is a component of UC Berkeley's big data stack which is rapidly gaining popularity. Its power lies in leveraging memory (using the concept of the RDD) instead of depending on the hard disk. An RDD, or resilient distributed dataset, is a memory abstraction provided by Spark, wherein a large amount of data is stored in memory across multiple servers and can even spill over to disk if there is insufficient RAM. This makes Spark particularly suitable for iterative algorithms and interactive data. Spark has proven to be multiple times faster than standard Hadoop implementations and yet can run on commodity hardware and provide linear scalability. Another advantage of Spark is that it provides a consistent programming platform using Scala, Java and Python, though Scala fits best. Spark can be deployed on top of Hadoop, leveraging Hadoop's distributed file system. Spark provides machine learning and streaming functionality that can serve different use cases. The major challenge for traditional companies is skill set availability.
“Mesos” is a resource manager, again built at UC Berkeley, that can distribute data processing work among the servers in clusters running multiple frameworks. Imagine a company running racks of Hadoop or Cassandra as well as Spark and MPI. Before Mesos, companies had to invest in separate racks of servers for each of these frameworks (though it is commodity hardware and is multiple times cheaper than traditional HP, Oracle or IBM servers). With Mesos, companies can combine all the clusters together and use them in a fine-grained manner (at the CPU core and RAM level) based on processing loads.
Figure 6: Mesos along with other architectures
Mesos does this by providing an application programming interface (API) to integrate frameworks and by implementing a scheduler on the frameworks. Mesos slaves installed on all the servers talk to the master and 'offer' CPUs and memory, to which Mesos responds by allocating tasks to them after talking to the frameworks and getting the tasks from them. As you can imagine, this optimizes the utilization of servers tremendously and provides high fault tolerance. Hadoop 2 (YARN) has implemented this mechanism. Mesos has the special advantage of easy integration, with multiple frameworks already adopting it.
VI. Conclusion
The area of analytics is fast evolving. You will find that most of these tools may change or evolve in the next 2 to 5 years. However, what is constant is the skill required to understand and use the data; this is going to remain the same. The good news is that new platform vendors are coming out with tools that are easy to use, provide support for big data and provide machine learning capabilities. Hence, customers should consider these new innovative toolsets instead of going with established providers. KPIT can help you with making these choices and help support those platforms.
VII. Bibliography
[1] http://cran.r-project.org/
[2] http://d3js.org/
[3] http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
[4] https://spark.apache.org/docs/latest/index.html
[5] http://rapidminer.com/
[6] http://mesos.apache.org/
[7] http://www.datameer.com
Sushant Hingane
Areas of Interest: Mathematical Modeling and Simulation, Control Systems
About the Author
Role of Mathematics in Analytics
II. Categories of Mathematical Models
A. Linear vs. Nonlinear
In simplified words, if, for a system, all the objective functions and the constraints can be expressed in terms of linear equations, it is considered a linear model; otherwise, it is a nonlinear model. The linear model can be written as Y = XB + U, where Y is the system output; X is the design matrix or the system input, which is the independent parameter; U is the matrix containing errors or noise; and B is the matrix containing the internal parameters of the system. Once the B matrix is calculated or estimated, one can interpolate missing data between two data instances.
Figure 1 : Straight line fitting [2]
For 2-dimensional data, the first step is to plot the data points on the X-Y plane. This gives a fair idea of whether the system is linear or nonlinear; it is the simplest method to figure out whether a straight line or a curve could fit the points.
Polynomial fitting is a very common and simplistic type of data analysis. The objective is to identify the degree or the order of the system equation and then find the system parameters. The image below shows general patterns of some data plots in 2 dimensions. (Courtesy [3])
Radioactive decay of an element can be expressed as an exponential model, N = N_o e^(-λt), where λ is the decay constant, which can be calculated using several data points. N_o and N are the initial radioactive mass and the mass at time t, respectively. The half-life of a radioactive element, in terms of λ, can be predicted by equating N to half of the initial mass N_o.
B. Continuous vs. Discrete
A continuous model allows the system state to change at any time, unlike a discrete system, where the states change at specific intervals. The important analysis in this type of modeling is time series analysis, where data is constantly received at a fixed time interval. The purpose is to yield meaningful statistics and other characteristics of the data, as well as time series forecasting to predict future values.
Time series models can be classified into three broad categories;
a) Autoregressive model (AR) where the value of the output variable linearly depends on its previous value.
b) Integrated models (I) that include a differencing operation using a back-shift
I. Introduction
A mathematical model is a representation of any real world problem or system in a simplified or abstracted equation form, in order to analyse the traces and behaviour of the system and to predict future values. This turns out to be a blessed miracle when we have huge chunks of data, or 'big data', going in and out of a system. The challenge lies in finding the exact formula for a better correlation, or a minimal-error fit for the concerned data. Big data could be noisy, unstructured, dynamic, or even incomplete. A model should make use of the data in the form of multidimensional vectors, time series of samples, or huge matrices. Considering this, sometimes it is wiser to use ready-made models that have been proven to fit instead of scratching your head all day! The right choice of model for the data is very important, since there can potentially be many models that would fit the data. However, depending on the data availability and error tolerance level, one can make an informed decision. In this article we are going to see how.
Modeling of a given system can in general be represented as Y = f(X, B), where Y is the output, X is the independent variable and B is the system parameter. Most of the time, the mathematical modeling process boils down to estimating the value(s) of B. Let us discuss some of the techniques that are used in real world scenarios.
C. Probabilistic (Stochastic) vs. Deterministic
Deterministic models describe system dynamics or system evolution with no randomness involved. Every set of state variables can be uniquely determined by the model parameters and the previous states. Stochastic processes, on the other hand, take care of the randomness of the system using
One of the examples of a discrete system is the data coming from a sensor with digital output. An accelerometer, for example, that gives acceleration values (a) with a sampling period of 1 s can be used to estimate the velocity (v) and the distance (d) as well. Using the differential equations dv/dt = a and dd/dt = v, the formulation for velocity and distance in discrete time becomes
v_current = a_current * t_sampling_interval + v_previous
d_current = v_current * t_sampling_interval + d_previous
Time (1s sampling) Acceleration Velocity Distance
0 0 0 0
1 2 2 2
2 1 3 5
3 1 4 9
4 0 4 13
5 0 4 17
6 -1 3 20
7 0 3 23
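The same table can be reproduced in a couple of lines of R, since with a 1 s sampling interval the recurrences above are just cumulative sums:
a <- c(0, 2, 1, 1, 0, 0, -1, 0)   # sampled acceleration values
v <- cumsum(a)                    # velocity:  0 2 3 4 4 4 3 3
d <- cumsum(v)                    # distance:  0 2 5 9 13 17 20 23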
A special condition, where the mean or the expected value µ = 0 and the standard deviation σ = 1, is called the standard normal distribution.
Figure 2 : Normal distribution
An interesting example of the normal distribution (also known as the bell curve because of its shape) is the Quincunx machine, also known as the bean machine or Galton box. When several balls are dropped into the machine with several pins as obstructions, the collection of balls at the bottom approximates a normal distribution [5].
operator to eliminate the initial 'non-stationarity' of the data. A stationary time series is the one whose properties do not depend on the time at which the series is observed. E.g. the white noise.
c) The moving average models (MA) which as the name suggests, keeps on averaging a series of data.
Various combinations of the above models produce the models[4]such as, Auto regression integrated moving average process (ARIMA), Seasonal ARIMA process, Fractional ARIMA process (FARIMA), etc.
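In R, for example, such a combined model can be fitted and used for forecasting with the built-in arima() function; x below is assumed to be a hypothetical numeric time series.
fit <- arima(x, order = c(1, 1, 1))   # ARIMA(1,1,1): AR, differencing and MA parts
predict(fit, n.ahead = 10)            # forecast the next 10 values with standard errors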
probabilistic approach. We will put more focus on the probabilistic models in this article. Let us discuss some of the concepts involved in stochastic system modeling process.
Probability Distribution Function
The probability distribution function of a discrete random variable is the set of probabilities associated with each of its possible values. There are some practical distribution functions such as the normal or Gaussian distribution, the binomial distribution, the Poisson distribution, etc. Let's take the example of the normal or Gaussian distribution, which tells us the probability of a real observation falling within certain limits. The standard deviation σ and the mean or expectation µ contribute to the distribution function as
f(x) = (1 / (σ √(2π))) exp(-(x - µ)² / (2σ²))
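For instance, the probability that an observation from a normal distribution falls within given limits can be read off directly in R (µ = 5 and σ = 1 are purely illustrative values):
pnorm(6, mean = 5, sd = 1) - pnorm(4, mean = 5, sd = 1)   # P(4 < X < 6), roughly 0.683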
Markov Chain
A Markov chain is a discrete process system which is random and memoryless. The transition from one state to the other depends solely on the current state and not on the previous trail of state transitions. Markov chain models prove to be a perfect fit for share market scenarios.
Figure 3 : Markov chain process
In figure 3, the numbers indicate the
probabilities associated with every possible
state transition from one state to the other.
Thus, the state transition matrix can be written down from these probabilities.
Notice that the sum of row elements is equal to
1.0 since it shows the distribution. With the
help of this matrix, one can estimate the
system state from a given initial state. In
simpler terms, in the conditions where we
know the initial state of the system and we
know the number of state transitions that have
occurred, we can guess the current state of
the system with a certain probability
associated with it.
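A small R sketch of this idea, with an illustrative two-state transition matrix (the probabilities are made up; each row sums to 1):
P  <- matrix(c(0.7, 0.3,
               0.4, 0.6), nrow = 2, byrow = TRUE)   # transition probabilities
s0 <- c(1, 0)                                       # start in state 1 with certainty
s3 <- s0 %*% P %*% P %*% P                          # state distribution after three transitions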
Monte-Carlo Simulation
Monte-Carlo simulation is mostly used to calculate the impact of risks and uncertainties in various forecasting models, such as financial, project management, cost, etc. This simulation can tell us how likely the results are, depending on the ranges of estimation. Monte-Carlo simulation comprises several modeling steps. The first is to select the estimation range or domain. The second is to randomly generate data points using probability distribution functions and perform a deterministic computation on them.
In 1908, an Englishman named William Sealy Gosset, under the pseudonym 'Student' (to hide his employment at the Guinness brewery), developed a statistical method called Student's t-test to compare two sets of independently collected data. The t-distribution is a family of curves which, as the number of degrees of freedom increases (the number of samples minus one), approximates the standard normal distribution. This method helps in considering or ruling out the null hypothesis (that is, whether the measured difference is only 'by chance') [6].
Any board game that is played with dice, e.g.
snakes and ladders, is a perfect example of
Markov chain process since the next state is
chosen randomly (with a probability
associated to it) and does not depend on any
previous states.
Google's PageRank algorithm (image courtesy [7]) is also a good example of a Markov chain; it uses a random surfer model to rank the web pages for a user's search request. In general, the PageRank value for any given page u can be expressed as
PR(u) = Σ over v in B_u of PR(v) / L(v),
where B_u is the set of all pages linking to u and L(v) is the number of links from page v.
To simplify the Monte-Carlo method, let us take an example, called the raindrop experiment [8], of computing the value of π with this method. In this, we draw a circle inscribed in a unit square and let raindrops fall freely on it. Green dots are the raindrops inside the circle and red ones are outside the circle but inside the unit square. So, with multiple data points, a Monte-Carlo simulation can be run to find the approximate value of π as
π ≈ 4 × (number of drops inside the circle) / (total number of drops).
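A minimal R version of the raindrop experiment (the number of drops is illustrative):
set.seed(1)
n <- 100000
x <- runif(n); y <- runif(n)                    # raindrops falling on the unit square
inside <- (x - 0.5)^2 + (y - 0.5)^2 <= 0.25     # drops landing inside the inscribed circle
4 * mean(inside)                                # approximates pi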
Figure 4 shows the basic principle behind Monte-Carlo simulation. X1, X2, X3 are the uncertainty models which are randomly generated through probability distributions. The goal is to determine how random variations, lack of knowledge or error affect the sensitivity, performance or reliability of a system. The outcomes Y1, Y2 can be represented as probability distributions, which can also be converted into error bars, reliability predictions, tolerance zones and confidence intervals.
Figure 4 : Monte-Carlo simulation [9]
III. Predicting the Future
Regression Analysis
In regression analysis [10], the output or dependent variable is calculated as a linear combination of the system parameters, Y = β0 + β1·X + e. As we have seen in section I, we try to establish a relation between the measured system output Y and the system input X. β0 and β1 are the parts of the B matrix and will be calculated or estimated through experimentation. Let us assume that b0 and b1 are the estimated parameters calculated using least square approximation, and that the error in calculating the output for the i-th observation is e_i = y_i - (b0 + b1·x_i). We pick the values of b0 and b1 (and so forth) that minimize the sum of the squared errors Σ e_i². Linear regression is a very effective tool for data interpolation, which will give a future estimate within the said error tolerance band.
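A hedged sketch of this in R, assuming a hypothetical data frame 'measurements' with input x and measured output y:
fit <- lm(y ~ x, data = measurements)           # least squares estimates of beta0 and beta1
coef(fit)                                       # the estimated parameters
predict(fit, newdata = data.frame(x = 7.5))     # interpolated estimate for a new input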
Kalman Filter
One of the best examples of data estimation is the Kalman filter [11]. The Kalman filter finds applications in missile tracking, navigation, economics, sensor data fusion, etc. Basically, this technique is used to estimate the unknown states of a system with minimum uncertainty. The word 'filter' suggests that it takes noisy input but does not let the state estimation be unduly affected by it. The technique predicts the unknown parameters of the system with better accuracy: it takes a series of measurements with modeled noise and then recursively takes a weighted average of the predictions.
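The filter equations themselves are not reproduced here; the following is only a minimal scalar sketch in R, assuming the unknown state is (nearly) constant, with illustrative process and measurement noise variances q and r.
kalman1d <- function(z, q = 1e-4, r = 0.25, x0 = 0, p0 = 1) {
  x <- x0; p <- p0; est <- numeric(length(z))
  for (k in seq_along(z)) {
    p <- p + q                  # predict: uncertainty grows by the process noise
    K <- p / (p + r)            # Kalman gain: weight given to the new measurement
    x <- x + K * (z[k] - x)     # update the estimate with the weighted residual
    p <- (1 - K) * p            # update the uncertainty
    est[k] <- x
  }
  est
}
filtered <- kalman1d(noisy_measurements)   # noisy_measurements is a hypothetical numeric vector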
Figure 5 : Kalman filter
IV. Conclusion
Mathematics is a powerful tool to subdue the quantitative and probabilistic components in any data analysis and forecasting problem. We can, with some level of confidence, not be intimidated by the buzzwords such as 'data analytics' or 'big data' and only focus on the essence of them to serve the objective. However, it remains a challenge for the designer to estimate the system behavior with a minimum estimation error. Using currently available techniques, one can foresee the future with some “likelihood” associated with it.
Monte-Carlo simulation models are extensively used in financial data modeling in order to evaluate and analyze financial instruments, portfolios and investments. This helps in risk evaluation, portfolio optimization and financial planning. The method is used to simulate the various sources of uncertainty that affect the portfolio in question, using randomly generated uncertainties (mostly from probability distributions), and to calculate a representative outcome of the analysis.
References[1] Article: “Examples of Financial Applications”
Available: “http://www.lancs.ac.uk/~jamest/Group/finance1.html”
[2] “Linear regression”, Wikipedia,
Available: “http://en.wikipedia.org/wiki/Regression_analysis”
[3] Lecture notes, “Numerical Methods: curve fitting techniques”,
Available: “http://kobus.ca/seminars/ugrad/NM5_curve_s02.pdf”
[4] Article “Introduction to Time Series Analysis”, Gurley
Available: “http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm”
[5] Wikipedia “Bean machine”
Available: “http://en.wikipedia.org/wiki/Bean_machine”
[6] Britannica encyclopedia “Student's t-test”
Available: “http://www.britannica.com/EBchecked/topic/569907/Students-t-test”
[7] Wikipedia, “PageRank”. Available: http://en.wikipedia.org/wiki/PageRank
[8] Lecture notes, “An introduction to Monte Carlo methods”, Ioana A. Cosma
and Ludger Evers, Available: “http://users.aims.ac.za/~ioana/notes-ch2.pdf”
[9] Article, “Monte-Carlo Simulation basics”,
Available: “http://www.vertex42.com/ExcelArticles/mc/MonteCarloSimulation.html”
[10] Wikipedia, “Regression analysis”
Available: “http://en.wikipedia.org/wiki/Regression_analysis”
[11] “Kalman Filter”, Wikipedia
Available: “http://en.wikipedia.org/wiki/Kalman_filter”
Shiva Ghose
Areas of Interest: Control Systems, Machine Learning and Mechatronics
About the Author
Extracting Big Money from Small Money: A Look at Data Analysis for Micro-Financial Systems
Abstract
I. Small lending and big ideas
There has been a lot of debate over whether
microfinance institutions are the poverty
eradicating panacea they claim to be.
Microfinance aims at providing economic
stimulus for rural communities by giving them
access to financial tools and resources. By
doing so, they fill the gaps that mainstream
financial institutions have traditionally left
behind, however, they deal with risks that are
extremely hard to quantify through
conventional methods. This complexity
inadvertently reduces their efficacy in helping
the poor.
This article addresses risk analysis in
microfinance systems – who are the key
players, why do they want to analyze risk, and
how do they do it? Starting with a little
background on microfinance, we move on to
how upcoming microfinance institutions are
driving a new wave of big data analysis in
order to reach out to the masses and provide
monetary stimulation. This topic is particularly
interesting given the hit that major financial
institutions took during the global economic
slowdown of 2009, the semi-resurgence that
we are experiencing now, and the fact that
data analysis has never been more reachable
to masses. That being said, it is important to
remember that data analysis alone is not a
substitute for cautious investment from the
lenders, and the ability to stay within your
means as a borrower.
I. Small lending and big ideas
Microfinance has its roots in rural communities
and can be traced back to almost the start of
currency usage. Without direct access to
banking or money lending organizations, rural
communities resorted to non-traditional
methods to generate capital. The 1960s and
1970s saw the advent of modern microfinance
led by Nobel laureate Muhammad Yunus, who
started the Grameen bank in Bangladesh.
When combined with the general push by
governments around the world to fund
marginal farmers, disparate microlending
initiatives evolved into well-defined communal
lending schemes. These micro enterprise
lending programs had an almost exclusive
focus on credit for income generating
activities, targeting very poor and often
women borrowers.
Early adopters of microfinance saw very
encouraging results – in many cases, rural
entrepreneurs were able to develop
businesses, and bolster local economies.
Borrowers, especially women, were able to
repay loans and set up sustainable
livelihoods. While initial success stories of
microfinance painted a very optimistic picture,
microfinance institutions often faced a very
difficult time assessing the risk involved in
dealing with rural communities.
The plot thickens
By the late 1980s and 1990s, studies on the
impact of microfinance showed that many of
the people that the system aimed to help were
still unable to lift themselves out of poverty.
Other studies found that microloans were
most beneficial to clients who were above the
poverty line. These clients generally had the
stability to take bigger risks which usually had
better payoffs. Poorer borrowers often used
loans to sustain their lifestyle which caused
debt spirals and eventually led to loan
defaulting. It turns out that you have to be rich
in order to be poor.
Risky business
Big banks and other financial institutions have
developed and used detailed models of risk
assessment for decades. These risk
assessment models allow them to ascertain,
with some accuracy, the probability of a client
returning a loan. However, models of risk that
the bigger institutions used usually fell apart
when trying to deal with rural groups. Poor
communities do not possess liquid or
immobile assets, the ability to repay high
interest loans, or records of repayment fidelity.
Difficult last-mile access problems further made poor communities inaccessible to
only in the last mile could also reap non-
monetary benefits from funding a rural
enterprise through the invigoration of local
economies, and could use this value
generation to offset high interest rates. That
being said, microfinance groups could not
easily discern who could repay the loans
amongst their poorer clients. A direct result of
this was that microfinance institutions started
catering towards the better off rural sections,
adjusted their interest rates accordingly, and
left the poorest members behind.
II. Process for progress
Even though most micro-financial initiatives
are designed to be non-profit in nature, they
are certainly looking to be a sustainable
business and not incur losses. A key issue that
comes up is to determine which customers will
be able to pay back their loans and in the
process contribute to the society effectively.
Unlike their big-money counterparts,
microfinance companies cannot make
generalized risk models which are applicable
across societies. However, thanks to technological progress, the barrier to entry for specialized and powerful number crunching has dropped dramatically. When
this is combined with the influx of
communication technology into rural India, we
get an unprecedented ability to connect with
low income societies through powerful
computing and communication based
technologies.
Correlations in unlikely places
Rural borrowers do not possess the data
points, such as a credit history and
documented assets, that financial institutions
have traditionally used in order to assess
repayment likelihoods. This lack of credit
information does not mean that the poor have
bad credit. This is where big data comes in –
modern micro-financial institutions are now
looking at utilizing the vast quantities of
disparate data that rural people generate in
order to build specialized, local risk models.
Take Vodafone's Chota Credit as an example
– the company analyses a prepaid user's
usage, regularity of payments, recharge
amounts, phone usage statistics, and can
issue nanoloans in the form of credit to the
user's balance from 10 rupees up to 197
rupees. Apart from telephonic connectivity,
modern cell phones give people easy access
to online platforms as well. Companies like
Kabbage scrutinize online profiles to
determine credit worthiness, while companies
like DemystData generate their own data by
1. This is because they usually needed a loan in order to support a lifestyle that their existing income could not already support.
2. As we will see in the following sections; a financial model is only as good as the assumptions and linearizations it makes. This was one of the primary causes of the 2009 credit defaulting crisis.
3. Companies like Amazon provide cloud services that start for free! (http://aws.amazon.com/free/?sc_channel=PS&sc_campaign=AWS_Free_Tier_2013 )
4. A United Nations report suggests that Indians have better access to cell phones than toilets.
The Mahalanobis distance
How can we compare a client who has three children
with a client who recharges their phone for 40 rupees
twice a month, or another client who is a woman? Each
of these data points are as different as the features they
quantify. Professor P. C. Mahalanobis introduced a
scale-invariant distance metric that allowed for the
comparison of data-sets by looking at the statistical
variance of the data. By removing the scale of the data
and comparing values based on their statistical
properties, such as offset from the mean of a datum and
the variance of the population, the Mahalanobis distance
allows us to meaningfully compare binary, discrete, and continuous features with one another. This
lets us compare apples and oranges!
administering real-time tests to determine
risk. Other companies look at pulling
correlations from government statistics,
household compositions, as well as utilities
usage.
By pooling together all these non-traditional
data points, companies can try to paint a better
picture of how rural risk can be characterized.
We have talked briefly about how
microfinance groups can get information on
clients, and from which sources; now let's look
at how they can process this data. Using these
data points, non-linear decision boundaries
will have to be designed which can break up
clients into various credit ratings. The first step
in this process is to make the data
comparable. This consists of two parts: first
we normalize each feature amongst its peers, and then we look at a statistical distance metric instead of a raw distance metric (a good example here is the use of the Mahalanobis distance).
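A minimal sketch of the computation in Python with NumPy is given below; the tiny client table is invented purely to show the mechanics, and a real system would use far more records and features.

    import numpy as np

    # Hypothetical client features: [children, recharge amount (Rs.), recharges per month]
    clients = np.array([[3, 40, 2], [1, 120, 4], [2, 60, 2],
                        [0, 200, 6], [4, 30, 1], [2, 80, 3]], dtype=float)

    mean = clients.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(clients, rowvar=False))

    def mahalanobis(x):
        # Distance of one client from the population, scaled by the covariance structure
        d = x - mean
        return float(np.sqrt(d @ cov_inv @ d))

    new_client = np.array([1, 40, 2], dtype=float)
    print("Mahalanobis distance of new client:", round(mahalanobis(new_client), 2))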
Clusters, curves, and plain old hyper-planes
Now that we have a normalized feature set, we
can look at de-cluttering the data. What we are
really looking for are features that are strongly
correlated with a person's risk rating. Methods
such as principal component analysis can be
used to determine which features are strong
indicators of risk, and which features are
merely noise.
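The snippet below is a minimal sketch of this step using scikit-learn; the feature matrix is random stand-in data in which two columns are nearly copies of others, so most of the spread is captured by the first few components. In practice the columns would be the normalized client features discussed above.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    base = rng.normal(size=(200, 3))                                # three informative features
    extra1 = base[:, :1] * 0.95 + rng.normal(0, 0.05, (200, 1))    # near-duplicate feature
    extra2 = base[:, 1:2] * 0.90 + rng.normal(0, 0.05, (200, 1))   # near-duplicate feature
    features = np.hstack([base, extra1, extra2])

    pca = PCA(n_components=5).fit(features)
    # Fraction of the data's spread explained by each principal component
    print(pca.explained_variance_ratio_.round(3))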
Principal Component Analysis (PCA)
PCA is used to understand and easily visualize which features play an important role in determining the distribution of the data. This gives data scientists deeper modeling insights and helps them make better choices when pruning data sources. The figure shows an example of how PCA can be used.
Figure – This is a trivialized example of how we can use PCA to simplify a model. In sub-figure (a), feature 1 plays a slightly bigger role in governing the spread of the data points than feature 2. However, the relative size of the role played by feature 2 with respect to feature 1 means we cannot afford to ignore either feature without a substantial drop in the model's performance. In sub-figure (b), both features have about the same control over the data spread. In the third sub-figure, (c1), feature 1 predominantly controls the spread of the data. This is particularly exciting because we can drop the second feature to reduce the complexity of our data distribution model for a relatively low loss in data fidelity, as shown in (c2).
Figure 2 : Forms, social media, cell phones can be sources to determine credit rating.
Using the good features, we can now try to find decision boundaries which can help us classify new clients. There are numerous machine learning tools that we can use for the job, but broadly we can use either:
l Supervised learning – where we can tell the algorithm about the credit rating for each client based on previous records, or
l Unsupervised learning – where algorithms can look into learning relationships on their own.
Complex data sets can be handled in the cloud using distributed statistical tools.
The final result is a decision boundary which we can use to determine the credit worthiness of an individual. When a new client comes in, based on the person's feature set, we can then make a better assessment of the person's ability to pay back a loan.
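As a minimal sketch of the supervised route, the snippet below trains a small decision tree on invented client features and repayment labels; a real deployment would use the institution's own historical records and a proper train/validation split.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)
    # Hypothetical features: [recharge amount (Rs.), recharges per month, household size]
    X = rng.normal(loc=[60, 3, 3], scale=[30, 1.5, 1.5], size=(300, 3))
    # Invented rule for illustration: regular, higher spenders repaid more often
    y = ((X[:, 0] + 20 * X[:, 1]) > 110).astype(int)   # 1 = repaid, 0 = defaulted

    model = DecisionTreeClassifier(max_depth=3).fit(X, y)
    new_client = [[45, 2, 4]]
    print("predicted repayment class:", model.predict(new_client)[0])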
III. Looking past the numbers
The main goal of microfinance initiatives is to
give low income sections of society the
opportunity to access financial tools in order to
help them out of poverty. Better risk analysis
can help these financial institutions reach out
to even the poorest groups in rural
communities. Thanks to the mainstream
adoption of big data analytics, improvements
in communication, and the gaining popularity
of machine learning tools, cheap and easy
access to these resources has never been
easier. The ability to have a methodical impact in the last mile is now within reach.
However, risk analysis is only as good as the underlying assumptions made during the modeling process. It is important to keep these assumptions in mind as rural risk modelling systems develop. The financial crisis of 2007-2008 is a good example of how indiscriminate use of mathematical functions with disregard to the foundations upon which they were built can have serious consequences.
The need of the hour is judicious invigoration of rural economies to stimulate growth. Long term stability can come about by creating jobs which not only provide adequate monetary compensation for the poor, but also opportunities to grow.
Bibliography
[1] R. Krieger, "The Evolution of Microfinance," [Online]. Available: http://www.pbs.org/frontlineworld/stories/uganda601/history.html. [Accessed 04 May 2014].
[2] M. Bateman, Why Doesn't Microfinance Work?, Zed Books, 2010.
[3] D. Hulme and P. Mosley, Finance Against Poverty, London: Routledge, 1996.
[4] A. Karnani, "Microfinance Misses Its Mark," Stanford Social Innovation Review, Summer 2007. [Online]. Available: http://www.ssireview.org/articles/entry/microfinance_misses_its_mark. [Accessed 07 May 2014].
[5] T. Pratchett, 'Captain Samuel Vimes 'Boots' theory of socioeconomic unfairness' - Men at Arms, Victor Gollancz, 1993.
[6] G. Dionne, "Risk Management: History, Definition and Critique," CIRRELT, 2013.
[7] R. Kühn, "Risk Modeling," King's College London, 23 April 2014. [Online]. Available: http://www.mth.kcl.ac.uk/~kuehn/riskmodeling.html. [Accessed 24 May 2014].
[8] Vodafone India Limited, "Prepaid Plans," [Online]. Available: https://www.vodafone.in/pages/prepaid.aspx. [Accessed 07 May 2014].
[9] D. Taylor and M. Schlein, "How Big Data Can Expand Financial Opportunities For The World's Poor," Forbes, 25 April 2014. [Online]. Available: http://www.forbes.com/sites/realspin/2014/04/25/how-big-data-can-expand-financial-opportunities-for-the-worlds-poor/. [Accessed 03 May 2014].
[10] J. Ekstrom, "Mahalanobis' Distance Beyond Normal Distributions," UCLA Department of Statistics, Los Angeles.
[11] P. C. Mahalanobis, "On the generalised distance in statistics," 1936.
[12] F. Salmon, "Recipe for Disaster: The Formula That Killed Wall Street," Wired Magazine, 23 February 2009. [Online]. Available: http://archive.wired.com/techbiz/it/magazine/17-03/wp_quant?currentPage=all. [Accessed 04 May 2014].
[13] A. Hollis and A. Sweetman, "Complementarity, Competition and Institutional Development: The Irish Loan Funds through Three Centuries," 1997.
[14] Global Envision, "The History of Microfinance," 14 April 2006. [Online]. Available: http://www.globalenvision.org/library/4/1051. [Accessed 5 May 2014].
[15] S. Rutherford, The Poor and Their Money, 2000.
[16] United Nations University, "Greater Access to Cell Phones Than Toilets in India: UN," United Nations University Media Relations, 14 April 2010. [Online]. Available: http://unu.edu/media-relations/releases/greater-access-to-cell-phones-than-toilets-in-india.html. [Accessed 07 May 2014].
Vaishali Patil
Areas of Interest
Embedded Systems Design and Analysis
About the Author
Data Analytics for Warranty Claims
I. Introduction
As a customer, when you buy a new product
like watch, mobile, TV, Play Station, or a new
CAR, what do you expect? I guess good
quality product with less price, less
maintenance cost and good reputation of
company in market. Suppose, you want to
purchase a mobile and you have two options.
One costs Rs. 5000 with 1-year warranty
period and another costs Rs. 6000 with 2
years warranty period, and you plan to use this
mobile for next 2 years. Then buying the
mobile with 2 years warranty would be
cheaper considering zero maintenance cost.
Warranty is legal assurance from the
manufacturer that if a product fails to function
as expected in warranty period then it will be
repaired or replaced by company, and
customer will not have to pay for it. In warranty,
one can't claim money refund like guarantee, if
the product fails to function as expected. Have
you ever thought of how companies decide the
warranty policies and which factors are
considered while deciding the policies? Why
don't all manufacturers simply give a lifelong warranty to attract customers and increase product sales?
Being a product manufacturer, one should
always consider product reliability, customer
satisfaction, profitability, and competition.
These major factors influence overall
business management and strategies. These
factors need to be considered in warranty
policy and hence designing warranty policies
is a very complex task.
The current state of the art includes data analytics of warranty claims and data analytics of failure data [1]. Failure data is gathered by doing
testing at manufacturer site. Generating all
test cases for field testing is very expensive
and not always feasible. Warranty claim data
is actual field data submitted by customer and
it includes information related to the time to
failure, type of failure, failure part, etc.
Manufacturer can analyze this data and find
out the cause of failures, predict the future
failures and take corrective action in advance
to minimize future warranty claims and can
save on warranty reserves. Subsequent
sections will give brief overview on warranty
claim process, Warranty Data Analysis (WDA)
methods, applications of WDA, and popular
WDA tools.
Let us see an example to understand how a
warranty claim gets processed. Suppose that
a customer buys xyz company's mobile with 6
months warranty and after 2 months customer
starts facing some problems. As the product is
in warranty period, customer goes to the
dealer from where he/she had purchased the
mobile. Dealer asks the customer to fill
warranty claim form with the details including
model number, date of purchase and
compliant details and so on. Technician at
dealer side investigates the problem. If it is
repairable at dealer's side, technician will
charge the repairing cost to the manufacturer.
If some part is damaged, dealer will place
replacement order of particular part to the
manufacturer. If the mobile is not repairable at
dealer's side, dealer will send it back to the
manufacturer for replacement.
In this entire process manufacturer is charged
with either repairing cost or replacement cost
(for some part or the entire piece). However,
the process does not stop here; now,
manufacturer has received very valuable
information like failure description, model
number of product, batch number, and
solution employed to fix the problem. How can
the manufacturer use this information?
Suppose, same type of problem is reported for
a particular batch, then the manufacturer can
use this information to find out whether the
problem is with the production line or
parts/techniques used during manufacturing
of the batch. If the problem appears to be with
the production line, manufacturer can upgrade
the production line and can avoid future
warranty claims. If the problem is with
par t i cu la r component used dur ing
manufacturing then manufacturer can claim
that amount to the supplier of the defective
part. If the problem is severe, manufacturer
can notify the customers who are using same
products who may experience similar problem
in near future. Manufacturer can also call back
such products and fix the problem in advance.
This will help to maintain company reputation
and customer satisfaction as well. However,
main challenge in this entire process is to
analyze the received warranty data.
II. Warranty Claim Process
III. Warranty Data Analysis
Warranty data is usually submitted by the
customer in the form of textual description of
the problem. This data is present in both
structured and non-structured format. Usually, submitted warranty data is coarse data, and
this is because of many reasons such as delay
in the reporting, aggregated data received
from the vendor, some information is missing
or vague and so on. Warranty Data Analysts
are facing challenge in analyzing such coarse
data which is of poor quality and extracting the
valuable information. Different algorithms
have been developed using data mining and
text mining techniques to extract valuable
information from such data. Then predictive
algorithm can be applied on this data to predict
future claims, failure etc. With such analysis
one of the automobile and motorcycle manufacturers was able to reduce warranty cases from 1.1 to 0.85 per vehicle, a 5%
reduction in warranty cases with an annual
savings of €30m [3].
Once the warranty data is gathered, the next step is to analyze the data. The analysis includes the following steps:
Step 1: Data fitting - From the nature of the data,
select the appropriate mathematical model
(some of the models are described in
'commonly used distributions' section), which
fits the data
Step 2: Estimating Parameters – Estimate the
parameters of the model which will describe
the data appropriately (How to determine
various parameters describing mathematical
model and significance of parameters is given
in 'commonly used distributions' section)
Step 3: Analysis Result – Generate analysis
results, these results can be used to predict
future failures in the field, reliability of
components, future warranty claims etc.
Table 1 shows data of cell phone failures
collected over period of 3 years for a particular
handset maker.
From the gathered data, maximum phone failures are observed during the initial quarters of service. The number of failures decreases as time-in-service increases. This data set can be mathematically expressed by an exponential function. Using the obtained exponential function, future failures for quarters 11, 12 and 13 are determined as shown in the table.
Quarter Number | No. of Failures Observed per 10,000 Phones (Actual) | No. of Failures per 10,000 Phones (Calculated) | Error
1 | 12.05 | 12.05 | 0
2 | 11.27 | 9.86 | 1.41
3 | 9.68 | 8.07 | 1.61
4 | 10.98 | 6.61 | 4.37
5 | 6.04 | 5.41 | 0.63
6 | 3.36 | 4.43 | -1.06
7 | 2.59 | 3.62 | -1.03
8 | 2.00 | 2.97 | -0.96
9 | 1.16 | 2.43 | -1.26
10 | 1.26 | 1.99 | -0.72
11 | NA | 1.63 | 0
12 | NA | 1.33 | 0
13 | NA | 1.09 | 0
Sum of Error Square: 29.39
Table 1. Cell Phones Failure Data
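A minimal sketch of such a fit in Python is shown below; it takes the actual failure counts from Table 1, fits a simple exponential decay with SciPy, and extrapolates to quarters 11-13. The fitted numbers need not match the calculated column of Table 1 exactly, since that column was produced by the author's own calculation.

    import numpy as np
    from scipy.optimize import curve_fit

    quarters = np.arange(1, 11)
    failures = np.array([12.05, 11.27, 9.68, 10.98, 6.04,
                         3.36, 2.59, 2.00, 1.16, 1.26])   # per 10,000 phones (Table 1)

    def expo(t, a, b):
        # Simple exponential decay: failures fall off with time in service
        return a * np.exp(-b * t)

    (a, b), _ = curve_fit(expo, quarters, failures, p0=(15.0, 0.2))
    for q in (11, 12, 13):
        print(f"quarter {q}: predicted {expo(q, a, b):.2f} failures per 10,000 phones")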
Fig. 5. Effect of σ on Normal Distribution pdf [7]
3. Weibull Distribution
Weibull distribution is a versatile function, widely used in reliability engineering. It takes the form of other types of distributions depending on the value of β. The exponential distribution is a special case of the Weibull distribution where β = 1, whereas for β ≈ 3 it approximates the normal distribution.
Following graph shows different shapes of
Weibull distribution with changing values of β.
Fig. 6. Weibull pdf with 0 < β < 1, β = 1, and β > 1 [7]
The 3-parameter Weibull probability density function is defined as
f(t) = (β/η) * ((t − γ)/η)^(β−1) * exp(−((t − γ)/η)^β), for t ≥ γ
where γ is the location parameter (failures start to occur only after γ), β is the shape parameter, and η is the scale parameter. Increasing η while keeping β constant has the effect of stretching out the probability density function, as shown in the figure below.
Fig. 7. Weibull pdf plot with varying values of η
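A minimal sketch with SciPy is given below; the shape, scale and location values are arbitrary choices used only to show how the three parameters enter the calculation of failure and survival probabilities.

    from scipy.stats import weibull_min

    beta, eta, gamma = 1.5, 100.0, 10.0    # assumed shape, scale and location (hours)
    dist = weibull_min(c=beta, scale=eta, loc=gamma)

    # Probability that a unit fails within its first 120 hours of operation
    print("P(failure by t = 120):", round(dist.cdf(120), 3))
    # Reliability (survival probability) at t = 200 hours
    print("R(200):", round(dist.sf(200), 3))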
IV. Applications of WDA
Information extracted from WDA can be used
by companies to plan various business and
management strategies. Some of the
widespread applications of WDA in industry
are [2]
Early Detection of Reliability Problem
WDA can give early indication of future
failures. With this information, companies can
implement solutions to avoid such failures and
thus reduce future warranty claims.
Design Modification
Data obtained with analysis may suggest
some design modification in the product to
improve the performance.
Field Reliability Estimation
Estimating reliability of product in field helps
companies deciding warranty policy, planning
for maintenance and preparing spare parts.
Claim/Cost Estimate
One can predict future warranty claims using
WDA data. This helps companies in reserving
budget for future warranty claims.
V. WDA analysis tools
Some of the commonly used WDA tools, and the
purposes for which they are used in
automotive industry, are listed below:
1. IBM SPSS predictive analytics solutions for
warranty claims[3]:
Significantly reduce warranty costs, improve
product quality, and enhance customer
satisfaction
2. SAS® Warranty Analysis[5]
Reduce warranty costs, improve product
quality, and improve brand reputation
3. Weibull++: Life Data Analysis (Weibull
Analysis) Software Tool [8]:
Calculate reliability of products
VI. Conclusion
Warranty data provides valuable insights
about the product reliability and quality to the
manufacturer. Many companies are already
recognizing importance of analyzing warranty
data and have started investing in warranty
programs. Studies show that warranty-related expenses cost from 0.5% to 7% of total
product revenue, so even small improvements
in the warranty process can lead to significant
savings and increase overall profit of
companies [3]. Even though this is true,
extracting valuable information from the
collected warranty claim data and then
processing it to predict some useful statistics
is a very tedious job and requires strong data
analytics background. In near future,
development of strong algorithms, tools, and
methods to analyze such data can be
differentiating factors in a competitive market
of warranty data analytics. Effectiveness of
warranty claim data analysis strongly depends
upon data submitted at different points in the
process and its quality. Hence, a process must
be established to improve quality of gathered
data which will help to make warranty claim
data analysis more effective.
VII. References
1. Patrick Tudor, We Predict, Alan J. Watkins and Swansea, “The Analysis of failure data and warranty claims data : A comparison and some lessons for automotive manufacturers”, May 21, 2013, Available at http://wepredict.co.uk/getfile.php?type=site_documents&id=AnalysisOfFailureData.pdf
2. Shaomin Wu “Warranty Data Analysis: A Review” Jan 10, 2012, Available at http://kar.kent.ac.uk/31005/1/QREI_UoK.pdf
3. SPSS WDA Tool- “IBM SPSS predictive analytics solutions for warranty claims”,Available at https://www950.ibm.com/events/wwe/grp/grp006.nsf/vLookupPDFs/Whitepaper_IBM%20SPSS%20Predictive%20Warranty%20Analytics[1]/$file/Whitepaper_IBM%20SPSS%20Predictive%20Warranty%20Analytics[1].pdf
4. SAS WDA Tool -“SAS® Warranty Analysis” Available at http://www.sas.com/content/dam/SAS/en_us/doc/productbrief/sas-warranty-analysis-100347.pdf
5. Warranty Data Analysis Overview, Available at http://reliawiki.org/index.php/Warranty_Data_Analysis
6. Weibull ++7 WDA Tool, Available at http://www.reliasoft.com/newsletter/v6i2/new_era.htm
7. “Life Data Analysis Reference” Feb 11, 2014, ReliaSoft Corporation, Tucson, Arizona, USA
8. Weibull Distribution, Available at http://www.reliasoft.com/Weibull/
Richard Feynman was an eminent physicist. He was also a very good artist, bongo player, and prankster. While working for the Manhattan Project at Los Alamos National Laboratory in New Mexico, Feynman found himself occasionally bored at the isolated location. "There isn't anything to do there," he used to complain. To fill the time, he played pranks on his friends. Once he worked out the combination to the locked filing cabinets belonging to nuclear physicist Frederic de Hoffmann. He wrote a series of cryptic notes and left them inside. After decoding those notes, de Hoffmann was alarmed, thinking that a saboteur had gained access to the secrets of the atomic bomb.
An electron and a positron go
into a bar.
Positron: "You're round."
Electron: "Are you sure?"
Positron: "I'm positive."
There are many interesting stories about Einstein. He was known for his distractedness. Once he was traveling on a train in Germany. The conductor approached him. Before the conductor said anything, Einstein started searching his pockets for the ticket. The conductor recognized Einstein and told him that he could ride the train for free. Einstein thanked him and said, “If I do not find my ticket, I would not know where to get off the train.”
Human behavior is commonly perceived as something which is difficult to predict. Think twice! Different institutions including government agencies, manufacturing industries, social networking sites, insurance agencies are armed with methods of predictive analysis that predict human behavior for different purposes. In the book, titled 'Predictive Analysis: The power to predict who will click, buy, lie and die', the author takes us on a tour of a typically lesser-known world of predictive analysis which is full of examples, entertainment, knowledge and astonishment.
In the introductory chapter, the author gives us a broader picture of data analytics, its effects and also states that predictive analysis forms an important part of 'Data Analytics'. He also makes us aware of the differentiation between forecasting and prediction, the former being done at a macroscopic level and the latter at the microscopic level. He also highlights that even with state-of-the-art technologies, accurate prediction is usually not possible, though these technologies take us quite close to the facts. An enormous amount of data collected at various places such as blogs, online transactions, news, social networking sites, etc. forms the basis of prediction. From this data, one can get personal as well as collective sentiments of a group of people. The author quotes an interesting example of a firm, SNTMNT, which allows people to trade based on sentiments of people observed through Twitter/blogs etc.
In the next chapter, the author then explains an important factor of predictive analysis, i.e. machine learning, through an interesting example of Chase Bank. The bank faced a challenge when the number of mortgage holders increased significantly. Predictive analysis (PA) helped the bank to sail through this difficult phase by marking every micro risk in terms of its credit score, classifying loans into different categories, learning from data, and building the learning machine which would then classify every new customer with accurate prediction of risk. The author tells us the importance of learning from positive as well as negative experience and explains the simple technique of decision trees to make us understand how machine learning works. He concludes the chapter by establishing a very good point that every risk becomes an opportunity when predictive analysis is in action.
BOOK REVIEW
Predictive Analysis: The power to predict who will click, buy, lie and die
Author: Eric Siegel
'The Ensemble Effect' is one of the most interesting chapters of this book, where the author takes us through an exciting story of the Netflix analytics contest. The problem statement of the competition is to build a predictive model which can improve the accuracy of movie ratings by 10% above Netflix's own recommendation model. In order to get the accurate ratings, many teams join, generating the ensemble model of predictive analysis. The dynamics of the ensemble model are explained in a detailed manner. In another interesting example, 'Watson and the Jeopardy challenge', the author tells us about an intelligent machine – Watson, built by IBM to contest in the Jeopardy! show. He makes us aware of the challenges of natural language processing and the difficulties in designing an intelligent machine to analyze open questions and answer them correctly. Readers feel thrilled to learn the size of information fed to the machine and how predictive analysis helps the machine build on knowledge that it does not have so far.
One of the highlights of this book is an exclusive section where multiple examples which benefited from the use of predictive analysis are described. These examples are from various domains including marketing, insurance, healthcare, safety and security, fraud detection, and telecom. They lead us to understand how Lloyds TSB increased annual profit to 8 million pounds by improving customer satisfaction, how Microsoft developed a GPS based technology to predict one's whereabouts after multiple years, how Hewlett Packard saved $66 million over 5 years by detecting false warranty claims, how Continental Airlines saved tens of millions of dollars by improving prediction of flight delays and so on. The variety of applications described in this section amazes us with the power and reach of predictive analysis. Apart from these examples, the author also discusses five effects of prediction.
All throughout the book, we keep on reading about the overreaching and surprising applications of predictive analysis. What we don't come across is even the slightest mention of probable side-effects of predictions all along these domains and applications. However, the book excites technocrats by the details of various techniques of predictive analysis and awes naïve readers by introducing the predictive analysis through very simplified, yet catchy examples.
Abhinav Khare
Areas of Interest
Indian History, Indian Politics, Design Thinking and Human Behavior
About the Author
Analytics Applied to Real World
I. Introduction
There is a famous saying in the field of BAI – Business Analytics and Intelligence – that “If you beat the data long enough then it will start talking”; the aforementioned phrase stresses the fact that the usefulness of data lies in its correct treatment, because data will depict only what we want it to depict. Absence of this understanding leads us to deducing such incredible relationships as that between the number of sparrows chirping in Calcutta and the number of centuries made by Sachin Tendulkar. The aim of this article is to create awareness among readers about the right treatment of statistical data through the application of BAI tools in order to make the right business decisions.
Another important function of BAI is to solve those complex mathematical problems related to business that seem simple initially but may turn out to be highly complex when studied in detail. Though this article will not attempt to solve such problems, it nevertheless tries to explain one such problem to appreciate the importance of BAI in day to day business proceedings. The traveling salesman problem is one such problem that involves number crunching of a very high magnitude and finds parallels in various scientific and business fields; the same is explained below:
If a salesman has to visit a certain number of cities such that he visits each city only once and the total distance traveled is minimum, and if a computer can evaluate 1 million routes per second, then it will take 77,000 years to complete all the routes for just 20 cities [1]. A 300 city problem will have 1.018 × 10^90 possible routes (a small back-of-the-envelope check follows the list below). To get an idea of the scale of this problem in other domains, find below the mapping of other applications:
l Circuit Board Drilling Applications ~ 17,000 cities
l VLSI fabrication ~ 1.2 million cities
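The back-of-the-envelope check promised above can be done in a few lines of Python, assuming that every ordering of the cities is counted as a separate route (which reproduces the 77,000-year figure quoted for 20 cities) and the same rate of one million route evaluations per second.

    import math

    def brute_force_years(n_cities, routes_per_second=1_000_000):
        routes = math.factorial(n_cities)          # every ordering counted separately
        seconds = routes / routes_per_second
        return routes, seconds / (3600 * 24 * 365)

    for n in (10, 15, 20):
        routes, years = brute_force_years(n)
        print(f"{n} cities: {float(routes):.3e} routes, roughly {years:.3e} years")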
There are multiple commercial applications of BAI; some of them are listed below:
l Airline industry - Flight scheduling
l Dynamic pricing and revenue management
l Telecommunications - Queuing theory, network algorithms
l Transportation industry - Routing, logistics
l Production – Inventory theory, simulation, analytical production line models, supply chain models
l Engineering and development – Design optimization, scheduling, resource allocation
l Finance – Portfolio optimization, capital budgeting
Like any other scientific method, BAI too consists of numerous techniques such as linear, non-linear and logistic regression, linear programming, Markov chains, classification trees such as CHAID and CART, and forecasting tools such as ARIMA for the aforementioned business applications. In the following few sections we will get introduced to a few basic BAI techniques and understand their application in the real world.
II. Regression
Linear Regression analysis is clearly one of the important techniques in the field of business analytics. It is a statistical process which is used to estimate relationships among different variables; there is one dependent variable and one or more independent variable(s). Regression helps us to understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. It also helps us to understand the degree of relationship between dependent and independent variables and hence to prioritize our efforts while doing resource planning and budget allocation, launching price discounts schemes etc. Regression analysis is also very widely used for prediction especially in the fields of operations, sales, financial analysis and marketing.
Indeed Regression is a significant tool that gives a cause and effect relationship among various factors. Reliability of the results is somewhat guaranteed by validation methods such as the coefficient of determination to check the goodness of fit of the regression, Analysis of Variance (ANOVA) and the F test to check the overall fitness of the regression model, t-tests to validate the relationship between the dependent and each individual independent variable, and residual analysis to check the model adequacies. However, one still needs to have a clear understanding of the data that has been gathered for analysis. Regression might throw at us results that are unexpected because of the data that has been fed. Hence, the data, the assumptions made while generating that data, and the different use cases play a very important role in making regression a fruitful activity.
One interesting application of regression can be the prediction of the players' auction price in IPL; let's get into the shoes of Vijay Mallya – the owner of RCB – who wants to have a guiding price for his target players before going into the auction hall. Initially, he would need all the parameters that are important for a cricketer's performance; a few of them are listed below:
l Batting/Bowling average (x1)
l Number of sixes hit (x2)
l Number of runs made/wickets taken (x3)
l Whether Indian or Foreign player (x4)
l Age of the player (x5)
l Number of matches played (x6)
l Past auction price (y)
As a next step, he would need the historical data for the above parameters for as many players as possible. Upon feeding this into a regression tool such as SPSS or MS-Excel, an equation showing the relationship among the various parameters can be obtained; the same can be depicted as below:
Y = a + a1*x1 + a2*x2 + a3*x3 + a4*x4 + a5*x5 + a6*x6
where 'a' is a constant and a1, a2 ... a6 are the coefficients of the parameters and may take positive/negative values depending upon past performances. Regression will also help to understand which of these parameters affect the auction price the most and which are the least significant and can be ignored.
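A minimal sketch of fitting such a model with scikit-learn is shown below; the handful of player records and prices are invented placeholders standing in for the full historical auction data described above, so the coefficients it prints are purely illustrative.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical rows: [average, sixes, runs, Indian(1)/foreign(0), age, matches]
    X = np.array([[38.0, 45, 1200, 1, 27, 60],
                  [29.5, 20,  800, 0, 31, 45],
                  [41.2, 60, 1500, 1, 24, 70],
                  [25.0, 10,  500, 0, 34, 30],
                  [33.7, 35, 1000, 1, 29, 55]])
    y = np.array([9.0, 4.5, 11.0, 2.0, 6.5])       # past auction price (made-up units)

    model = LinearRegression().fit(X, y)
    print("constant a:", round(model.intercept_, 2))
    print("coefficients a1..a6:", model.coef_.round(3))

    # Guiding price for a hypothetical target player
    print("predicted price:", round(model.predict([[36.0, 40, 1100, 1, 26, 50]])[0], 2))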
III. Linear Logistic Regression Analysis
It is another form of regression where instead of calculating the 'value' of a parameter, the 'probability' of the occurrence of that value is calculated. In most applications, logistic regression classifies data into different categories; for example, classifying bank customers into high, medium, and low risk customers, or classifying customers by who is likely to accept or deny loans from banks.
Another interesting application of logistic regression can be in the field of HR practices, especially when thousands of job applications are received for a coveted position or during a campus recruitment drive. Following are a few basic steps for using logistic regression in campus recruitment:
l Firstly, data related to the candidates' educational qualifications (such as marks in SSC, HSC, Graduation, Post-Graduation, etc., board/university), age, and past experience needs to be collected.
l Additionally, similar data of the past and present employees needs to be obtained.
l A logistic regression model – for different job profiles – needs to be designed based on current and past employees' data.
l Once the model is ready, data of the applicants can be fed in order to do first level filtering based on the probability of the candidate making it till the final recruitment stage.
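The snippet below is a minimal sketch of that idea with scikit-learn; the candidate features and the selected/not-selected labels are fabricated stand-ins for the historical employee data described in the steps above.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(7)
    # Hypothetical features: [SSC %, HSC %, graduation %, years of experience]
    X = rng.uniform([50, 50, 50, 0], [95, 95, 90, 5], size=(200, 4))
    # Invented label: candidates with stronger academics tended to reach the final stage
    y = (X[:, :3].mean(axis=1) + 2 * X[:, 3] > 75).astype(int)

    model = LogisticRegression(max_iter=1000).fit(X, y)
    applicant = [[72, 68, 70, 1.0]]
    # Probability of this applicant making it to the final recruitment stage
    print("probability:", round(model.predict_proba(applicant)[0, 1], 2))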
IV. ARIMA (Auto Regressive Integrated Moving Average)
It is one of the powerful techniques for making forecasts based on historical data; forecasting is a very important component of business functions such as sales, planning and operations. Typically, while making a forecast, factors such as Trend (a persistent, overall upward or downward pattern over a long time), Cyclicity (repeated upward and downward movements over a time horizon of 2-10 years depending on the industry), Seasonal components (regular upward/downward movements) and other erratic/unsystematic disturbances due to unforeseen factors such as natural calamities, accidental hazards, etc. need to be considered. It is because of these aforementioned factors that traditional techniques of forecasting such as moving average and exponential smoothing may not give the desired results, as they fail to take care of the errors in the data. ARIMA has a component called moving average that reduces errors induced in the data due to the aforementioned factors.
Application
Consider an automotive Tier 1 company that is present majorly in the aftermarket segment and wants to come up with a sales forecast for the coming month. To use ARIMA for forecasting, it initially needs to identify the critical factors that impact its sales figures. These factors include dealers' discount (a1), average lead time (a2), promotional budget (a3), etc. As a next step, the company needs to gather past data related to the identified parameters. Once the data is worked upon using ARIMA, a model depicting the relationship among the various criteria, including an error function, can be constructed; the same is depicted below:
S(t) = α0 + α1*S(t-1) + α2*S(t-2) + ... + β0 + β1*ε(t) + β2*ε(t-1) + ...
where the coefficients 'α' and 'β' and the error 'ε' are dependent on the past figures of a1, a2 and a3 mentioned above. The obtained model will give the sales figure of a particular period when the values of S(t-1), S(t-2), etc. are entered.
V. Markov Chain
It is a stochastic process – a process where the future state depends only on the present state and is independent of past states – that helps to determine the future or stable state of various business conditions. Significant applications of Markov chains lie in areas such as market share among different competitors, value of an investment portfolio, customer value, customer retention, inventory level, etc.
A Markov chain is of the form
S(t) = S(0) * P^t    (1)
where S(t) is the future state, S(0) is the present state, and P is the transition probability matrix.
Following example depicts an application of Markov chain in retail business:
Company 'A' that runs a retail store wants to take a decision on how much budget it should allocate for promotional activities for its washing powder product line. It has tracked the buying pattern of several housewives over a certain period of time for different brands of washing powders. It has consumer purchase data such as who consistently bought certain brands, who switched frequently from one brand to another, and the quantity bought. Table 1 depicts some hypothetical values for the buying pattern of 2 washing powder brands:
Week ending | Period | No. of customers that bought product of brand 1 | No. of customers that bought product of brand 2 | No. of customers that bought products from both brands | No. of customers that did not buy products from any of these two brands
5-Jan | 1 | 22 | 44 | 9 | 25
12-Jan | 2 | 18 | 44 | 8 | 30
19-Jan | 3 | 18 | 38 | 11 | 33
26-Jan | 4 | 22 | 44 | 10 | 24
2-Feb | 5 | 24 | 39 | 11 | 26
9-Feb | 6 | 26 | 35 | 9 | 31
Table 1. Details of sales of two washing powder brands
Based on the above data, the probability transition matrix P is calculated. It can be depicted (hypothetical values) as follows:
0.67 0.0875 0.023 0.22
0.04 0.72 0.044 0.20
0.124 0.24 0.512 0.124
0.146 0.26 0.022 0.57
Table 2. Probability Transition Matrix
Let's assume that for the current week the buying pattern is as follows:
No. of customers that bought product of brand 1 | No. of customers that bought product of brand 2 | No. of customers that bought products from both brands | No. of customers that did not buy products from any of these two brands
21 | 50 | 6 | 23
Table 3. Current Week Buying Pattern
By using equation 1 above, the expected buying pattern for next week (week no. 2) can be calculated as S(2) = S(0) * P^2. The resultant values that we get after matrix multiplication are S(2) = (40 10 25 10).
This means that 40 consumers will buy product from brand no. 1, 10 will buy product from brand 2, 25 will buy both the products, and 10 will not buy any of these 2 products. Hence, it makes sense to run a discount scheme for brand 1 as there is a likelihood of people buying products of brand 1 in higher numbers.
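The arithmetic behind equation 1 can be reproduced in a few lines of NumPy, as sketched below using the transition matrix of Table 2 and the current-week pattern of Table 3; the result can then be compared with the S(2) vector quoted above.

    import numpy as np

    # Transition probability matrix P from Table 2
    # state order: [brand 1, brand 2, both brands, neither brand]
    P = np.array([[0.670, 0.0875, 0.023, 0.220],
                  [0.040, 0.7200, 0.044, 0.200],
                  [0.124, 0.2400, 0.512, 0.124],
                  [0.146, 0.2600, 0.022, 0.570]])

    s0 = np.array([21, 50, 6, 23])                # current week buying pattern (Table 3)

    # Expected pattern after two weeks: S(2) = S(0) * P^2
    s2 = s0 @ np.linalg.matrix_power(P, 2)
    print(s2.round(1))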
VI. Other Trends in BAI
BAI has moved much beyond numbers and has entered the realm of 'words and languages' through social media analytics, which forms the backbone of digital marketing. In an era where one bad tweet or bad Facebook post from a dissatisfied customer can make or break a product, it has become imperative for business firms to keep an eye on the social digital network, and BAI, through tools such as big data and sentiment analysis, empowers organizations to do so. BAI together with cloud computing and big data analysis has empowered businesses to realize consumers' needs even before a consumer herself comes to know about them, and thus has empowered organizations to be pro-active and ahead of the competition.
VII. Conclusion
With the world getting closely connected and organizations getting access to trillions of bytes of raw data, there arises a strong need to make sense out of this unsystematic information; business analytics and intelligence (BAI) is the tool that helps to deduce an effective message from this chaos. Apparently, the future is going to be more complex than our present. Business firms that are giving BAI its due respect and that are enabling and empowering their employees to get good exposure to it are rightly poised for riding the next tide of growth.
References
[1] Teaching notes: Academic course viz. Business
Analytics and Intelligence by Professor Dinesh
Kumar at Indian Institute of Management,
Bangalore
[2] Regression Analysis, Wikipedia Article, Online,
Available at:
http://en.wikipedia.org/wiki/Regression_analysis
[3] George P.H. Styan and Harry Smith, JR, 'Markov
Chain Applied to Marketing'
SCIENTIST PROFILE
Thomas H. Davenport
“Whether or not the term 'big data' proves faddish in the
months to come, the analysis of large volumes of
unstructured, multi-source data is here to stay,” says
the famous author and academic Thomas Davenport.
He is known as an expert in areas like analytics,
knowledge management, and business process
improvement. He is currently the President's Distinguished Professor of Information Technology and Management at Babson College, Massachusetts, USA. Thomas is
director of research at the International Institute for
Analytics. Thomas Davenport was born on October 17,
1954. He completed his Ph.D. in social science from Harvard University in 1980, and his thesis was 'Virtuous Pagans: Unreligious People in America'. He is a very sharp and deep thinker.
Davenport has written or coauthored around fourteen
books, which include the first books on analytical
competition, reengineering of business process,
achieving value from enterprise systems, and
knowledge management. He has written more than one
hundred articles published in MIT Sloan Management
Review, Harvard Business Review, California
Management Review, and the Financial Times.
Davenport has been a columnist for Information Week,
CIO, and Darwin magazines. Some of the famous and
latest books of Davenport are Big Data at Work (2014),
Keeping up with the Quants (2013), Judgment Calls - 12
Stories of Big Decisions and the Teams that Got Them
Right (2012), Analytics at Work (2010), Competing on Analytics (2007), Thinking for a Living - How to Get
Better Performances and Results from Knowledge
Workers (2005). Davenport was recruited as one of the first management thinkers to blog for Harvard Business, and his blog “The Next Big Thing” has become a favorite with readers, as has his Wall Street Journal blog.
In his books on analytics, he showed how prime firms
built analytical capabilities. His books explain how those firms moved beyond the approach of “going with the gut” when taking decisions about product pricing, inventory maintenance, talent hiring, and so on. In
addition to this, the books expose how data analysis and
systematic reasoning is used by managers and how it
helps to improve efficiency, profits and risk management
in their day-to-day activities. Most of the books explain
about the steps of initialization of analytics. Combining
the science of quantitative analysis with the art of sound
reasoning, Analytics at Work proposes a road map and
methods for releasing the vital information buried in your
company's data.
Davenport, throughout his articles, explains about how
big data is important and differs from the result of
traditional analytics. In addition to these he also talks
about moving analytics from IT to business and
operational functions. All of his publications are good guides for people in the areas of data analytics and decision making. He also advises analysts to put themselves in the place of an analytics consumer before starting their work. He also
suggests how to be an intelligent analytics consumer.
Davenport travels the world to stimulate, provoke, and fill
audiences with the innovative ideas, scenarios, and best
methods exposed in his books. In his talks, he covers a wide range of day-to-day and upcoming topics crucial to the success of any organization's mission. In recent
years, he usually speaks about decision-making and
analytics. Whether in talk or interview, his messages are
always clear, precise, and penetrating.
Davenport is one of the top 50 Business School
Professors in the world. He was included as one of only
four IT management thought leaders on their “100 Most
Influential People in IT” list, by Ziff Davis. He has been
listed as one of 10 “Masters of the New Economy” by CIO
Magazine, one of 25 “E-Business Gurus” by Darwin, and
also the third leading business-strategy analyst (just
behind Peter Drucker and Tom Friedman) by Optimize
Magazine. His presence is felt around the world through
his writings, thinking, talks, new ideas and trends in
business. Thomas Davenport, with his intelligent
thoughts, various books in areas of analytics, knowledge
management, and through overwhelming talks and
seminars, has introduced a new era of analytics.
Scientist Profile
Author
Smitha K P
Areas of Interest
Multicore Programming
Embedded Systems
Abstract:
Program parallelization involves multiple considerations. These include methods for data or control
parallelization, target architecture, and performance scalability. Due to the number of such factors, the best
parallelization strategy for a given sequential application often evolves iteratively. Researchers are
confronted with choices of parallelization methods to achieve the best possible performance. In this
paper, we share our experience in parallelizing a very large application (250K LOC) on shared memory
processors. We iteratively parallelized the application by leveraging selective benefits from automatic
as well as manual parallelization. We used YUCCA, an automatic parallelization tool, to generate
parallelized code. Using the information generated by YUCCA, we improved the performance by
modifying the parallelized code. This iterative process was continued until no further improvement was
possible. We observed performance improvement of 17% compared to 5% improvement reported in
the literature. The performance improvement was gained in very short time and despite the constraint
of having to use only SMPs for parallelization.
Authors: Smitha K.P, Aditi Sahasrabudhe, Vinay Vaidya
Method of Extracting Parallelization in VeryLarge Applications through Automated Tooland Iterative Manual Intervention
Published in Parallel and Distributed Processing Techniques (PDPTA) 2014, Las Vegas and Applications
Enhanced Automated Data DependencyAnalysis for Functionally CorrectParallel CodeAuthors: Prasad Pawar, Pramit Mehta, Naveen Boggarapu, and Léo Grange
Published in Parallel and Distributed Processing Techniques and Applications (PDPTA) 2014, Las Vegas
Abstract:
There is a growing interest in the migration of legacy sequential applications to multicore hardware
while ensuring functional correctness powered by automatic parallelization tools. OpenMP eases
the loop parallelization process, but the functional correctness of parallelized code is not ensured.
We present a methodology to automatically analyze and prepare OpenMP constructs for
automatic parallelization, guaranteeing functional correctness while benefitting from multicore
hardware capabilities. We also present a framework for procedural analysis, and emphasize the
implementation aspects of this methodology. Additionally, we cover some of the imperative
enhancements to existing dependency analysis tests, like handling of unknown loop bounds. This
method was used to parallelize in Advance Driver Assistance System (ADAS) module for Lane
Departure Warning System (LDWS), which resulted in a 300% performance increase while
maintaining functional equivalence on an ARM™ based SOC.
GMM Based Approach for Human FaceVerification using Relative Depth FeaturesAuthors: Ankita Jain, Krishnan Kutty and Suresh Yerva
Abstract:
Image based face detection and verification is a well-researched topic and has found many
applications. The limitation with image based techniques is that 3D real world objects are mapped on to
a 2D plane, which causes loss of 3D features of real objects. Multiple systems are available in literature
to estimate depths of objects viz. stereo images, 3D and 2.5D scanners, etc. However, these systems
require additional hardware and are expensive. In the proposed approach, a colored pattern of light
scans the face and video is captured simultaneously using an optical camera. The colored pattern is
detected in every frame using Gaussian Mixture Model (GMM). Due to non-planarity of the face, the
pattern gets distorted. Distortion in the pattern is calculated to obtain 3D information. In this way, 3D
information of face is derived from the 2D frames. Based on this data, 3D or depth features are
calculated, which are then used for face verification. This approach is robust and handles variations
due to light and insignificant periodic changes of background objects.
A New Approach forRemoving Haze from ImagesAuthors: Vinuchackravarthy Senthamilarasu,
Anusha Baskaran, Krishnan Kutty
Abstract:
The presence of suspended particles like haze, fog, mist, smoke and dust in the atmosphere
deteriorates quality of captured image. It is of paramount importance to reduce these deteriorating
effects from the image for various image based applications; viz. ADAS, CCTV surveillance, etc. In this
paper, this interesting problem of enhancing the perceptual visibility of an image that is degraded by
atmospheric haze is addressed. An efficient way of estimating the transmission map and the
atmospheric light is proposed, which is further used in reducing effects of haze from the image. The
underlying idea is to restore the true color of each pixel by using our proposed method that minimizes
the lowest of RGB values per pixel. This is accomplished using the HSV color space and the haze
image model. In comparison with the other state of the art methods that are available in literature, the
proposed method is shown to be capable of recovering better haze-free images both in terms of visual
perception and quantitative evaluation.
Published in International Conference on Advances in Computing,Communications and Informatics (ICACCI-2013)
Published in International Conference on Image Processing, Computer Vision,and Pattern Recognition (IPCV-2014)
Innovation for customers
About KPIT Technologies Limited
About CREST
Invitation to Write Articles
Format of the Articles
KPIT is a trusted global IT consulting & product engineering partner focused
on co-innovating domain intensive technology solutions. We help
customers globalize their process and systems efficiently through a
unique blend of domain-intensive technology and process expertise. As
leaders in our space, we are singularly focused on co-creating technology
products and solutions to help our customers become efficient,
integrated, and innovative manufacturing enterprises. We have filed for
51 patents in the areas of Automotive Technology, Hybrid Vehicles, High
Performance Computing, Driver Safety Systems, Battery Management
System, and Semiconductors.
Center for Research in Engineering Sciences and Technology (CREST) is
focused on innovation, technology, research and development in
emerging technologies. Our vision is to build KPIT as the global leader in
selected technologies of interest, to enable free exchange of ideas, and to
create an atmosphere of innovation throughout the company. CREST is
recognized and approved R & D Center by the Dept. of Scientific and
Industrial Research, India. This journal is an endeavor to bring you the
latest in scientific research and technology.
Our forthcoming issue, to be released in October 2014, will be based on “Ubiquitous Computing”. We invite you to share your knowledge by contributing to this journal.
Your original articles should be based on the central theme of “Ubiquitous Computing”. The length of the articles should be between 1200 to 1500 words. Appropriate references should be included at the end of the articles. All the pictures should be from public domain and of high resolution. Please include a brief write-up and a photograph of yourself along with the article. The last date for submission of articles for the next issue is August 28, 2014.
To send in your contributions, please write to
To know more about us, log on to www.kpit.com.
A solution from KPIT saves that trouble
Start on a long drive and the first thing the occupants will do is to adjust their seat position. It can get painful at times if you want to keep switching seats through the journey. KPIT has built a solution that adjusts the seat positions automatically by sensing the height and knee position of the occupants. A dedicated team worked on varied aspects of Electronics, Mechanical, Hardware and Software to arrive at the best technology solution to be implemented. KPIT took the germ of an idea from its client and we helped our client build a prototype that can become an important feature to improve car sales.
Technology flavors: Engineering Design, Seating systems, Sensor technology
© KPIT Technologies Ltd. Images and trademarks used in KAFÉ are a property of respective owners. Flavors served in KAFÉ are illustrative only and may change in full service.
What if Dominic Toretto had to
adjust his seat each time before
racing in Fast & Furious?
Want to say something? Give a buzz!
For private circulation only.
TechTalk@KPIT July - September 2014
35 & 36, Rajiv Gandhi Infotech Park, Phase - 1, MIDC, Hinjewadi, Pune - 411 057, India.
Data is worthless if you don't communicate it.
Thomas H. Davenport
Born on October 17, 1954