VOL. 7, ISSUE 3, JULY - SEPT 2014
DATA ANALYTICS
Data Analytics Toolbox
Role of Mathematics in Analytics
Let's Make Data Talk
Extracting Big Money from Small Money
Data Analysis for Warranty Claims
Analytics Applied to Real World
Colophon
TechTalk@KPIT is a quarterly journal of Science and Technology published by KPIT Technologies Limited, Pune, India.
Guest Editorial
Dr. Anjali Kshirsagar, Director, Centre for Modeling and Simulation, University of Pune, Pune, India
Chief Editor
Dr. Vinay G. Vaidya, CTO, KPIT Technologies Limited, Pune, India
Editorial and Review Committee
Aditi Sahasrabudhe, Chaitanya Rajguru, Pranjali Modak, Priti Ranadive, Shiva Ghose
Designed and Published by
Mind's Eye Communication, Pune, India. Contact: 9673005089
Suggestions and Feedback
[email protected]
Disclaimer
The individual authors are solely responsible for infringement, if any. All views expressed in the articles are those of the individual authors and neither the company nor the editorial board either agrees or disagrees. The information presented here is only for giving an overview of the topic.
For Private Circulation Only
Contents
Editorials
Guest Editorial, Dr. Anjali Kshirsagar  2
Editorial, Dr. Vinay Vaidya  3
Articles
Let's Make Data Talk, Mayurika Chatterjee  4
Data Analytics Toolbox, Ramraju Indukuri  10
Role of Mathematics in Analytics, Sushant Hingane  16
Extracting Big Money from Small Money, Shiva Ghose  22
Data Analysis for Warranty Claims, Vaishali Patil  28
Analytics Applied to Real World, Abhinav Khare  36
Book Review
Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie and Die (Eric Siegel), Aditi Sahasrabudhe  35
Scientist Profile
Thomas H. Davenport, Smitha K. P.  41
Research Publications  42
Guest Editorial
Data analytics is the science of examining and analyzing raw data for the purpose of drawing conclusions and possible predictions about the underlying information. Although "big data" is not new to the scientific community, looking at its applications in weather forecasting or the astronomical and astrophysical sciences, "big data" now seems to be a reality because of its presence in the everyday life of the common man as well. The Internet and its usage have opened up new avenues for big data analytics. The share market and oil exploration are two areas which extensively require big data analytics and are close to the heart of the common man due to their influence on everyday life. The financial sector depends heavily on data analytics tools to predict market trends, to optimize warranty and insurance claims in the real world, and to mobilize customers. As a common everyday example of the application of data analysis, I wish to quote that in some countries insurance companies levy higher insurance premiums for red colored cars, since data predictions have indicated that those who buy these cars are aggressive drivers and are more prone to accidents. One of the articles in this issue quotes the example of getting reliable information from data on the fly to keep up with customers no matter how often they change, to predict customer response to various attractive offers of the dealers, and to strike the right balance between analytics and expert judgment. Business houses rely on striking this balance for diversification of their business. Energy companies dealing with oil and gas are increasingly diverting part of their resources towards non-conventional energy, depending to a large extent on their business analytics and intelligence teams.
Analysis of data is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, drawing conclusions and hence suggesting predictions to support decision making. It converts both structured and unstructured data into useful information and finally enables one to gain knowledge. It also involves finding patterns in heterogeneous sources of data. There is a dire need for better handling of available data and for good modeling tools with a wide choice of analysis algorithms for both structured and unstructured data types. Data analysis has multiple facets; it encompasses diverse techniques and is useful in different business, science, and social science domains. It is said that big data is the fuel, while predictive analysis is the engine driven by this fuel to drive the life of society.
The volume, variety and velocity of data coming into an organization continue to reach unprecedented levels. Advanced statistical, data mining and machine learning algorithms do exist; however, their usefulness has increased manifold due to the availability of higher computational power, cheaper memory at both the processing and storage levels, and the awareness to discover and deploy the knowledge thus gained for the profit of the organization, be it at the business or individual level. Cloud computing is a boon to organizations using big data analytics. In addition, modern telecommunication tools have become cheaper, so that data analytics does not add additional cost for customers and clients.
Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Business analytics and intelligence covers data analysis that relies heavily on aggregation, focusing on business information for commercial applications. Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The term data analysis is sometimes used as a synonym for data modeling.
Theoretical models used in scientific research are often difficult to compare directly with the results of experiments, so they are used instead as input for Monte Carlo simulation to predict the response of the experimental set-up to a given theoretical event, thus producing simulated events which are then compared to experimental data. Calibration and validation of theoretical models form an important part of computational research in the physical, chemical and biological sciences. Stability of results, cross-validation and sensitivity analysis are very important for any statistical modeling tool. A deterministic approach, on the other hand, depends on setting up a model in terms of differential equations and the initial or boundary conditions. Non-linear problems involve an additional, crucial degree of dependence on the initial conditions.
Organizations are increasingly becoming analytics-driven to achieve breakthrough results and outperform their peers. However, it is important (i) to optimize all types of decisions, whether taken by individuals or embedded in automated systems, using insights based on analytics, (ii) to provide insights from all perspectives and time horizons, from historic reporting to real-time analysis with predictive modeling, and (iii) to improve business outcomes and manage risk by empowering people in all roles to explore and interact with information and deliver insights to others.
This special issue of TechTalk on Data Analytics presents articles on mathematical and statistical techniques used to deal with big data and is a timely publication from various perspectives.
Professor of Physics, Director of Centre for Modeling and Simulation, Director of Interdisciplinary School of Scientific Computing, University of Pune, Pune, India
Dr. Anjali Kshirsagar
Dr. Anjali Kshirsagar is one of the pioneers to start research in Computational Sciences in the western part of the country in the late seventies and is responsible for developing the culture of high performance computing in scientific research.
Dr. Vinay G. Vaidya, CTO, KPIT Technologies Limited, Pune, India
Editorial
Please send your feedback to: [email protected]
One Exabyte or one million terabytes is the amount of data that is generated in the world every year. This
includes data from social sites, blogs, online transactions, news, pictures, mails, commercial websites, etc.
Then comes scientific data, scientific journals, financial data, and so on. Today, we have the ability to gather as
much data as we wish in every field. Our understanding has greatly improved in all scientific fields due to the analysis of collected data, which has helped scientists interpret observations, formulate mathematical models, make predictions, and validate them for making further interpretations and predictions.
Data analytics has benefitted us in multiple ways. Today our ability to predict earthquakes and volcanic eruptions has drastically improved, making it unlikely that we will see a repetition of unpredicted eruptions such as the one that took place at Pompeii. Our understanding of cosmology has vastly improved in the past 100 years, thanks to the data we get from radio telescopes as well as the Hubble telescope. The Square Kilometer
Array (SKA), the largest radio telescope under construction, will search the sky with the ability to detect airport
radars in solar systems that are 50 light years away. The SKA will generate the same amount of data that we
generate every year in the world. The difference is the SKA will generate one Exabyte of data in just one day!
We certainly need a much better ability to process such data.
In the field of transportation we are beginning to collect more data. This data would enable us to diagnose
automotive systems better thereby helping us build robust systems. Models developed using this data would
enable us in building better tools for prognosis. Overall, data will change user experience in the transportation
industry. However, before we get to that level of accurate prognosis, we need to improve our understanding of
mathematics.
Data analytics is all about mathematics. There are a number of mathematical methods available to extract knowledge out of data. However, it is an art to use the right set of equations to get what one is looking for. The process of converting data into an appropriate mathematical model is a black art, and I do not see this process ever getting automated. Today, there are many teams working all over the world trying to come up with a model for their data sets. Not everyone succeeds. Besides, not all teams arrive at the same set of equations in the end. There are multiple paths to reach the goal of prediction. Success or failure is measured by the amount of error in prediction. There are also common pitfalls in the methodology deployed to tackle the problem.
Often I come across people who confuse deterministic and non-deterministic data sets. A deterministic phenomenon can be defined by specific equations, and one would never observe any deviation from those equations. Ohm's law is deterministic while Brownian motion is non-deterministic. Mathematical methods for deterministic phenomena are different from those for non-deterministic ones. Once an engineer told me that he was using the Monte Carlo method for determining the state of charge of a battery. This is a classic example of using the wrong method to try to find answers. Obviously the end result would be garbage in, garbage out.
In the endeavor of better understanding, we should remember that data is meaningless without correct interpretation. Interpretation is short lived without the formulation of a mathematical model. A mathematical model is mere toying with equations unless it has good predictive ability. Prediction is gambling without validation. Validation is endorsement and the final stage in the pursuit of knowledge.
Mayurika Chatterjee
Areas of Interest: Mechatronics and Control Systems
About the Author
Let's Make Data Talk
I. Introduction
“Information is the oil of the 21st century, and analytics is the combustion engine”
- Peter Sondergaard of the Gartner Group
One may wonder, why data analytics! Well, over recent years it has become easier to collect and store huge amounts of data. Data is perceived differently by different people: data for a computer engineer will be a set of numbers, but data for a medical practitioner will be some kind of medical history. The aim is to convert this huge data into information and then convert this information into knowledge. Such knowledge gives an organization the power to make educated decisions about its future. Data analytics can be applied to a wide range of domains; it can help business analysts make informed business decisions or aid medical researchers. It can also help political analysts to describe a political phenomenon or even to test theories and hypotheses about political scenarios.
Let us start with what we mean by data. According to [1], “Data are values of qualitative or quantitative variables, belonging to a set of items”. Variables are measurements or characteristics of an item.
When we obtain data, we tend to look out for
these variables for the purpose of further
analysis. Mostly, the data we collect looks like
the one shown in Fig 1.
Figure 1 : Raw Data [Ref.2]
Raw data is not pretty! It needs processing and thorough understanding in order to, first, make sense of it, and second, be able to do something with it, e.g. predict future trends.
Next in line is what this article is about: 'Data Analytics'. In the simplest language, it is about understanding what is going on. There can be two possibilities: one, that we are in a situation where we do not have enough data and we have to go through different sources such as books and the web in order to find information; and two, that we are overwhelmed with a huge amount of data and need to figure out a way to extract meaningful information from it. Note that in both cases one thing is common: we are trying to find the answer to a question. This is one of the important aspects of data analytics, which will become clear as you go through the article.
This article introduces the term data analytics, its various categories and the basic steps involved in the process.
II. Types of Analytics
Before coming to the types of analytics, let us first learn a few related terms. The most commonly used term in the business domain is business intelligence. It comprises the different tools and techniques that can be applied to analyze data and help organizations make better business decisions [3]. Business intelligence can be further subcategorized into data analytics and data mining. Data mining is about looking for patterns in data stored in databases in order to discover knowledge. Data analytics does not care about databases, but concentrates on the specific patterns in the data as part of enhancing knowledge [4].
In this article, we are concentrating on the analytics part of it. There are some basic categories of data analytics; let us take an example and try to understand the differences amongst them.
Let us say that my job is to find out the demand for vehicle 'X' among Indian people. I need to analyze the data gathered across the country to find out the sales of vehicle 'X'. After going through some reports, I find out that Pune city recorded the maximum sales of vehicle 'X' in the year 2013. This type of analysis, where we examine past occasions, is known as 'Descriptive analytics'.
Figure 2: LED screen with dots representing percentage sale
Figure 3: LED screen showing delivery time at respective cities
The next step is what many people in industry term as data analytics: 'Predictive analysis'. It predicts probable future outcomes. We can create a model involving all the parameters that influence the sales of a car, apply the data to the model, and determine various relationships among them. We can also use the LED screen to play around with some parameters and check the influence of each on the predicted outcome. For example, the per capita income of people in city Y will influence the sale of vehicle X, depending on the number of people who can afford it. All such predictions are probabilistic in nature and cannot tell for sure what will happen, but will only tell what might happen based on the data gathered.
This takes us to the last stage, where we integrate our predictive model with real-time data to get the desired results. This is known as 'Prescriptive analysis'. It is basically one step ahead of predictive analysis where, as the name suggests, one or more actions are prescribed to the decision makers, showing them the likely outcome of each decision. It also includes a feedback mechanism which tracks the output of the action taken and re-prescribes an improved decision. In the example above, the output of prescriptive analytics would suggest opening more service stations and improving infrastructure, which will in turn help increase sales of the vehicle.
III. Steps in data analytics
1. Define objective
The first and foremost step is to define the objective, or pose the question that we would like to get an answer to. For most of us, when we talk about data analysis, we think that the data is most important, but the truth is, it is secondary. The data might influence the question, but if we are not clear on the objective, even a vast amount of data won't be able to help us.
2. Define the ideal data set
This step gives clarity on the variables to be measured. For example, if we want to determine whether there is a relationship between the height and weight of the students in a class, it is sensible to measure the quantities using a scale (a weighing scale or a length scale). However, if you want to find some correlation between qualitative variables, for example relating the confidence of students with their academic success, what scale would you use? Compare marks vs. IQ scores? For this particular variable, you might need to conduct a survey among various teachers and find out each student's confidence level subjectively. As Albert Einstein famously said, “Not everything that can be counted counts and not everything that counts can be counted”. Thus, we should know beforehand what kind of data would be necessary for our purpose.
3. Obtain data
Once it is decided what types of data you want for your analysis, the next step is to acquire the data from the available sources. The data gathering can be done in various manners. It could be through quantitative sensors such as accelerometers, pressure sensors, etc., subjective analysis such as the ride comfort of a vehicle, data from online transactions, data in government records, and so on. In many cases, the data obtained could be incomplete, raw and unstructured. Examples of such data are missing records, incomplete forms, data captured using various sensors at different locations, etc. We need to organize it to make sense of it. This constitutes the next step.
Now suppose I have an LED screen in my
office and I have red dots to represent the
percentage sale of vehicle 'X' across the
country. Also, I can see the delivery time of the
vehicle X at respective cities. If I compare
these two, in an instant I can find out
correlation between the percentage of sale
and delivery time of the vehicle X. This
category of analytics is termed 'Exploratory analysis'. It mostly consists of graphical representation of the data and involves finding correlations between various parameters for decision making. Figures 2 and 3 help in visualizing the above example of the LED screen scenario.
Figure 4: Steps involved in Cleaning of data Ref. [5]
5. Analyze
The next step is to explore the data: create graphs to understand it, create clusters of similar data, and so on. The economist Ronald Coase said, “If you torture the data long enough, it will confess”. Analyzing data is of course a science, but it is an art as well. There can be loopholes in your data, and you might get distracted by other interesting insights. It is essential that one understands the purpose of the analytics in order to get the answers. You
Figure 5: Some analysis techniques Ref. [6]
6. Mathematical Modeling
This is the main step wherein the domain
expertise comes into play. A model of the
system is created which depicts the
relationship amongst variables. This model
can then be used to predict the future
variables, which will be useful to make
recommendations. This is the part where
techniques such as machine learning,
clustering of data etc. come into the picture. Mostly, this part comes under data analytics;
the previous steps can be regarded as part of
data analysis. Let us see an example of a
model that will predict the sale of a vehicle
based on different parameters across
geography.
Figure 6: Decision making model
4. Cleaning of data
Tidying up the data is a major task in order to
successfully and correctly analyze the data.
We might have multiple data sources and we
might need to augment the data collected
based on these sources. Even with the best analysis techniques, erroneous or junk data can pull the analysis in the wrong direction and subsequently produce misleading results. In order to yield value out of the data, it is essential that the data that matters is retained. The various steps involved in data cleaning are shown in figure 4. If the data is gathered from multiple types of resources, they need to be merged. If some of the data is missing, it has to be built up by mathematical techniques like interpolation, extrapolation, etc. The data also needs to be normalized and duplications need to be removed. Once the data is ready, it can be used further for analytics purposes.
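As a small illustration of these cleaning operations, here is a minimal R sketch; the data frames source_a and source_b and the column 'units' are hypothetical names used only for illustration.
sales <- rbind(source_a, source_b)               # merge data gathered from two sources
sales <- sales[!duplicated(sales), ]             # remove duplicate records
gaps  <- is.na(sales$units)                      # locate the missing values
sales$units[gaps] <- approx(which(!gaps), sales$units[!gaps], which(gaps))$y   # fill them by linear interpolation
sales$units <- scale(sales$units)                # normalize the cleaned column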
need to be able to realize when you might
require more data points to support or discard
your analysis. In fact, you may never know
when you may discover some very
unexpected and useful insights during your
course of analysis. Various tools and techniques can be used to analyze the data at hand. Initial analysis can be performed using techniques such as multivariate analysis, linear regression, etc. Some analysis techniques are shown in figure 5. We will discuss some of these techniques in the next couple of articles.
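As a small taste of such initial analysis, a few lines of R already go a long way; the data frame 'vehicle_data' and its columns are hypothetical names assumed for illustration.
summary(vehicle_data)                                  # basic descriptive statistics per column
pairs(vehicle_data)                                    # scatter plots of every pair of variables
cor(vehicle_data$delivery_time, vehicle_data$sales)    # correlation between two parameters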
The parameters that influence vehicle X sales can be current trends, the infrastructure available and the economic situation of the region. Another factor that might influence the decision of the customer is government regulation; for example, if the government decides that electric or hybrid car buyers will get some tax benefits, people might opt for those. Thus, by taking these factors into consideration, we can generate a model of the system based on the gathered knowledge using techniques such as linear regression, MANOVA, etc. Then we try to get answers to our questions, in this case, figuring out how each parameter will affect the sales of vehicle X.
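A minimal sketch of how such a model could be fitted in R with linear regression; the data frame 'regions' and its columns (per_capita_income, infrastructure_index, tax_benefit, sales) are hypothetical names, not from the article.
model <- lm(sales ~ per_capita_income + infrastructure_index + tax_benefit, data = regions)
summary(model)                            # how strongly each parameter affects sales
predict(model, newdata = new_region)      # expected sales for a new, hypothetical region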
7. Interpret and optimize results recursively
Once you get the results, you must be able to find the difference between what you expected from the data and what you see in the results. The results might be true for one set of
V. References
1. “Data”, Wikipedia, available at http://en.wikipedia.org/wiki/Data
2. Image available at http://www.ncon.com/images/rawdata.gif
3. Fari Payandeh, “BI vs. Big Data vs. Data Analytics By Example”, article in Foreground Analytics—Big Data Studio
4. Coursera course on “Core Concepts in Data Analysis”
5. Image available at http://www.datacleansing.net.au/Images/Data_Cleansing_Cycle_350px.jpg
6. Image available at http://www.rnbresearch.com/gifs/data-analysis-services.gif
data but they might vary in the other; it is essential that the variables are optimized regularly based on the most recent data.
IV. Conclusion
Data analytics is about being able to push through the many difficulties we face when dealing with large or messy data: collecting the data, cleaning it up and coming up with various analysis techniques that extract new information from the data. It is a quest to discover new insights from past data. It is a combination of functional expertise, traditional research, knowledge of mathematics and statistics, and machine learning. It also requires a knack to explore different sources and figure out the answer by any means possible. In conclusion, data analytics acts like the eyes and ears of an organization that wishes to support and enhance its decision-making ability using data.
Humor and Statistics
If you live to be one hundred, you've done a wonderful job. Very few people die past that age.
Two statisticians were traveling in an airplane from New York to LA. After an hour, the pilot announced that they had lost an engine, but there were three left; however, instead of 5 hours it would take 7 hours to get there. A little later, he announced that a second engine had failed and they still had two left, but it would take 10 hours to get there. Somewhat later, the pilot again came on the intercom and announced that a third engine had died. Never fear, he announced, because the plane could fly on a single engine. However, it would now take 18 hours to get there. At this point, one statistician turned to the other and said, “Gee, I hope we don't lose that last engine, or we'll be up here forever!”
Statistics plays an important role in genetics. For instance, statisticians can prove that the number of offspring is an inherited trait. If your parents didn't have any kids, odds are that you won't have any either.
It is proven that the celebration of birthdays is healthy. Statistics shows that those people who celebrate the most
birthdays become the oldest.
One day there was a fire in a wastebasket in the office of the Director of Sciences. A chemist, a physicist and a
statistician rushed in. The chemist works on which chemical agent would have to be added to the fire to prevent
oxidation. The physicist also starts to work on how much energy would have to be removed from the fire to stop the
combustion. While they are doing this, the statistician starts setting all other wastebaskets on fire. "What are you
doing?" others ask. The statistician replies, "Well, you definitely need a larger sample size to solve the problem."
Ramaraju Indukuri
Areas of Interest: Programming Languages and Data Mining
About the Author
Data Analytics Toolbox
I. Introduction
Quick googling will give you a whole lot of information on different analytics platforms. Hence, instead of going into specific tools and comparisons, let us discuss what analytics is and use the most popular open source analytics tool, R, to demonstrate some of the key elements of analytics.
To briefly mention, the most popular analytics tools are MATLAB, SAS, SPSS, and R. These are mostly used as desktop packages by customers, with their capacity limited by RAM, though the vendors have been coming up with server versions with features that make them linearly scalable (add more computers to improve performance). These packages are essentially mathematical or statistical packages and are mostly used by engineers or statisticians.
However, lately, with the explosion of big data, platforms like Mahout on Hadoop have been increasing in popularity. These require a fair amount of programming skill and an elaborate setup, which is a big barrier for the traditional engineering and statistical community. The term data scientist, in my view, came up because these packages require not just statistics and machine learning skills, but also an understanding of big data platforms and domain knowledge.
A quick note on big data platforms: they free companies from having to sample data and depend heavily on wide confidence intervals. One customer from a large bank in the US, who manages its target marketing team, mentioned that their accuracy improved by 25% when they moved from standard statistical analysis to machine learning on big data platforms. It is noteworthy that she also said that, as a business user, she was highly skeptical about big data platforms in the beginning. However, once she saw the results, she became a big supporter of their Hadoop project.
As mentioned earlier, the skills and effort required to do analytics using big data are significant. The current Hadoop platform requires elaborate programming to accomplish the same thing that an R package can do in a few lines of code. It is expected that over the next few years there will be significant improvement in programming frameworks on big data that will enable customers to perform analytics much more quickly. One noteworthy mention in this direction is "Spark", which uses the "Scala" programming language to simplify MapReduce code (a programming model that parallelizes computational tasks over several servers), though it introduces a new programming language which traditional customers may not want to adopt given the scarcity of people.
R is an open source package that has been taking over the field of data analytics. Primary reasons are its ease of use, rich packages that abstract users away from underlying mathematical complexity and its zero cost. The key downside for R is its dependence on RAM of the machine, which restricts the size of data set it can handle. However, there are several packages developed to parallelize R and leverage linearly scalable platforms like Hadoop.
Before we jump into the problems customers solve using analytics, let us understand the meaning of analytics. There are three levels in data analytics: Data Visualization, Statistical Analysis and Machine Learning.
II. Data Visualization
Using charts and smart graphics, visually present the data so that a human eye can recognize patterns in it. Most of the time, visualization will guide us to conduct deeper analysis. Most companies have heavily invested in data visualization using their Business Intelligence systems and use these systems to gain understanding of their corporate data. As you all know, bar graphs in Excel are the most widely used 'analytics', done by most of us on a daily basis.
However, with the advent of various tools and technologies, visualization has advanced a lot since the bar chart. Word clouds (see figure 1), interactive network diagrams (A diagram with network of nodes and edges which depict relationships, for example a social network), heat maps (A diagram that shows area density depicted in shades of colors) have become more popular and are getting into mainstream. There is more and more emphasis by tool vendors in incorporating infographics into their packages.
Figure 1: Data cloud
Figure 1 illustrates a word cloud of President
Obama's 2009 speech. It quickly shows the
words he used frequently in his speech. This
can be accomplished using following script in
R.
library(wordcloud)   # the wordcloud package provides wordcloud()
wordcloud(INPUTTEXT, min.freq = WORD_FREQUENCY, random.order = FALSE)
As you can see, building a word cloud is fairly
simple, since R has a pre-built package. This
is true for most algorithms.
The key challenge that you will quickly notice
when you try to analyze data is the data
preparation stage. It is a well-known fact that
80% of the effort in analytics is to massage the
data in a format acceptable to whichever tool
you use. Vendors like Oracle have integrated
their data mining tools with the Extract,
Transform, Load (ETL) tools, to simplify the
process of preparing the data. However, the
challenge is that the data scientists (or R
programmers) typically are not interested in
deploying another tool set and tend to use the
packages they are already conversant with.
This is a large opportunity for IT companies in
the coming years to help customers prepare
the data that data scientists can easily
process.
There are several other data visualization platforms. Within R, the ggplot package is the most popular for its versatility and ease of use in generating all standard statistical plots. The beauty of ggplot is its ability to generate a graph over several layers, which significantly de-clutters the process of developing charts.
Figure 2: Example of graph generated by ggplot
The graph shown in figure 2 can be produced by a simple layered statement, where 'stateincomes' is a table of states and their incomes along with the latitude and longitude values of each state.
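The original statement did not survive reproduction here; a minimal ggplot2 sketch of what such a layered statement might look like, assuming 'stateincomes' has columns longitude, latitude and income, is:
library(ggplot2)
ggplot(stateincomes, aes(x = longitude, y = latitude)) +   # first layer: the coordinates
  geom_point(aes(size = income, colour = income)) +        # second layer: points sized and coloured by income
  labs(title = "State incomes")                            # third layer: labels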
Figure 3. Example of Document Object Model
Visualization will not be complete unless we touch upon d3.js, the latest toolkit in the area of data visualization. Companies like Datameer have created stunning visualizations on top of big data platforms using d3.js. d3's power lies in a combination of simplified use of the DOM (Document Object Model), Scalable Vector Graphics and CSS (Cascading Style Sheets).
The Document Object Model is a model that represents structured documents like HTML and XML (as shown in figure 3). It lets developers access the elements within a document such as an HTML page. For example, in the JavaScript included in the page, to change the color of the text of an element of class var1, you can use the DOM in the following way:
document.getElementsByClassName('var1')[0].style.color = "#0000ff";
Whereas if you include the d3.js library in the head, you can do the same as follows:
d3.select('.var1').style('color', 'blue');
As you can see, d3 makes navigating the HTML document much more elegant and simple. Now, SVGs, or Scalable Vector Graphics, are diagrams that can be drawn with JavaScript and do not require GIF or bitmap files. This allows developers to create nice graphics without having to tax the network with picture downloads. For example, to create a circle, one can use the following:
d3.select('.var1').append('svg').append('circle').attr({cx: 10, cy: 10, r: 10, fill: 'green'});
D3 has functionality that enables reading data
(JSONs or arrays) and using it to generate
charts and graphs that can be impactful for the
user, without compromising on system
performance.
III. Statistical Analysis
Companies have been using statistics to sample data and derive meaningful conclusions for ages. Packages like SAS and SPSS have functionality that simplifies statistical analysis for business users and engineers. While the most popular statistical analysis tool is Excel, there is a dizzying array of tools available for industry to use.
To illustrate what kind of analysis can be done using statistics, take the example of a car failing due to a brake pad problem. The graph in figure 4 calculates and illustrates the survival probability over time for three different models of engines. (Note that the curve stabilizes at about a 65% survival rate, not because there are no more failures after 720 days, but because the warranty has expired after two years, after which the car manufacturer will not entertain any claims, or because that is the current date.)
Figure 4: Graph generated in R using Shiny
(A Web framework on R).
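A hedged sketch of how such a survival curve could be produced in R with the survival package; the data frame 'claims' and its columns (days_in_service, failed, engine_model) are hypothetical names for illustration.
library(survival)
fit <- survfit(Surv(days_in_service, failed) ~ engine_model, data = claims)   # one survival curve per engine model
plot(fit, xlab = "Days in service", ylab = "Survival probability")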
Statistical analysis is used by different users in
different ways. For example, Oracle Demantra
has several statistical algorithms built into its
demand forecasting functionality. SAS has
Survival (Weibull models) analysis available
for warranty and reliability groups.
Statistical analysis tools can be classified as
shown in figure 5:
Figure 5: Classification of Statistical Tools
As you can see, the options for statistical
analysis and programming are huge. It is very
common that companies invest in more than
one tool to do statistical analysis, since
different tools fit different applications. For
example, SAS promises special features like
Weibull analysis and survival curves in the
area of warranty and is widely used for it.
Whereas SPSS is focused on simpler, user-friendly analysis that typically sits on top of existing application databases.
A notable mention is R. Companies can benefit from the significant number of R-trained programmers and statisticians graduating from universities, and from its collaborative developer community that keeps adding newer packages.
IV. Machine Learning
As per Wikipedia, machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data. In simple terms, based on the available historic data, machine learning algorithms can build models which can be used to predict an outcome.
There are two kinds of machine learning, supervised and unsupervised. For example, if you provide a set of historic customer financial data and 'tag' those who defaulted, a machine learning model can be built which can be applied to a new set of customers to determine whether they are going to default. This is supervised learning. In unsupervised learning, you do not tag, but the algorithm automatically learns and classifies. For example, if I provide customer data, the algorithm can classify it into a set of clusters, say urban young males vs. rural baby boomers etc., which you can interpret after seeing the results.
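A minimal unsupervised-learning sketch in R; the data frame 'customers' and its age and income columns are hypothetical names assumed for illustration.
clusters <- kmeans(scale(customers[, c("age", "income")]), centers = 3)   # find three customer segments
table(clusters$cluster)                                                   # how many customers fall in each segment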
Most of the statistical analysis packages come with machine learning features as well. But there are a few pure-play packages, a notable one being RapidMiner. The advantage of specialist machine learning tools is that they provide an easy way to cleanse and prepare the data so that the machine learning algorithms can readily accept it. The specialist tools can also help determine accuracy with ease and thus help users improve the framing of the problem.
The other notable platform for machine learning is Rattle, a visual GUI that runs on top of R and is written in R. It provides a basic but quick way to explore smaller datasets and build models.
However, these tools usually do a decent job until the data reaches a certain size, beyond which they are not scalable. Apache Mahout, an open source big data machine learning module, can scale the complex algorithms that cover the most commonly used situations, both supervised and unsupervised, with linear scalability, i.e. practically unlimited data capacity. Companies can use tools like R and RapidMiner to prove the algorithms and then deploy them onto big data platforms as an ongoing standard solution. Companies have already started using PMML (Predictive Model Markup Language) to create redeployable models, and these are gaining in popularity. (Visit http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language for more details.)
V. Other Notables to Watch
“Spark” is a component of UC Berkeley's big data stack which is rapidly gaining popularity. Its power lies in leveraging memory (using the concept of the RDD) instead of depending on the hard disk. An RDD, or resilient distributed dataset, is a memory abstraction provided by Spark, wherein a large amount of data is stored in memory across multiple servers and can even spill over to disk if there is insufficient RAM. This makes Spark particularly suitable for iterative algorithms and interactive data. Spark has proven to be multiple times faster than standard Hadoop implementations and yet can run on commodity hardware and provide linear scalability. Another advantage of Spark is that it provides a consistent programming platform using Scala, Java and Python, though Scala fits best. Spark can be deployed on top of Hadoop, leveraging Hadoop's distributed file system. Spark provides machine learning and streaming functionality that can serve different use cases. The major challenge for traditional companies is skill set availability.
“Mesos” is a resource manager, again built at UC Berkeley, that can distribute data processing work among the servers in clusters running multiple frameworks. Imagine a company running racks of Hadoop or Cassandra as well as Spark and MPI. Before Mesos, companies had to invest in separate racks of servers for each of these frameworks (though it is commodity hardware and is multiple times cheaper than traditional HP, Oracle or IBM servers). With Mesos, companies can combine all the clusters together and use them in a fine-grained manner (at the CPU core and RAM level) based on processing loads.
Figure 6: Mesos along with other architectures
Mesos does this by providing an application programming interface (API) to integrate frameworks and by implementing a scheduler on the frameworks. Mesos slaves installed on all the servers talk to the master and 'offer' CPUs and memory, to which Mesos responds by allocating tasks to them after talking to the frameworks and getting the tasks from them. As you can imagine, this optimizes the utilization of servers tremendously and provides high fault tolerance. Hadoop 2 (YARN) has implemented this mechanism. Mesos has the special advantage of easy integration, with multiple frameworks already adopting it.
VI. Conclusion
The area of analytics is fast evolving. You will find that most of these tools may change or evolve in the next 2 to 5 years. However, what is constant is the skill required to understand and use the data; this is going to remain the same. The good news is that new platform vendors are coming out with tools that are easy to use, provide support for big data and provide machine learning capabilities. Hence, customers should consider these new innovative toolsets instead of going with established providers. KPIT can help you with making these choices and help support those platforms.
VII. Bibliography
[1] http://cran.r-project.org/
[2] http://d3js.org/
[3] http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
[4] https://spark.apache.org/docs/latest/index.html
[5] http://rapidminer.com/
[6] http://mesos.apache.org/
[7] http://www.datameer.com
Sushant Hingane
Areas of Interest: Mathematical Modeling and Simulation, Control Systems
About the Author
Role of Mathematics in Analytics
II. Categories of Mathematical Models
A. Linear vs. Nonlinear
In simplified words, if, for a system, all the objective functions and the constraints can be expressed in terms of linear equations, it is considered a linear model; otherwise, it is a nonlinear model. The linear model can be written as Y = XB + U, where Y is the system output; X is the design matrix or the system input, which is the independent parameter; U is the matrix containing errors or noise; and B is the matrix containing the internal parameters of the system. Once the B matrix is calculated or estimated, one can interpolate missing data between two data instances.
Figure 1 : Straight line fitting [2]
For 2-dimensional data, the first step is to plot the data points on the X-Y plane. This gives a fair idea of whether the system is linear or nonlinear; it is the simplest method to figure out whether a straight line or a curve could fit the points.
Polynomial fitting is a very common and simplistic type of data analysis. The objective is to identify the degree or the order of the system equation and then find the system parameters. The image below shows general patterns of some data plots in 2 dimensions. (Courtesy [3])
Radioactive decay of an element can be expressed as an exponential model, N = N_o e^(-λt), where λ is the decay constant, which can be calculated using several data points. N_o and N are the initial radioactive mass and the mass at time t, respectively. The half-life of a radioactive element, in terms of λ, can be predicted by equating N to half of the initial mass N_o.
B. Continuous vs. Discrete
A continuous model allows the system state to change at any time, unlike a discrete system, where the states change at specific intervals. The important analysis in this type of modeling is time series analysis, where data is constantly received at a fixed time interval. The purpose is to yield meaningful statistics and other characteristics of the data, as well as time series forecasting to predict future values.
Time series models can be classified into three broad categories;
a) Autoregressive model (AR) where the value of the output variable linearly depends on its previous value.
b) Integrated models (I) that include a differencing operation using a back-shift
I. Introduction
A mathematical model is a representation of any real world problem or system in a simplified or abstracted equation form, in order to analyse the traces and behaviour of the system and to predict future values. This turns out to be a blessed miracle when we have huge chunks of data, or 'big data', going in and out of a system. The challenge lies in finding the exact formula for a better correlation, or a minimal-error fit for the concerned data. Big data could be noisy, unstructured, dynamic, or even incomplete. A model should make use of the data in the form of multidimensional vectors, time series of samples, or huge matrices. Considering this, sometimes it is wiser to use ready-made models that have been proven to fit instead of scratching your head all day! The right choice of model for the data is very important, since there can potentially be many models that would fit the data. However, depending on the data availability and error tolerance level, one can make an informed decision. In this article we are going to see how.
Modeling of a given system can in general be represented as Y = f(X, B), where Y is the output, X is the independent variable and B is the system parameter. Most of the time, the mathematical modeling process boils down to estimating the value(s) of B. Let us discuss some of the techniques that are used in real world scenarios.
C. Probabilistic (Stochastic) vs. Deterministic
Deterministic models describe system dynamics or system evolution with no randomness involved. Every set of state variables can be uniquely determined by the model parameters and the previous states. Stochastic processes, on the other hand, take care of the randomness of the system using
One of the examples of a discrete system is the data coming from a sensor with digital output. An accelerometer, for example, that gives acceleration values (a) with a sampling period of 1 s can be used to estimate the velocity (v) and the distance (d) as well. Using the differential equations dv/dt = a and dd/dt = v, the formulation for velocity and distance in discrete time becomes
v_current = a_current * t_sampling_interval + v_previous
d_current = v_current * t_sampling_interval + d_previous
Time (1s sampling) Acceleration Velocity Distance
0 0 0 0
1 2 2 2
2 1 3 5
3 1 4 9
4 0 4 13
5 0 4 17
6 -1 3 20
7 0 3 23
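The same table can be reproduced in a couple of lines of R, since with a 1 s sampling interval the recurrences above are just cumulative sums:
a <- c(0, 2, 1, 1, 0, 0, -1, 0)   # sampled acceleration values
v <- cumsum(a)                    # velocity:  0 2 3 4 4 4 3 3
d <- cumsum(v)                    # distance:  0 2 5 9 13 17 20 23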
A special condition, where the mean or the expected value µ = 0 and the standard deviation σ = 1, is called the standard normal distribution.
Figure 2 : Normal distribution
An interesting example of the normal distribution (also known as the bell curve because of its shape) is the Quincunx machine, also known as the bean machine or Galton box. When several balls are dropped into the machine with several pins as obstructions, the collection of balls at the bottom approximates a normal distribution [5].
operator to eliminate the initial 'non-stationarity' of the data. A stationary time series is the one whose properties do not depend on the time at which the series is observed. E.g. the white noise.
c) The moving average models (MA) which as the name suggests, keeps on averaging a series of data.
Various combinations of the above models produce the models[4]such as, Auto regression integrated moving average process (ARIMA), Seasonal ARIMA process, Fractional ARIMA process (FARIMA), etc.
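In R, for example, such a combined model can be fitted and used for forecasting with the built-in arima() function; x below is assumed to be a hypothetical numeric time series.
fit <- arima(x, order = c(1, 1, 1))   # ARIMA(1,1,1): AR, differencing and MA parts
predict(fit, n.ahead = 10)            # forecast the next 10 values with standard errors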
probabilistic approach. We will put more focus on the probabilistic models in this article. Let us discuss some of the concepts involved in stochastic system modeling process.
Probability Distribution Function
The probability distribution function of a discrete random variable is the set of probabilities associated with each of its possible values. There are some practical distribution functions such as the normal or Gaussian distribution, the binomial distribution, the Poisson distribution, etc. Let's take the example of the normal or Gaussian distribution, which tells us the probability of a real observation falling within certain limits. The standard deviation σ and the mean or expectation µ contribute to the distribution function as
f(x) = (1 / (σ √(2π))) exp(-(x - µ)² / (2σ²))
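For instance, the probability that an observation from a normal distribution falls within given limits can be read off directly in R (µ = 5 and σ = 1 are purely illustrative values):
pnorm(6, mean = 5, sd = 1) - pnorm(4, mean = 5, sd = 1)   # P(4 < X < 6), roughly 0.683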
Markov Chain
A Markov chain is a discrete process system which is random and memoryless. The transition from one state to the other depends solely on the current state and not on the previous trail of state transitions. Markov chain models prove to be a perfect fit for share market scenarios.
Figure 3 : Markov chain process
In figure 3, the numbers indicate the
probabilities associated with every possible
state transition from one state to the other.
Thus, the state transition matrix can be written down from these probabilities.
Notice that the sum of row elements is equal to
1.0 since it shows the distribution. With the
help of this matrix, one can estimate the
system state from a given initial state. In
simpler terms, in the conditions where we
know the initial state of the system and we
know the number of state transitions that have
occurred, we can guess the current state of
the system with a certain probability
associated with it.
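A small R sketch of this idea, with an illustrative two-state transition matrix (the probabilities are made up; each row sums to 1):
P  <- matrix(c(0.7, 0.3,
               0.4, 0.6), nrow = 2, byrow = TRUE)   # transition probabilities
s0 <- c(1, 0)                                       # start in state 1 with certainty
s3 <- s0 %*% P %*% P %*% P                          # state distribution after three transitions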
Monte-Carlo Simulation
Monte-Carlo simulation is mostly used to calculate the impact of risks and uncertainties in various forecasting models, such as financial, project management, cost, etc. This simulation can tell us how likely the results are, depending on the ranges of estimation. Monte-Carlo simulation comprises several modeling steps. The first is to select the estimation range or domain. The second is to randomly generate data points using probability distribution functions and perform a deterministic computation on them.
In 1908, an Englishman named William Sealy Gosset, under the pseudonym 'Student' (to hide his employment at the Guinness brewery), developed a statistical method called Student's t-test to compare two sets of independently collected data. The t-distribution is a family of curves which, as the number of degrees of freedom increases (the number of samples minus one), approximates the standard normal distribution. This method helps in considering or ruling out the null hypothesis (that is, whether the measured difference is only 'by chance') [6].
Any board game that is played with dice, e.g.
snakes and ladders, is a perfect example of
Markov chain process since the next state is
chosen randomly (with a probability
associated to it) and does not depend on any
previous states.
Google's PageRank algorithm (image courtesy [7]) is also a good example of a Markov chain; it uses a random surfer model to rank the web pages for a user's search request. In general, the PageRank value for any given page u can be expressed as
PR(u) = Σ over v in B_u of PR(v) / L(v),
where B_u is the set of all pages linking to u and L(v) is the number of links from page v.
To simplify the Monte-Carlo method, let us take an example, called the raindrop experiment [8], of computing the value of π with this method. In this, we draw a circle inscribed in a unit square and let raindrops fall freely on it. Green dots are the raindrops inside the circle and red ones are outside the circle but inside the unit square. So, with multiple data points, a Monte-Carlo simulation can be run to find the approximate value of π as
π ≈ 4 × (number of drops inside the circle) / (total number of drops).
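A minimal R version of the raindrop experiment (the number of drops is illustrative):
set.seed(1)
n <- 100000
x <- runif(n); y <- runif(n)                    # raindrops falling on the unit square
inside <- (x - 0.5)^2 + (y - 0.5)^2 <= 0.25     # drops landing inside the inscribed circle
4 * mean(inside)                                # approximates pi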
Figure 4 shows the basic principle behind Monte-Carlo simulation. X1, X2, X3 are the uncertainty models which are randomly generated through probability distributions. The goal is to determine how random variations, lack of knowledge or error affect the sensitivity, performance or reliability of a system. The outcomes Y1, Y2 can be represented as probability distributions, which can also be converted into error bars, reliability predictions, tolerance zones and confidence intervals.
Figure 4 : Monte-Carlo simulation [9]
III. Predicting the Future
Regression Analysis
In regression analysis [10], the output or dependent variable is calculated as a linear combination of the system parameters, Y = β0 + β1·X + e. As we have seen in section I, we try to establish a relation between the measured system output Y and the system input X. β0 and β1 are the parts of the B matrix and will be calculated or estimated through experimentation. Let us assume that b0 and b1 are the estimated parameters calculated using least square approximation, and that the error in calculating the output for the i-th observation is e_i = y_i - (b0 + b1·x_i). We pick the values of b0 and b1 (and so forth) that minimize the sum of the squared errors Σ e_i². Linear regression is a very effective tool for data interpolation, which will give a future estimate within the said error tolerance band.
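A hedged sketch of this in R, assuming a hypothetical data frame 'measurements' with input x and measured output y:
fit <- lm(y ~ x, data = measurements)           # least squares estimates of beta0 and beta1
coef(fit)                                       # the estimated parameters
predict(fit, newdata = data.frame(x = 7.5))     # interpolated estimate for a new input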
Kalman Filter
One of the best examples of data estimation is the Kalman filter [11]. The Kalman filter finds applications in missile tracking, navigation, economics, sensor data fusion, etc. Basically, this technique is used to estimate the unknown states of a system with minimum uncertainty. The word 'filter' suggests that it takes noisy input but does not let the state estimation be unduly affected by it. The technique predicts the unknown parameters of the system with better accuracy: it takes a series of measurements with modeled noise and then recursively takes a weighted average of the predictions.
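The filter equations themselves are not reproduced here; the following is only a minimal scalar sketch in R, assuming the unknown state is (nearly) constant, with illustrative process and measurement noise variances q and r.
kalman1d <- function(z, q = 1e-4, r = 0.25, x0 = 0, p0 = 1) {
  x <- x0; p <- p0; est <- numeric(length(z))
  for (k in seq_along(z)) {
    p <- p + q                  # predict: uncertainty grows by the process noise
    K <- p / (p + r)            # Kalman gain: weight given to the new measurement
    x <- x + K * (z[k] - x)     # update the estimate with the weighted residual
    p <- (1 - K) * p            # update the uncertainty
    est[k] <- x
  }
  est
}
filtered <- kalman1d(noisy_measurements)   # noisy_measurements is a hypothetical numeric vector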
Figure 5 : Kalman filter
IV. Conclusion
Mathematics is a powerful tool to subdue the quantitative and probabilistic components in any data analysis and forecasting problem. We can, with some level of confidence, not be intimidated by the buzzwords such as 'data analytics' or 'big data' and only focus on the essence of them to serve the objective. However, it remains a challenge for the designer to estimate the system behavior with a minimum estimation error. Using currently available techniques, one can foresee the future with some “likelihood” associated with it.
Monte-Carlo simulation models are extensively used in financial data modeling in order to evaluate and analyze financial instruments, portfolios and investments. This helps in risk evaluation, portfolio optimization and financial planning. The method is used to simulate the various sources of uncertainty that affect the portfolio in question, using randomly generated uncertainties (mostly from probability distributions), and to calculate a representative outcome of the analysis.
References[1] Article: “Examples of Financial Applications”
Available: “http://www.lancs.ac.uk/~jamest/Group/finance1.html”
[2] “Linear regression”, Wikipedia,
Available: “http://en.wikipedia.org/wiki/Regression_analysis”
[3] Lecture notes, “Numerical Methods: curve fitting techniques”,
Available: “http://kobus.ca/seminars/ugrad/NM5_curve_s02.pdf”
[4] Article “Introduction to Time Series Analysis”, Gurley
Available: “http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm”
[5] Wikipedia “Bean machine”
Available: “http://en.wikipedia.org/wiki/Bean_machine”
[6] Britannica encyclopedia “Student's t-test”
Available: “http://www.britannica.com/EBchecked/topic/569907/Students-t-test”
[7] Wikipedia, “PageRank”. Available: http://en.wikipedia.org/wiki/PageRank
[8] Lecture notes, “An introduction to Monte Carlo methods”, Ioana A. Cosma
and Ludger Evers, Available: “http://users.aims.ac.za/~ioana/notes-ch2.pdf”
[9] Article, “Monte-Carlo Simulation basics”,
Available: “http://www.vertex42.com/ExcelArticles/mc/MonteCarloSimulation.html”
[10] Wikipedia, “Regression analysis”
Available: “http://en.wikipedia.org/wiki/Regression_analysis”
[11] “Kalman Filter”, Wikipedia
Available: “http://en.wikipedia.org/wiki/Kalman_filter”
Shiva Ghose
Areas of Interest: Control Systems, Machine Learning and Mechatronics
About the Author
Extracting Big Money from Small Money: A Look at Data Analysis for Micro-Financial Systems
Abstract
I. Small lending and big ideas
There has been a lot of debate over whether
microfinance institutions are the poverty
eradicating panacea they claim to be.
Microfinance aims at providing economic
stimulus for rural communities by giving them
access to financial tools and resources. By
doing so, they fill the gaps that mainstream
financial institutions have traditionally left
behind, however, they deal with risks that are
extremely hard to quantify through
conventional methods. This complexity
inadvertently reduces their efficacy in helping
the poor.
This article addresses risk analysis in
microfinance systems – who are the key
players, why do they want to analyze risk, and
how do they do it? Starting with a little
background on microfinance, we move on to
how upcoming microfinance institutions are
driving a new wave of big data analysis in
order to reach out to the masses and provide
monetary stimulation. This topic is particularly
interesting given the hit that major financial
institutions took during the global economic
slowdown of 2009, the semi-resurgence that
we are experiencing now, and the fact that
data analysis has never been more reachable
to masses. That being said, it is important to
remember that data analysis alone is not a
substitute for cautious investment from the
lenders, and the ability to stay within your
means as a borrower.
I. Small lending and big ideas
Microfinance has its roots in rural communities
and can be traced back to almost the start of
currency usage. Without direct access to
banking or money lending organizations, rural
communities resorted to non-traditional
methods to generate capital. The 1960s and
1970s saw the advent of modern microfinance
led by Nobel laureate Muhammad Yunus, who
started the Grameen bank in Bangladesh.
When combined with the general push by
governments around the world to fund
marginal farmers, disparate microlending
initiatives evolved into well-defined communal
lending schemes. These micro enterprise
lending programs had an almost exclusive
focus on credit for income generating
activities, targeting very poor and often
women borrowers.
Early adopters of microfinance saw very
encouraging results – in many cases, rural
entrepreneurs were able to develop
businesses, and bolster local economies.
Borrowers, especially women, were able to
repay loans and set up sustainable
livelihoods. While initial success stories of
microfinance painted a very optimistic picture,
microfinance institutions often faced a very
difficult time assessing the risk involved in
dealing with rural communities.
The plot thickens
By the late 1980s and 1990s, studies on the
impact of microfinance showed that many of
the people that the system aimed to help were
still unable to lift themselves out of poverty.
Other studies found that microloans were
most beneficial to clients who were above the
poverty line. These clients generally had the
stability to take bigger risks which usually had
better payoffs. Poorer borrowers often used
loans to sustain their lifestyle which caused
debt spirals and eventually led to loan
defaulting. It turns out that you have to be rich
in order to be poor.
Risky business
Big banks and other financial institutions have
developed and used detailed models of risk
assessment for decades. These risk
assessment models allow them to ascertain,
with some accuracy, the probability of a client
returning a loan. However, models of risk that
the bigger institutions used usually fell apart
when trying to deal with rural groups. Poor
communities do not possess liquid or
immobile assets, the ability to repay high
interest loans, or records of repayment fidelity.
Difficult last-mile access problems further made poor communities inaccessible to
only in the last mile could also reap non-
monetary benefits from funding a rural
enterprise through the invigoration of local
economies, and could use this value
generation to offset high interest rates. That
being said, microfinance groups could not
easily discern who could repay the loans
amongst their poorer clients. A direct result of
this was that microfinance institutions started
catering towards the better off rural sections,
adjusted their interest rates accordingly, and
left the poorest members behind.
II. Process for progress
Even though most micro-financial initiatives
are designed to be non-profit in nature, they
are certainly looking to be a sustainable
business and not incur losses. A key issue that
comes up is to determine which customers will
be able to pay back their loans and in the
process contribute to the society effectively.
Unlike their big-money counterparts,
microfinance companies cannot make
generalized risk models which are applicable
across societies. However, thanks to technological progress, the barrier to entry for specialized and powerful number crunching has dropped dramatically. When
this is combined with the influx of
communication technology into rural India, we
get an unprecedented ability to connect with
low income societies through powerful
computing and communication based
technologies.
Correlations in unlikely places
Rural borrowers do not possess the data
points, such as a credit history and
documented assets, that financial institutions
have traditionally used in order to assess
repayment likelihoods. This lack of credit
information does not mean that the poor have
bad credit. This is where big data comes in –
modern micro-financial institutions are now
looking at utilizing the vast quantities of
disparate data that rural people generate in
order to build specialized, local risk models.
Take Vodafone's Chota Credit as an example
– the company analyses a prepaid user's
usage, regularity of payments, recharge
amounts, phone usage statistics, and can
issue nanoloans in the form of credit to the
user's balance from 10 rupees up to 197
rupees. Apart from telephonic connectivity,
modern cell phones give people easy access
to online platforms as well. Companies like
Kabbage scrutinize online profiles to
determine credit worthiness, while companies
like DemystData generate their own data by
1. This is because they usually needed a loan in order to support a lifestyle that their existing income could not already support.
2. As we will see in the following sections; a financial model is only as good as the assumptions and linearizations it makes. This was one of the primary causes of the 2009 credit defaulting crisis.
3. Companies like Amazon provide cloud services that start for free! (http://aws.amazon.com/free/?sc_channel=PS&sc_campaign=AWS_Free_Tier_2013 )
4. A United Nations report suggests that Indians have better access to cell phones than toilets.
The Mahalanobis distance
How can we compare a client who has three children
with a client who recharges their phone for 40 rupees
twice a month, or another client who is a woman? Each
of these data points are as different as the features they
quantify. Professor P. C. Mahalanobis introduced a
scale-invariant distance metric that allowed for the
comparison of data-sets by looking at the statistical
variance of the data. By removing the scale of the data
and comparing values based on their statistical
properties, such as offset from the mean of a datum and
the variance of the population, the Mahalanobis distance
allows us to meaningfully compare binary, discrete, and continuous features with one another. This
lets us compare apples and oranges!
administering real-time tests to determine
risk. Other companies look at pulling
correlations from government statistics,
household compositions, as well as utilities
usage.
By pooling together all these non-traditional
data points, companies can try to paint a better
picture of how rural risk can be characterized.
We have talked briefly about how
microfinance groups can get information on
clients, and from which sources; now let's look
at how they can process this data. Using these
data points, non-linear decision boundaries
will have to be designed which can break up
clients into various credit ratings. The first step
in this process is to make the data
comparable. This consists of two parts: first
we normalize each feature amongst its peers, and then we look at a statistical distance metric instead of a raw distance metric (a good example here is the use of the Mahalanobis distance).
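A minimal sketch of the computation in Python with NumPy is given below; the tiny client table is invented purely to show the mechanics, and a real system would use far more records and features.

    import numpy as np

    # Hypothetical client features: [children, recharge amount (Rs.), recharges per month]
    clients = np.array([[3, 40, 2], [1, 120, 4], [2, 60, 2],
                        [0, 200, 6], [4, 30, 1], [2, 80, 3]], dtype=float)

    mean = clients.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(clients, rowvar=False))

    def mahalanobis(x):
        # Distance of one client from the population, scaled by the covariance structure
        d = x - mean
        return float(np.sqrt(d @ cov_inv @ d))

    new_client = np.array([1, 40, 2], dtype=float)
    print("Mahalanobis distance of new client:", round(mahalanobis(new_client), 2))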
Clusters, curves, and plain old hyper-planes
Now that we have a normalized feature set, we
can look at de-cluttering the data. What we are
really looking for are features that are strongly
correlated with a person's risk rating. Methods
such as principal component analysis can be
used to determine which features are strong
indicators of risk, and which features are
merely noise.
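The snippet below is a minimal sketch of this step using scikit-learn; the feature matrix is random stand-in data in which two columns are nearly copies of others, so most of the spread is captured by the first few components. In practice the columns would be the normalized client features discussed above.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    base = rng.normal(size=(200, 3))                                # three informative features
    extra1 = base[:, :1] * 0.95 + rng.normal(0, 0.05, (200, 1))    # near-duplicate feature
    extra2 = base[:, 1:2] * 0.90 + rng.normal(0, 0.05, (200, 1))   # near-duplicate feature
    features = np.hstack([base, extra1, extra2])

    pca = PCA(n_components=5).fit(features)
    # Fraction of the data's spread explained by each principal component
    print(pca.explained_variance_ratio_.round(3))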
Principal Component Analysis (PCA)
PCA is used to understand and easily visualize which features play an important role in determining the distribution of the data. This gives data scientists deeper modeling insights and helps them make better choices when pruning data sources. The figure shows an example of how PCA can be used.
Figure – This is a trivialized example of how we can use PCA to simplify a model. In sub-figure (a), feature 1 plays a slightly bigger role in governing the spread of the data points than feature 2. However, the relative size of the role played by feature 2 with respect to feature 1 means we cannot afford to ignore either feature without a substantial drop in the model's performance. In sub-figure (b), both features have about the same control over the data spread. In the third sub-figure, (c1), feature 1 predominantly controls the spread of the data. This is particularly exciting because we can drop the second feature to reduce the complexity of our data distribution model for a relatively low loss in data fidelity, as shown in (c2).
Figure 2 : Forms, social media, cell phones can be sources to determine credit rating.
Using the good features, we can now try to find decision boundaries which can help us classify new clients. There are numerous machine learning tools that we can use for the job, but broadly we can use either:
l Supervised learning – where we can tell the algorithm about the credit rating for each client based on previous records, or
l Unsupervised learning – where algorithms can look into learning relationships on their own.
Complex data sets can be handled in the cloud using distributed statistical tools.
The final result is a decision boundary which we can use to determine the credit worthiness of an individual. When a new client comes in, based on the person's feature set, we can then make a better assessment of the person's ability to pay back a loan.
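As a minimal sketch of the supervised route, the snippet below trains a small decision tree on invented client features and repayment labels; a real deployment would use the institution's own historical records and a proper train/validation split.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)
    # Hypothetical features: [recharge amount (Rs.), recharges per month, household size]
    X = rng.normal(loc=[60, 3, 3], scale=[30, 1.5, 1.5], size=(300, 3))
    # Invented rule for illustration: regular, higher spenders repaid more often
    y = ((X[:, 0] + 20 * X[:, 1]) > 110).astype(int)   # 1 = repaid, 0 = defaulted

    model = DecisionTreeClassifier(max_depth=3).fit(X, y)
    new_client = [[45, 2, 4]]
    print("predicted repayment class:", model.predict(new_client)[0])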
III. Looking past the numbers
The main goal of microfinance initiatives is to
give low income sections of society the
opportunity to access financial tools in order to
help them out of poverty. Better risk analysis
can help these financial institutions reach out
to even the poorest groups in rural
communities. Thanks to the mainstream
adoption of big data analytics, improvements
in communication, and the gaining popularity
of machine learning tools, cheap and easy
access to these resources has never been
easier. The ability to have a methodical impact in the last mile is now within reach.
However, risk analysis is only as good as the underlying assumptions made during the modeling process. It is important to keep these assumptions in mind as rural risk modelling systems develop. The financial crisis of 2007-2008 is a good example of how indiscriminate use of mathematical functions with disregard to the foundations upon which they were built can have serious consequences.
The need of the hour is judicious invigoration of rural economies to stimulate growth. Long term stability can come about by creating jobs which not only provide adequate monetary compensation for the poor, but also opportunities to grow.
Bibliography
[1] R. Krieger, "The Evolution of Microfinance," [Online]. Available: http://www.pbs.org/frontlineworld/stories/uganda601/history.html. [Accessed 04 May 2014].
[2] M. Bateman, Why Doesn't Microfinance Work?, Zed Books, 2010.
[3] D. Hulme and P. Mosley, Finance Against Poverty, London: Routledge, 1996.
[4] A. Karnani, "Microfinance Misses Its Mark," Stanford Social Innovation Review, Summer 2007. [Online]. Available: http://www.ssireview.org/articles/entry/microfinance_misses_its_mark. [Accessed 07 May 2014].
[5] T. Pratchett, 'Captain Samuel Vimes 'Boots' theory of socioeconomic unfairness' - Men at Arms, Victor Gollancz, 1993.
[6] G. Dionne, "Risk Management: History, Definition and Critique," CIRRELT, 2013.
[7] R. Kühn, "Risk Modeling," King's College London, 23 April 2014. [Online]. Available: http://www.mth.kcl.ac.uk/~kuehn/riskmodeling.html. [Accessed 24 May 2014].
[8] Vodafone India Limited, "Prepaid Plans," [Online]. Available: https://www.vodafone.in/pages/prepaid.aspx. [Accessed 07 May 2014].
[9] D. Taylor and M. Schlein, "How Big Data Can Expand Financial Opportunities For The World's Poor," Forbes, 25 April 2014. [Online]. Available: http://www.forbes.com/sites/realspin/2014/04/25/how-big-data-can-expand-financial-opportunities-for-the-worlds-poor/. [Accessed 03 May 2014].
[10] J. Ekstrom, "Mahalanobis' Distance Beyond Normal Distributions," UCLA Department of Statistics, Los Angeles.
[11] P. C. Mahalanobis, "On the generalised distance in statistics," 1936.
[12] F. Salmon, "Recipe for Disaster: The Formula That Killed Wall Street," Wired Magazine, 23 February 2009. [Online]. Available: http://archive.wired.com/techbiz/it/magazine/17-03/wp_quant?currentPage=all. [Accessed 04 May 2014].
[13] A. Hollis and A. Sweetman, "Complementarity, Competition and Institutional Development: The Irish Loan Funds through Three Centuries," 1997.
[14] Global Envision, "The History of Microfinance," 14 April 2006. [Online]. Available: http://www.globalenvision.org/library/4/1051. [Accessed 5 May 2014].
[15] S. Rutherford, The Poor and Their Money, 2000.
[16] United Nations University, "Greater Access to Cell Phones Than Toilets in India: UN," United Nations University Media Relations, 14 April 2010. [Online]. Available: http://unu.edu/media-relations/releases/greater-access-to-cell-phones-than-toilets-in-india.html. [Accessed 07 May 2014].
Vaishali Patil
Areas of Interest
Embedded Systems Design and Analysis
About the Author
Data Analytics for Warranty Claims
I. Introduction
As a customer, when you buy a new product
like watch, mobile, TV, Play Station, or a new
CAR, what do you expect? I guess good
quality product with less price, less
maintenance cost and good reputation of
company in market. Suppose, you want to
purchase a mobile and you have two options.
One costs Rs. 5000 with 1-year warranty
period and another costs Rs. 6000 with 2
years warranty period, and you plan to use this
mobile for next 2 years. Then buying the
mobile with 2 years warranty would be
cheaper considering zero maintenance cost.
Warranty is legal assurance from the
manufacturer that if a product fails to function
as expected in warranty period then it will be
repaired or replaced by company, and
customer will not have to pay for it. In warranty,
one can't claim money refund like guarantee, if
the product fails to function as expected. Have
you ever thought of how companies decide the
warranty policies and which factors are
considered while deciding the policies? Why
don't all manufacturers simply give a lifelong warranty to attract customers and increase product sales?
Being a product manufacturer, one should
always consider product reliability, customer
satisfaction, profitability, and competition.
These major factors influence overall
business management and strategies. These
factors need to be considered in warranty
policy and hence designing warranty policies
is a very complex task.
The current state of the art includes data analytics of warranty claims and data analytics of failure data [1]. Failure data is gathered by doing
testing at manufacturer site. Generating all
test cases for field testing is very expensive
and not always feasible. Warranty claim data
is actual field data submitted by customer and
it includes information related to the time to
failure, type of failure, failure part, etc.
Manufacturer can analyze this data and find
out the cause of failures, predict the future
failures and take corrective action in advance
to minimize future warranty claims and can
save on warranty reserves. Subsequent
sections will give brief overview on warranty
claim process, Warranty Data Analysis (WDA)
methods, applications of WDA, and popular
WDA tools.
Let us see an example to understand how a
warranty claim gets processed. Suppose that
a customer buys xyz company's mobile with 6
months warranty and after 2 months customer
starts facing some problems. As the product is
in warranty period, customer goes to the
dealer from where he/she had purchased the
mobile. Dealer asks the customer to fill
warranty claim form with the details including
model number, date of purchase and
compliant details and so on. Technician at
dealer side investigates the problem. If it is
repairable at dealer's side, technician will
charge the repairing cost to the manufacturer.
If some part is damaged, dealer will place
replacement order of particular part to the
manufacturer. If the mobile is not repairable at
dealer's side, dealer will send it back to the
manufacturer for replacement.
In this entire process manufacturer is charged
with either repairing cost or replacement cost
(for some part or the entire piece). However,
the process does not stop here; now,
manufacturer has received very valuable
information like failure description, model
number of product, batch number, and
solution employed to fix the problem. How can
the manufacturer use this information?
Suppose, same type of problem is reported for
a particular batch, then the manufacturer can
use this information to find out whether the
problem is with the production line or
parts/techniques used during manufacturing
of the batch. If the problem appears to be with
the production line, manufacturer can upgrade
the production line and can avoid future
warranty claims. If the problem is with
par t i cu la r component used dur ing
manufacturing then manufacturer can claim
that amount to the supplier of the defective
part. If the problem is severe, manufacturer
can notify the customers who are using same
products who may experience similar problem
in near future. Manufacturer can also call back
such products and fix the problem in advance.
This will help to maintain company reputation
and customer satisfaction as well. However,
main challenge in this entire process is to
analyze the received warranty data.
II. Warranty Claim Process
III. Warranty Data Analysis
Warranty data is usually submitted by the
customer in the form of textual description of
the problem. This data is present in both
structured and non-structured format. Usually, submitted warranty data is coarse data, and
this is because of many reasons such as delay
in the reporting, aggregated data received
from the vendor, some information is missing
or vague and so on. Warranty Data Analysts
are facing challenge in analyzing such coarse
data which is of poor quality and extracting the
valuable information. Different algorithms
have been developed using data mining and
text mining techniques to extract valuable
information from such data. Then predictive
algorithm can be applied on this data to predict
future claims, failure etc. With such analysis
one of the automobile and motorcycle manufacturers was able to reduce warranty cases from 1.1 to 0.85 per vehicle, a 5%
reduction in warranty cases with an annual
savings of €30m [3].
Once the warranty data is gathered, the next step is to analyze the data. The analysis includes the following steps:
Step 1: Data fitting - From the nature of the data,
select the appropriate mathematical model
(some of the models are described in
'commonly used distributions' section), which
fits the data
Step 2: Estimating Parameters – Estimate the
parameters of the model which will describe
the data appropriately (How to determine
various parameters describing mathematical
model and significance of parameters is given
in 'commonly used distributions' section)
Step 3: Analysis Result – Generate analysis
results, these results can be used to predict
future failures in the field, reliability of
components, future warranty claims etc.
Table 1 shows data of cell phone failures
collected over period of 3 years for a particular
handset maker.
From the gathered data, maximum phone failures are observed during the initial quarters of service. The number of failures decreases as time-in-service increases. This data set can be mathematically expressed by an exponential function. Using the obtained exponential function, future failures for quarters 11, 12 and 13 are determined as shown in the table.
Quarter Number | No. of Failures Observed per 10,000 Phones (Actual) | No. of Failures per 10,000 Phones (Calculated) | Error
1 | 12.05 | 12.05 | 0
2 | 11.27 | 9.86 | 1.41
3 | 9.68 | 8.07 | 1.61
4 | 10.98 | 6.61 | 4.37
5 | 6.04 | 5.41 | 0.63
6 | 3.36 | 4.43 | -1.06
7 | 2.59 | 3.62 | -1.03
8 | 2.00 | 2.97 | -0.96
9 | 1.16 | 2.43 | -1.26
10 | 1.26 | 1.99 | -0.72
11 | NA | 1.63 | 0
12 | NA | 1.33 | 0
13 | NA | 1.09 | 0
Sum of Error Square: 29.39
Table 1. Cell Phones Failure Data
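A minimal sketch of such a fit in Python is shown below; it takes the actual failure counts from Table 1, fits a simple exponential decay with SciPy, and extrapolates to quarters 11-13. The fitted numbers need not match the calculated column of Table 1 exactly, since that column was produced by the author's own calculation.

    import numpy as np
    from scipy.optimize import curve_fit

    quarters = np.arange(1, 11)
    failures = np.array([12.05, 11.27, 9.68, 10.98, 6.04,
                         3.36, 2.59, 2.00, 1.16, 1.26])   # per 10,000 phones (Table 1)

    def expo(t, a, b):
        # Simple exponential decay: failures fall off with time in service
        return a * np.exp(-b * t)

    (a, b), _ = curve_fit(expo, quarters, failures, p0=(15.0, 0.2))
    for q in (11, 12, 13):
        print(f"quarter {q}: predicted {expo(q, a, b):.2f} failures per 10,000 phones")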
Fig. 5. Effect of σ on Normal Distribution pdf [7]
3. Weibull Distribution
Weibull distribution is a versatile function, widely used in reliability engineering. It takes the form of other types of distributions depending on the value of β. The exponential distribution is a special case of the Weibull distribution where β = 1, whereas for β ≈ 3 it approximates the normal distribution.
Following graph shows different shapes of
Weibull distribution with changing values of β.
Fig. 6. Weibull pdf with 0 < β < 1, β = 1, and β > 1 [7]
The 3-parameter Weibull probability density function is defined as
f(t) = (β/η) * ((t − γ)/η)^(β−1) * exp(−((t − γ)/η)^β), for t ≥ γ
where γ is the location parameter (failures start to occur only after γ), β is the shape parameter, and η is the scale parameter. Increasing η while keeping β constant has the effect of stretching out the probability density function, as shown in the figure below.
Fig. 7. Weibull pdf plot with varying values of η
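A minimal sketch with SciPy is given below; the shape, scale and location values are arbitrary choices used only to show how the three parameters enter the calculation of failure and survival probabilities.

    from scipy.stats import weibull_min

    beta, eta, gamma = 1.5, 100.0, 10.0    # assumed shape, scale and location (hours)
    dist = weibull_min(c=beta, scale=eta, loc=gamma)

    # Probability that a unit fails within its first 120 hours of operation
    print("P(failure by t = 120):", round(dist.cdf(120), 3))
    # Reliability (survival probability) at t = 200 hours
    print("R(200):", round(dist.sf(200), 3))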
IV. Applications of WDA
Information extracted from WDA can be used
by companies to plan various business and
management strategies. Some of the
widespread applications of WDA in industry
are [2]
Early Detection of Reliability Problem
WDA can give early indication of future
failures. With this information, companies can
implement solutions to avoid such failures and
thus reduce future warranty claims.
Design Modification
Data obtained with analysis may suggest
some design modification in the product to
improve the performance.
Field Reliability Estimation
Estimating reliability of product in field helps
companies deciding warranty policy, planning
for maintenance and preparing spare parts.
Claim/Cost Estimate
One can predict future warranty claims using
WDA data. This helps companies in reserving
budget for future warranty claims.
V. WDA analysis tools
Some of the commonly used WDA tools, and the
purposes for which they are used in
automotive industry, are listed below:
1. IBM SPSS predictive analytics solutions for
warranty claims[3]:
Significantly reduce warranty costs, improve
product quality, and enhance customer
satisfaction
2. SAS® Warranty Analysis[5]
Reduce warranty costs, improve product
quality, and improve brand reputation
3. Weibull++: Life Data Analysis (Weibull
Analysis) Software Tool [8]:
Calculate reliability of products
VI. Conclusion
Warranty data provides valuable insights
about the product reliability and quality to the
manufacturer. Many companies are already
recognizing importance of analyzing warranty
data and have started investing in warranty
programs. Studies show that warranty-related expenses cost from 0.5% to 7% of total
product revenue, so even small improvements
in the warranty process can lead to significant
savings and increase overall profit of
companies [3]. Even though this is true,
extracting valuable information from the
collected warranty claim data and then
processing it to predict some useful statistics
is a very tedious job and requires strong data
analytics background. In near future,
development of strong algorithms, tools, and
methods to analyze such data can be
differentiating factors in a competitive market
of warranty data analytics. Effectiveness of
warranty claim data analysis strongly depends
upon data submitted at different points in the
process and its quality. Hence, a process must
be established to improve quality of gathered
data which will help to make warranty claim
data analysis more effective.
VII. References
1. Patrick Tudor, We Predict, Alan J. Watkins and Swansea, “The Analysis of failure data and warranty claims data : A comparison and some lessons for automotive manufacturers”, May 21, 2013, Available at http://wepredict.co.uk/getfile.php?type=site_documents&id=AnalysisOfFailureData.pdf
2. Shaomin Wu “Warranty Data Analysis: A Review” Jan 10, 2012, Available at http://kar.kent.ac.uk/31005/1/QREI_UoK.pdf
3. SPSS WDA Tool- “IBM SPSS predictive analytics solutions for warranty claims”,Available at https://www950.ibm.com/events/wwe/grp/grp006.nsf/vLookupPDFs/Whitepaper_IBM%20SPSS%20Predictive%20Warranty%20Analytics[1]/$file/Whitepaper_IBM%20SPSS%20Predictive%20Warranty%20Analytics[1].pdf
4. SAS WDA Tool -“SAS® Warranty Analysis” Available at http://www.sas.com/content/dam/SAS/en_us/doc/productbrief/sas-warranty-analysis-100347.pdf
5. Warranty Data Analysis Overview, Available at http://reliawiki.org/index.php/Warranty_Data_Analysis
6. Weibull ++7 WDA Tool, Available at http://www.reliasoft.com/newsletter/v6i2/new_era.htm
7. “Life Data Analysis Reference” Feb 11, 2014, ReliaSoft Corporation, Tucson, Arizona, USA
8. Weibull Distribution, Available at http://www.reliasoft.com/Weibull/
Richard Feynman was an eminent physicist. He was also a very good artist, bongo player, and prankster. While working for the Manhattan Project at Los Alamos National Laboratory in New Mexico, Feynman found himself occasionally bored at the isolated location. "There isn't anything to do there," he used to complain. To fill the time, he played pranks on his friends. Once he worked out the combination to the locked filing cabinets belonging to nuclear physicist Frederic de Hoffmann. He wrote a series of cryptic notes and left them inside. After decoding those notes, de Hoffmann was alarmed, thinking that a saboteur had gained access to the secrets of the atomic bomb.
An electron and a positron go
into a bar.
Positron: "You're round."
Electron: "Are you sure?"
Positron: "I'm positive."
There are many interesting stories about Einstein. He was known for his distractedness. Once he was traveling on a train in Germany. The conductor approached him. Before the conductor said anything, Einstein started searching his pockets for the ticket. The conductor recognized Einstein and told him that he could ride the train for free. Einstein thanked him and said, “If I do not find my ticket, I would not know where to get off the train.”
Human behavior is commonly perceived as something which is difficult to predict. Think twice! Different institutions including government agencies, manufacturing industries, social networking sites, insurance agencies are armed with methods of predictive analysis that predict human behavior for different purposes. In the book, titled 'Predictive Analysis: The power to predict who will click, buy, lie and die', the author takes us on a tour of a typically lesser-known world of predictive analysis which is full of examples, entertainment, knowledge and astonishment.
In the introductory chapter, the author gives us a broader picture of data analytics, its effects and also states that predictive analysis forms an important part of 'Data Analytics'. He also makes us aware of the differentiation between forecasting and prediction, the former being done at a macroscopic level and the latter at the microscopic level. He also highlights that even with state-of-the-art technologies, accurate prediction is usually not possible, though these technologies take us quite close to the facts. An enormous amount of data collected at various places such as blogs, online transactions, news, social networking sites, etc. forms the basis of prediction. From this data, one can get personal as well as collective sentiments of a group of people. The author quotes an interesting example of a firm, SNTMNT, which allows people to trade based on sentiments of people observed through Twitter/blogs etc.
In the next chapter, the author then explains an important factor of predictive analysis, i.e. machine learning, through an interesting example of Chase Bank. The bank faced a challenge when the number of mortgage holders increased significantly. Predictive analysis (PA) helped the bank to sail through this difficult phase by marking every micro risk in terms of its credit score, classifying loans into different categories, learning from data, and building the learning machine which would then classify every new customer with accurate prediction of risk. The author tells us the importance of learning from positive as well as negative experience and explains the simple technique of decision trees to make us understand how machine learning works. He concludes the chapter by establishing a very good point that every risk becomes an opportunity when predictive analysis is in action.
BOOK REVIEW
Predictive Analysis: The power to predict who will click, buy, lie and die
Author: Eric Siegel
'The Ensemble Effect' is one of the most interesting chapters of this book, where the author takes us through an exciting story of the Netflix analytics contest. The problem statement of the competition is to build a predictive model which can improve the accuracy of movie ratings by 10% above Netflix's own recommendation model. In order to get the accurate ratings, many teams join, generating the ensemble model of predictive analysis. The dynamics of the ensemble model are explained in a detailed manner. In another interesting example, 'Watson and the Jeopardy challenge', the author tells us about an intelligent machine – Watson, built by IBM to contest in the Jeopardy! show. He makes us aware of the challenges of natural language processing and the difficulties in designing an intelligent machine to analyze open questions and answer them correctly. Readers feel thrilled to learn the size of information fed to the machine and how predictive analysis helps the machine build on knowledge that it does not have so far.
One of the highlights of this book is an exclusive section where multiple examples which benefited from the use of predictive analysis are described. These examples are from various domains including marketing, insurance, healthcare, safety and security, fraud detection, and telecom. They lead us to understand how Lloyds TSB increased annual profit to 8 million pounds by improving customer satisfaction, how Microsoft developed a GPS based technology to predict one's whereabouts after multiple years, how Hewlett Packard saved $66 million over 5 years by detecting false warranty claims, how Continental Airlines saved tens of millions of dollars by improving prediction of flight delays and so on. The variety of applications described in this section amazes us with the power and reach of predictive analysis. Apart from these examples, the author also discusses five effects of prediction.
All throughout the book, we keep on reading about the overreaching and surprising applications of predictive analysis. What we don't come across is even the slightest mention of probable side-effects of predictions all along these domains and applications. However, the book excites technocrats by the details of various techniques of predictive analysis and awes naïve readers by introducing the predictive analysis through very simplified, yet catchy examples.
Abhinav Khare
Areas of Interest
Indian History, Indian Politics, Design Thinking and Human Behavior
About the Author
Analytics Applied to Real World
I. Introduction
There is a famous saying in the field of BAI – Business Analytics and Intelligence – that “If you beat the data long enough then it will start talking”; the aforementioned phrase stresses the fact that the usefulness of data lies in its correct treatment, because data will depict only what we want it to depict. Absence of this understanding leads us to deducing such incredible relationships as that between the number of sparrows chirping in Calcutta and the number of centuries made by Sachin Tendulkar. The aim of this article is to create awareness among readers about the right treatment of statistical data through the application of BAI tools in order to make the right business decisions.
Another important function of BAI is to solve those complex mathematical problems related to business that seem simple initially but may turn out to be highly complex when studied in detail. Though this article will not attempt to solve such problems, it nevertheless tries to explain one such problem to appreciate the importance of BAI in day to day business proceedings. The traveling salesman problem is one such problem that involves number crunching of a very high magnitude and finds parallels in various scientific and business fields; the same is explained below:
If a salesman has to visit a certain number of cities such that he visits each city only once and the total distance traveled is minimum, and if a computer can evaluate 1 million routes per second, then it will take 77,000 years to complete all the routes for just 20 cities [1]. A 300 city problem will have 1.018 × 10^90 possible routes (a small back-of-the-envelope check follows the list below). To get an idea of the scale of this problem in other domains, find below the mapping of other applications:
l Circuit Board Drilling Applications ~ 17,000 cities
l VLSI fabrication ~ 1.2 million cities
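The back-of-the-envelope check promised above can be done in a few lines of Python, assuming that every ordering of the cities is counted as a separate route (which reproduces the 77,000-year figure quoted for 20 cities) and the same rate of one million route evaluations per second.

    import math

    def brute_force_years(n_cities, routes_per_second=1_000_000):
        routes = math.factorial(n_cities)          # every ordering counted separately
        seconds = routes / routes_per_second
        return routes, seconds / (3600 * 24 * 365)

    for n in (10, 15, 20):
        routes, years = brute_force_years(n)
        print(f"{n} cities: {float(routes):.3e} routes, roughly {years:.3e} years")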
There are multiple commercial applications of BAI; some of them are listed below:
l Airline industry - Flight scheduling
l Dynamic pricing and revenue management
l Telecommunications - Queuing theory, network algorithms
l Transportation industry - Routing, logistics
l Production – Inventory theory, simulation, analytical production line models, supply chain models
l Engineering and development – Design optimization, scheduling, resource allocation
l Finance – Portfolio optimization, capital budgeting
Like any other scientific method, BAI too consists of numerous techniques such as linear, non-linear and logistic regression, linear programming, Markov chains, classification trees such as CHAID and CART, and forecasting tools such as ARIMA for the aforementioned business applications. In the following few sections we will get introduced to a few basic BAI techniques and understand their application in the real world.
II. Regression
Linear Regression analysis is clearly one of the important techniques in the field of business analytics. It is a statistical process which is used to estimate relationships among different variables; there is one dependent variable and one or more independent variable(s). Regression helps us to understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. It also helps us to understand the degree of relationship between dependent and independent variables and hence to prioritize our efforts while doing resource planning and budget allocation, launching price discounts schemes etc. Regression analysis is also very widely used for prediction especially in the fields of operations, sales, financial analysis and marketing.
Indeed Regression is a significant tool that gives a cause and effect relationship among various factors. Reliability of the results is somewhat guaranteed by validation methods such as the coefficient of determination to check the goodness of fit of the regression, Analysis of Variance (ANOVA) and the F test to check the overall fitness of the regression model, t-tests to validate the relationship between the dependent and each individual independent variable, and residual analysis to check the model adequacies. However, one still needs to have a clear understanding of the data that has been gathered for analysis. Regression might throw at us results that are unexpected because of the data that has been fed. Hence, the data, the assumptions made while generating that data, and the different use cases play a very important role in making regression a fruitful activity.
One interesting application of regression can be the prediction of the players' auction price in IPL; let's get into the shoes of Vijay Mallya – the owner of RCB – who wants to have a guiding price for his target players before going into the auction hall. Initially, he would need all the parameters that are important for a cricketer's performance; a few of them are listed below:
l Batting/Bowling average (x1)
l Number of sixes hit (x2)
l Number of runs made/wickets taken (x3)
l Whether Indian or Foreign player (x4)
l Age of the player (x5)
l Number of matches played (x6)
l Past auction price (y)
As a next step, he would need the historical data for the above parameters for as many players as possible. Upon feeding this into a regression tool such as SPSS or MS-Excel, an equation showing the relationship among the various parameters can be obtained; the same can be depicted as below:
Y = a + a1*x1 + a2*x2 + a3*x3 + a4*x4 + a5*x5 + a6*x6
where 'a' is a constant and a1, a2 ... a6 are the coefficients of the parameters and may take positive/negative values depending upon past performances. Regression will also help to understand which of these parameters affect the auction price the most and which are the least significant and can be ignored.
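A minimal sketch of fitting such a model with scikit-learn is shown below; the handful of player records and prices are invented placeholders standing in for the full historical auction data described above, so the coefficients it prints are purely illustrative.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical rows: [average, sixes, runs, Indian(1)/foreign(0), age, matches]
    X = np.array([[38.0, 45, 1200, 1, 27, 60],
                  [29.5, 20,  800, 0, 31, 45],
                  [41.2, 60, 1500, 1, 24, 70],
                  [25.0, 10,  500, 0, 34, 30],
                  [33.7, 35, 1000, 1, 29, 55]])
    y = np.array([9.0, 4.5, 11.0, 2.0, 6.5])       # past auction price (made-up units)

    model = LinearRegression().fit(X, y)
    print("constant a:", round(model.intercept_, 2))
    print("coefficients a1..a6:", model.coef_.round(3))

    # Guiding price for a hypothetical target player
    print("predicted price:", round(model.predict([[36.0, 40, 1100, 1, 26, 50]])[0], 2))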
III. Linear Logistic Regression Analysis
It is another form of regression where instead of calculating the 'value' of a parameter, the 'probability' of the occurrence of that value is calculated. In most applications, logistic regression classifies data into different categories; for example, classifying bank customers into high, medium, and low risk customers, or classifying customers by who is likely to accept or deny loans from banks.
Another interesting application of logistic regression can be in the field of HR practices, especially when thousands of job applications are received for a coveted position or during a campus recruitment drive. Following are a few basic steps for using logistic regression in campus recruitment:
l Firstly, data related to the candidates' educational qualifications (such as marks in SSC, HSC, Graduation, Post-Graduation, etc., board/university), age, and past experience needs to be collected.
l Additionally, similar data of the past and present employees needs to be obtained.
l A logistic regression model – for different job profiles – needs to be designed based on current and past employees' data.
l Once the model is ready, data of the applicants can be fed in order to do first level filtering based on the probability of the candidate making it till the final recruitment stage.
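The snippet below is a minimal sketch of that idea with scikit-learn; the candidate features and the selected/not-selected labels are fabricated stand-ins for the historical employee data described in the steps above.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(7)
    # Hypothetical features: [SSC %, HSC %, graduation %, years of experience]
    X = rng.uniform([50, 50, 50, 0], [95, 95, 90, 5], size=(200, 4))
    # Invented label: candidates with stronger academics tended to reach the final stage
    y = (X[:, :3].mean(axis=1) + 2 * X[:, 3] > 75).astype(int)

    model = LogisticRegression(max_iter=1000).fit(X, y)
    applicant = [[72, 68, 70, 1.0]]
    # Probability of this applicant making it to the final recruitment stage
    print("probability:", round(model.predict_proba(applicant)[0, 1], 2))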
IV. ARIMA (Auto Regressive Integrated Moving Average)
It is one of the powerful techniques for making forecasts based on historical data; forecasting is a very important component of business functions such as sales, planning and operations. Typically, while making a forecast, factors such as Trend (a persistent, overall upward or downward pattern over a long time), Cyclicity (repeated upward and downward movements over a time horizon of 2-10 years depending on the industry), Seasonal components (regular upward/downward movements) and other erratic/unsystematic disturbances due to unforeseen factors such as natural calamities, accidental hazards, etc. need to be considered. It is because of these aforementioned factors that traditional techniques of forecasting such as moving average and exponential smoothing may not give the desired results, as they fail to take care of the errors in the data. ARIMA has a component called moving average that reduces errors induced in the data due to the aforementioned factors.
Application
Consider an automotive Tier 1 company that is present majorly in the aftermarket segment and wants to come up with a sales forecast for the coming month. To use ARIMA for forecasting, it initially needs to identify the critical factors that impact its sales figures. These factors include dealers' discount (a1), average lead time (a2), promotional budget (a3), etc. As a next step, the company needs to gather past data related to the identified parameters. Once the data is worked upon using ARIMA, a model depicting the relationship among the various criteria, including an error function, can be constructed; the same is depicted below:
S(t) = α0 + α1*S(t-1) + α2*S(t-2) + ... + β0 + β1*ε(t) + β2*ε(t-1) + ...
where the coefficients 'α' and 'β' and the error 'ε' are dependent on the past figures of a1, a2 and a3 mentioned above. The obtained model will give the sales figure of a particular period when the values of S(t-1), S(t-2), etc. are entered.
V. Markov Chain
It is a stochastic process – a process where the future state depends only on the present state and is independent of past states – that helps to determine the future or stable state of various business conditions. Significant applications of Markov chains lie in areas such as market share among different competitors, value of an investment portfolio, customer value, customer retention, inventory level, etc.
A Markov chain is of the form
S(t) = S(0) * P^t    (1)
where S(t) is the future state, S(0) is the present state, and P is the transition probability matrix.
Following example depicts an application of Markov chain in retail business:
Company 'A' that runs a retail store wants to take a decision on how much budget it should allocate for promotional activities for its washing powder product line. It has tracked the buying pattern of several housewives over a certain period of time for different brands of washing powders. It has consumer purchase data such as who consistently bought certain brands, who switched frequently from one brand to another, and the quantity bought. Table 1 depicts some hypothetical values for the buying pattern of 2 washing powder brands:
Week ending | Period | No. of customers that bought product of brand 1 | No. of customers that bought product of brand 2 | No. of customers that bought products from both brands | No. of customers that did not buy products from any of these two brands
5-Jan | 1 | 22 | 44 | 9 | 25
12-Jan | 2 | 18 | 44 | 8 | 30
19-Jan | 3 | 18 | 38 | 11 | 33
26-Jan | 4 | 22 | 44 | 10 | 24
2-Feb | 5 | 24 | 39 | 11 | 26
9-Feb | 6 | 26 | 35 | 9 | 31
Table 1. Details of sales of two washing powder brands
Based on the above data, the probability transition matrix P is calculated. It can be depicted (hypothetical values) as follows:
0.67 0.0875 0.023 0.22
0.04 0.72 0.044 0.20
0.124 0.24 0.512 0.124
0.146 0.26 0.022 0.57
Table 2. Probability Transition Matrix
Let's assume that for the current week the buying pattern is as follows:
No. of customers that bought product of brand 1 | No. of customers that bought product of brand 2 | No. of customers that bought products from both brands | No. of customers that did not buy products from any of these two brands
21 | 50 | 6 | 23
Table 3. Current Week Buying Pattern
By using equation 1 above, the expected buying pattern for next week (week no. 2) can be calculated as S(2) = S(0) * P^2. The resultant values that we get after matrix multiplication are S(2) = (40 10 25 10).
This means that 40 consumers will buy product from brand no. 1, 10 will buy product from brand 2, 25 will buy both the products, and 10 will not buy any of these 2 products. Hence, it makes sense to run a discount scheme for brand 1 as there is a likelihood of people buying products of brand 1 in higher numbers.
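The arithmetic behind equation 1 can be reproduced in a few lines of NumPy, as sketched below using the transition matrix of Table 2 and the current-week pattern of Table 3; the result can then be compared with the S(2) vector quoted above.

    import numpy as np

    # Transition probability matrix P from Table 2
    # state order: [brand 1, brand 2, both brands, neither brand]
    P = np.array([[0.670, 0.0875, 0.023, 0.220],
                  [0.040, 0.7200, 0.044, 0.200],
                  [0.124, 0.2400, 0.512, 0.124],
                  [0.146, 0.2600, 0.022, 0.570]])

    s0 = np.array([21, 50, 6, 23])                # current week buying pattern (Table 3)

    # Expected pattern after two weeks: S(2) = S(0) * P^2
    s2 = s0 @ np.linalg.matrix_power(P, 2)
    print(s2.round(1))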
VI. Other Trends in BAI
BAI has moved much beyond numbers and has entered the realm of 'words and languages' through social media analytics, which forms the backbone of digital marketing. In an era where one bad tweet or bad Facebook post from a dissatisfied customer can make or break a product, it has become imperative for business firms to keep an eye on the social digital network, and BAI, through tools such as big data and sentiment analysis, empowers organizations to do so. BAI together with cloud computing and big data analysis has empowered businesses to realize consumers' needs even before a consumer herself comes to know about them, and thus has empowered organizations to be pro-active and ahead of the competition.
VII. Conclusion
With the world getting closely connected and organizations getting access to trillions of bytes of raw data, there arises a strong need to make sense out of this unsystematic information; business analytics and intelligence (BAI) is the tool that helps to deduce an effective message from this chaos. Apparently, the future is going to be more complex than our present. Business firms that are giving BAI its due respect and that are enabling and empowering their employees to get good exposure to it are rightly poised for riding the next tide of growth.
References
[1] Teaching notes: Academic course viz. Business
Analytics and Intelligence by Professor Dinesh
Kumar at Indian Institute of Management,
Bangalore
[2] Regression Analysis, Wikipedia Article, Online,
Available at:
http://en.wikipedia.org/wiki/Regression_analysis
[3] George P.H. Styan and Harry Smith, JR, 'Markov
Chain Applied to Marketing'
SCIENTIST PROFILE
Thomas H. Davenport
“Whether or not the term 'big data' proves faddish in the
months to come, the analysis of large volumes of
unstructured, multi-source data is here to stay,” says
the famous author and academic Thomas Davenport.
He is known as an expert in areas like analytics,
knowledge management, and business process
improvement. He is currently the President's Distinguished Professor of Information Technology and Management at Babson College, Massachusetts, USA. Thomas is
director of research at the International Institute for
Analytics. Thomas Davenport was born on October 17,
1954. He completed his Ph.D. in social science from Harvard University in 1980, and his thesis was 'Virtuous Pagans: Unreligious People in America'. He is a very sharp and deep thinker.
Davenport has written or coauthored around fourteen
books, which include the first books on analytical
competition, reengineering of business process,
achieving value from enterprise systems, and
knowledge management. He has written more than one
hundred articles published in MIT Sloan Management
Review, Harvard Business Review, California
Management Review, and the Financial Times.
Davenport has been a columnist for Information Week,
CIO, and Darwin magazines. Some of the famous and
latest books of Davenport are Big Data at Work (2014),
Keeping up with the Quants (2013), Judgment Calls - 12
Stories of Big Decisions and the Teams that Got Them
Right (2012), Analytics at Work (2010), Competing on Analytics (2007), Thinking for a Living - How to Get
Better Performances and Results from Knowledge
Workers (2005). Davenport was recruited as one of the first management thinkers to blog for Harvard Business, and his blog “The Next Big Thing” has become a favorite with readers, as has his Wall Street Journal blog.
In his books on analytics, he showed how prime firms
built analytical capabilities. His books explain how those firms moved beyond the approach of “going with the gut” when taking decisions about product pricing, inventory maintenance, talent hiring, and so on. In
addition to this, the books expose how data analysis and
systematic reasoning is used by managers and how it
helps to improve efficiency, profits and risk management
in their day-to-day activities. Most of the books explain
about the steps of initialization of analytics. Combining
the science of quantitative analysis with the art of sound
reasoning, Analytics at Work proposes a road map and
methods for releasing the vital information buried in your
company's data.
Davenport, throughout his articles, explains about how
big data is important and differs from the result of
traditional analytics. In addition to these he also talks
about moving analytics from IT to business and
operational functions. All of his publications are good guides for people in the areas of data analytics and decision making. He also advises analysts to put themselves in the place of an analytics consumer before starting their work. He also
suggests how to be an intelligent analytics consumer.
Davenport travels the world to stimulate, provoke, and fill
audiences with the innovative ideas, scenarios, and best
methods exposed in his books. In his talks, he covers a wide range of day-to-day and upcoming topics crucial to the success of any organization's mission. In recent
years, he usually speaks about decision-making and
analytics. Whether in talk or interview, his messages are
always clear, precise, and penetrating.
Davenport is one of the top 50 Business School
Professors in the world. He was included as one of only
four IT management thought leaders on their “100 Most
Influential People in IT” list, by Ziff Davis. He has been
listed as one of 10 “Masters of the New Economy” by CIO
Magazine, one of 25 “E-Business Gurus” by Darwin, and
also the third leading business-strategy analyst (just
behind Peter Drucker and Tom Friedman) by Optimize
Magazine. His presence is felt around the world through
his writings, thinking, talks, new ideas and trends in
business. Thomas Davenport, with his intelligent
thoughts, various books in areas of analytics, knowledge
management, and through overwhelming talks and
seminars, has introduced a new era of analytics.
Scientist Profile
Author
Smitha K P
Areas of Interest
Multicore Programming
Embedded Systems
Abstract:
Program parallelization involves multiple considerations. These include methods for data or control
parallelization, target architecture, and performance scalability. Due to the number of such factors, the best
parallelization strategy for a given sequential application often evolves iteratively. Researchers are
confronted with choices of parallelization methods to achieve the best possible performance. In this
paper, we share our experience in parallelizing a very large application (250K LOC) on shared memory
processors. We iteratively parallelized the application by leveraging selective benefits from automatic
as well as manual parallelization. We used YUCCA, an automatic parallelization tool, to generate
parallelized code. Using the information generated by YUCCA, we improved the performance by
modifying the parallelized code. This iterative process was continued until no further improvement was
possible. We observed performance improvement of 17% compared to 5% improvement reported in
the literature. The performance improvement was gained in very short time and despite the constraint
of having to use only SMPs for parallelization.
Authors: Smitha K.P, Aditi Sahasrabudhe, Vinay Vaidya
Method of Extracting Parallelization in VeryLarge Applications through Automated Tooland Iterative Manual Intervention
Published in Parallel and Distributed Processing Techniques (PDPTA) 2014, Las Vegas and Applications
Enhanced Automated Data DependencyAnalysis for Functionally CorrectParallel CodeAuthors: Prasad Pawar, Pramit Mehta, Naveen Boggarapu, and Léo Grange
Published in Parallel and Distributed Processing Techniques and Applications (PDPTA) 2014, Las Vegas
Abstract:
There is a growing interest in the migration of legacy sequential applications to multicore hardware
while ensuring functional correctness powered by automatic parallelization tools. OpenMP eases
the loop parallelization process, but the functional correctness of parallelized code is not ensured.
We present a methodology to automatically analyze and prepare OpenMP constructs for
automatic parallelization, guaranteeing functional correctness while benefitting from multicore
hardware capabilities. We also present a framework for procedural analysis, and emphasize the
implementation aspects of this methodology. Additionally, we cover some of the imperative
enhancements to existing dependency analysis tests, like handling of unknown loop bounds. This
method was used to parallelize in Advance Driver Assistance System (ADAS) module for Lane
Departure Warning System (LDWS), which resulted in a 300% performance increase while
maintaining functional equivalence on an ARM™ based SOC.
GMM Based Approach for Human FaceVerification using Relative Depth FeaturesAuthors: Ankita Jain, Krishnan Kutty and Suresh Yerva
Abstract:
Image based face detection and verification is a well-researched topic and has found many
applications. The limitation with image based techniques is that 3D real world objects are mapped on to
a 2D plane, which causes loss of 3D features of real objects. Multiple systems are available in literature
to estimate depths of objects viz. stereo images, 3D and 2.5D scanners, etc. However, these systems
require additional hardware and are expensive. In the proposed approach, a colored pattern of light
scans the face and video is captured simultaneously using an optical camera. The colored pattern is
detected in every frame using Gaussian Mixture Model (GMM). Due to non-planarity of the face, the
pattern gets distorted. Distortion in the pattern is calculated to obtain 3D information. In this way, 3D
information of face is derived from the 2D frames. Based on this data, 3D or depth features are
calculated, which are then used for face verification. This approach is robust and handles variations
due to light and insignificant periodic changes of background objects.
A New Approach forRemoving Haze from ImagesAuthors: Vinuchackravarthy Senthamilarasu,
Anusha Baskaran, Krishnan Kutty
Abstract:
The presence of suspended particles like haze, fog, mist, smoke and dust in the atmosphere
deteriorates quality of captured image. It is of paramount importance to reduce these deteriorating
effects from the image for various image based applications; viz. ADAS, CCTV surveillance, etc. In this
paper, this interesting problem of enhancing the perceptual visibility of an image that is degraded by
atmospheric haze is addressed. An efficient way of estimating the transmission map and the
atmospheric light is proposed, which is further used in reducing effects of haze from the image. The
underlying idea is to restore the true color of each pixel by using our proposed method that minimizes
the lowest of RGB values per pixel. This is accomplished using the HSV color space and the haze
image model. In comparison with the other state of the art methods that are available in literature, the
proposed method is shown to be capable of recovering better haze-free images both in terms of visual
perception and quantitative evaluation.
Published in International Conference on Advances in Computing,Communications and Informatics (ICACCI-2013)
Published in International Conference on Image Processing, Computer Vision,and Pattern Recognition (IPCV-2014)
Innovation for customers
About KPIT Technologies Limited
About CREST
Invitation to Write Articles
Format of the Articles
KPIT is a trusted global IT consulting & product engineering partner focused
on co-innovating domain intensive technology solutions. We help
customers globalize their process and systems efficiently through a
unique blend of domain-intensive technology and process expertise. As
leaders in our space, we are singularly focused on co-creating technology
products and solutions to help our customers become efficient,
integrated, and innovative manufacturing enterprises. We have filed for
51 patents in the areas of Automotive Technology, Hybrid Vehicles, High
Performance Computing, Driver Safety Systems, Battery Management
System, and Semiconductors.
Center for Research in Engineering Sciences and Technology (CREST) is
focused on innovation, technology, research and development in
emerging technologies. Our vision is to build KPIT as the global leader in
selected technologies of interest, to enable free exchange of ideas, and to
create an atmosphere of innovation throughout the company. CREST is
recognized and approved R & D Center by the Dept. of Scientific and
Industrial Research, India. This journal is an endeavor to bring you the
latest in scientific research and technology.
Our forthcoming issue, to be released in October 2014, will be based on “Ubiquitous Computing”. We invite you to share your knowledge by contributing to this journal.
Your original articles should be based on the central theme of “Ubiquitous Computing”. The length of the articles should be between 1200 to 1500 words. Appropriate references should be included at the end of the articles. All the pictures should be from public domain and of high resolution. Please include a brief write-up and a photograph of yourself along with the article. The last date for submission of articles for the next issue is August 28, 2014.
To send in your contributions, please write to
To know more about us, log on to www.kpit.com.
A solution from KPIT saves that trouble
Start on a long drive and the first thing the occupants will do is to adjust their seat position. It can get painful at times if you want to keep switching seats through the journey. KPIT has built a solution that adjusts the seat positions automatically by sensing the height and knee position of the occupants. A dedicated team worked on varied aspects of Electronics, Mechanical, Hardware and Software to arrive at the best technology solution to be implemented. KPIT took the germ of an idea from its client and we helped our client build a prototype that can become an important feature to improve car sales.
Technology flavors: Engineering Design, Seating systems, Sensor technology
© KPIT Technologies Ltd. Images and trademarks used in KAFÉ are a property of respective owners. Flavors served in KAFÉ are illustrative only and may change in full service.
What if Dominic Toretto had to
adjust his seat each time before
racing in Fast & Furious?
Want to say something? Give a buzz!
For private circulation only.
TechTalk@KPIT July - September 2014
35 & 36, Rajiv Gandhi Infotech Park, Phase - 1, MIDC, Hinjewadi, Pune - 411 057, India.
Data is worthless if you don't communicate it.
Thomas H. Davenport
Born on October 17, 1954