Preliminaries of Data Science

Vinay Kothamasi, Manager 1, SE Analyst, Dell EMC, [email protected]
Aroun Kesavaraj, Senior SE Analyst, [email protected]

Knowledge Sharing Article © 2017 Dell Inc. or its subsidiaries.


2017 Dell EMC Proven Professional Knowledge Sharing 2

Table of Contents

Introduction
Types of Analytics
    Descriptive Analytics
    Diagnostic Analytics
    Predictive Analytics
    Prescriptive Analytics
Data Science Process
Probability and Statistics
    Random Variable
    Probability Distributions – Binomial and Poisson
    Central Limit Theorem
Statistics
    Visualizing Statistics
Simulation and Hypothesis Testing
    Simulations
Hypothesis Testing
    Hypothesis Tests and P-Values
    Types of Test
        Single-Sample T-Tests and Z-Tests
        Two-Sample Tests
Conclusion
References

Disclaimer: The views, processes or methodologies published in this article are those of the

authors. They do not necessarily reflect Dell EMC’s views, processes or methodologies.


Introduction

The term "data science" has existed for over thirty years; it was used initially as a substitute for computer science by Peter Naur in 1960. Data Science is all about using data to make decisions that drive actions. The goal of Data Science is to explore and perform quantitative analysis of all available structured and unstructured data in order to develop understanding, extract knowledge, and formulate actionable results.

Data Science plays a major role in solving real-world problems: for example, understanding customer behavior patterns, and enabling predictive rather than reactive analysis across industries such as health care and the public and private sectors. This Knowledge Sharing article discusses the concepts that serve as the fundamentals of data science.

So, what is Data Science?

Data Science is the exploration and quantitative analysis of all available

structured and unstructured data to develop understanding, extract knowledge,

and formulate actionable results.


Types of Analytics

Figure 1 – Types of Analytics

Descriptive Analytics is a preliminary stage of data processing that creates a summary of

historical data to yield useful information and possibly prepare the data for further analysis.

Diagnostic Analytics is a form of advanced analytics which examines data or content to

answer the question “Why did it happen”, and is characterized by techniques such as drill-down,

data discovery, data mining, and correlations.

Predictive Analytics is the branch of advanced analytics used to make predictions about unknown future events. Predictive analytics uses techniques such as data mining, statistics, modelling, machine learning, and artificial intelligence to analyze current data and make predictions about the future. We will discuss Predictive Analytics further later in this article.

Prescriptive Analytics is the area of Business Analytics (BA) dedicated to finding the best

course of action for a given situation. Prescriptive Analytics is related to both Descriptive and

Predictive Analytics.


Data Science Process

Data Science is an iterative process. Process models such as CCC, KDD and the CRISP Data Mining process, described below, are often termed the "Life Cycle of Data Science".

CCC process – The Computing Community Consortium Big Data Whitepaper

(2012).

KDD Process - Knowledge Discovery in Databases (KDD) process (1997)


CRISP-DM Process - “The Cross Industry Standard Process for Data Mining

(CRISP-DM) (2000)”

Note that CRISP-DM includes Business Understanding and Data Understanding phases, which are not mentioned in the KDD and CCC processes.

Probability and Statistics

The probability of an event occurring is the number of outcomes in the event divided by the number of outcomes in the sample space. This is only true when the outcomes are equally likely; such a probability is called a classical probability: the relative frequency of each event in the sample space when each outcome is equally likely.
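As a quick sketch (not part of the original article), this definition can be computed directly in Python; the single fair die roll below is a hypothetical example chosen only for illustration:

```python
from fractions import Fraction

# Classical probability: outcomes in the event divided by outcomes in the
# sample space, valid only when all outcomes are equally likely.
sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
event = {2, 4, 6}                   # rolling an even number

p_even = Fraction(len(event), len(sample_space))
print(p_even)  # 1/2
```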

Random Variable: A random variable is a variable whose value is unknown, or a function that assigns a value to each of an experiment's outcomes. Example – the roll of a die.

Random variables can be discrete or continuous. Discrete variables have a countable number

of distinct outcomes; for example, the number of cookies in a jar, or a numeric identifier

associated with a day of the week. Continuous variables have real-valued outcomes, for

example, the temperature on a given day or the volume of water flowing in a stream over a

specified period.


"Probability" measures the likelihood that a random variable takes on a value in a specified range of outcomes. For example, consider a random variable representing Saturday sales at a particular store; the probability of sales falling between $7000 and $9000 might be 0.43.

Probability is always expressed as a numeric value between 0 and 1, and the total probability across all outcomes of any random variable is always 1 (i.e. 100%). For example, consider the following table, which shows a discrete random variable that represents the days of the week together with the probability that a particular sales transaction occurred on that day:

X (day)   1       2       3       4       5       6       7
P(X)      0.125   0.125   0.125   0.125   0.125   0.1875  0.1875

The table above describes the probability mass function (PMF) for the random variable X – it identifies the probability (P) for each outcome of the random variable (X). For example, the probability of the weekday of a sales transaction being Monday (in other words, the probability of the variable's value being 1) is 1/8 (which is the same as 0.125 or 12.5%). This is written as P(X=1) = 1/8, P(X=2) = 1/8, and so on. Note that all of the probabilities add up to 1.

When you know the PMF for a random variable, you can summarize the random variable by determining its mean and variance. The mean of the variable indicates its centrality – in other words, its average value. It is also known as the variable's expected value and is represented by the symbol μ (mu). To calculate the mean, multiply each outcome by its probability and total the results. For the weekday variable above, the calculation is:

(0.125 x 1) + (0.125 x 2) + (0.125 x 3) + (0.125 x 4) + (0.125 x 5) + (0.1875 x 6) + (0.1875 x 7) = 4.3125


Variance indicates the spread of the variable's values around the mean. It is a squared quantity, represented by the symbol σ². To calculate the variance, multiply the probability of each outcome by the square of (outcome minus mean), and add together the results for each value:

(0.125 x [1 – 4.3125]²) + (0.125 x [2 – 4.3125]²) + (0.125 x [3 – 4.3125]²) + (0.125 x [4 – 4.3125]²) + (0.125 x [5 – 4.3125]²) + (0.1875 x [6 – 4.3125]²) + (0.1875 x [7 – 4.3125]²) = 4.214844

Since the variance is in squared units, it makes sense to calculate its square root – known as the standard deviation (σ). In this case, the standard deviation of the weekday variable is √4.214844, which yields 2.053008.
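The mean, variance, and standard deviation calculations above can be reproduced with a short Python sketch, using the weekday PMF from this section:

```python
from math import sqrt

# PMF for the weekday variable X: days 1-5 have probability 0.125,
# days 6 and 7 have probability 0.1875 (as in the table above)
pmf = {1: 0.125, 2: 0.125, 3: 0.125, 4: 0.125, 5: 0.125, 6: 0.1875, 7: 0.1875}

mean = sum(p * x for x, p in pmf.items())                    # mu
variance = sum(p * (x - mean) ** 2 for x, p in pmf.items())  # sigma squared
std_dev = sqrt(variance)                                     # sigma

print(mean)                # 4.3125
print(round(variance, 6))  # 4.214844
print(round(std_dev, 6))   # 2.053008
```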

Probability Distributions – Binomial and Poisson

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no trials, each of which yields success with probability p.

The Poisson distribution is a discrete distribution which gives the probability of a number of independent events occurring in a fixed interval of time.
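Both distributions have simple closed-form probability mass functions, which the Python sketch below evaluates directly; the coin-flip and arrival-rate examples are hypothetical, chosen only to exercise the formulas:

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(k successes in n independent yes/no trials, success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(k independent events in a fixed interval, average rate lam)."""
    return exp(-lam) * lam**k / factorial(k)

# e.g. probability of exactly 3 heads in 10 fair coin flips,
# and of exactly 2 arrivals when the average rate is 4 per interval
print(binomial_pmf(3, 10, 0.5))  # 0.1171875
print(poisson_pmf(2, 4))         # ~0.1465
```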

Both the Binomial and Poisson distributions can be used to calculate probability for discrete

random variables. For continuous variables, there are no discrete values in a PMF table, so the

probability is expressed as a curve known as the probability density function (PDF). Probabilities

of the variable value being within a specified range are calculated based on the area under the

curve, the total of which always adds up to 1.


For example, here's a PDF for a continuous variable that shows the probability that the variable's value is less than 7.

The curve in a PDF also defines the cumulative distribution function (CDF) for the variable: the CDF at a value x gives the probability that the variable is less than that value, or expressed as a formula, F(x) = P(X < x).

Central Limit Theorem: The central limit theorem (CLT) is a statistical theory stating that, given a sufficiently large sample size drawn from a population with a finite variance, the distribution of the sample means will be approximately normal, with mean approximately equal to the mean of the population.

Example of Central Limit Theorem

Consider a population with μ = 3 and σ = 1.73; the distribution is shown in the figure below. This population is not normally distributed, but the Central Limit Theorem will apply if n > 30. Note that n = 10 does not meet the criterion for the Central Limit Theorem, and the small samples on the right give a distribution that is not quite normal.
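The CLT can be demonstrated with a quick simulation. The sketch below assumes a continuous uniform population on [0, 6] (which has μ = 3 and σ ≈ 1.73, matching the example); the figures referenced above are not reproduced, only the numeric behavior:

```python
import random
import statistics

random.seed(1)

# Assumed population: uniform on [0, 6], so mu = 3 and sigma = sqrt(3) ~ 1.73.
# The population is not normal, but means of samples with n > 30 nearly are.
def sample_means(n, runs=5000):
    return [statistics.mean(random.uniform(0, 6) for _ in range(n))
            for _ in range(runs)]

means = sample_means(n=40)
print(statistics.mean(means))   # close to the population mean, 3
print(statistics.stdev(means))  # close to sigma / sqrt(40), about 0.27
```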


Statistics

Data science is largely concerned with statistical relationships and distributions of data.

Visualizing Statistics

One of the first things a Data Scientist should do with data is look at it – often by creating visualizations that show the comparative frequency with which different data values occur, or that plot relationships between different variables. Histograms are often used to view probability distributions.

Bar Charts are useful for plotting categorical data.


A Pareto chart is a bar chart with the data in descending order.

A scatter plot is used to show the relationship between two numeric variables.


A Box Plot is a simple way of representing statistical data on a plot in which a rectangle is drawn to span the first and third quartiles (the interquartile range), usually with a vertical line inside to indicate the median value. Whiskers extend from either side of the rectangle to show the spread of the lower and upper portions of the data.

Simulation and Hypothesis Testing

Simulations

Data scientists often need to use statistical methods to experiment with data and model real-world scenarios. When the data consists of a known number of independent random draws from a simple distribution, quantities of interest can often be estimated relatively simply.

However, many real-world scenarios are more complex, and cannot be easily modeled as

arising from a normal distribution or another simple distribution. In these cases, you can use a

simulation to model the variables and gain an understanding of how the scenario is likely to

work in reality.

To run a simulation, you must identify the possible outcomes for each random variable in the scenario along with their probabilities, and the relationships between the random variables. This defines the probability distribution you are working with. You then generate a number of random draws and look at the outcomes; this is the simulation.

For example, suppose you need to model customer satisfaction at a store where each customer

can rate service as 1 for poor, 2 for acceptable, and 3 for excellent. The individual ratings are


then totaled each day to give an overall satisfaction score. There are two random variables that

need to be taken into consideration for the scenario: the number of customers and the ratings

they give.

For this example, we’ll assume that the number of customers can be represented by a normal

distribution with a mean of 500 and a standard deviation of 20; and that 50% of the time these

customers tend to give a rating of 2, 20% of the time they give a rating of 1, and 30% of the time

they give a rating of 3.

Using these suppositions, you can run the simulation a large number of times, generating

random values for the two variables based on their probability distributions, and use the results

to model the likely distribution of total satisfaction scores. In this case, the distribution of

customers for 100,000 runs (or realizations) of the simulation looks like this:


The mean number of customers per day is 500, and this was achieved around 3,000 days out of

the total 100,000 simulated. The ratings given by those customers look like this:

Out of 100,000 realizations, around half (50,000) produce a rating of 2, around 20,000 produce a rating of 1, and around 30,000 produce a rating of 3. This corresponds to the probabilities we assumed for customer ratings. When we combine the results of the simulations for both variables, we can see the likely distribution of total satisfaction scores below:
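The satisfaction-score simulation can be sketched as below, using the distributions assumed above (customers approximately Normal(500, 20); ratings of 1, 2, 3 with probabilities 20%, 50%, 30%). For speed, this sketch uses 2,000 runs rather than 100,000:

```python
import random
import statistics

random.seed(42)

def simulate_day():
    # Number of customers: normally distributed, mean 500, std dev 20
    n_customers = max(0, round(random.gauss(500, 20)))
    # Each customer rates 1 (20% of the time), 2 (50%), or 3 (30%)
    ratings = random.choices([1, 2, 3], weights=[0.2, 0.5, 0.3], k=n_customers)
    return sum(ratings)

scores = [simulate_day() for _ in range(2000)]

# Expected total score: 500 customers x mean rating 2.1 = 1050
print(statistics.mean(scores))
```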


Hypothesis Testing

Hypothesis testing is a core skill in statistics. Hypothesis tests are used to evaluate data and

determine whether or not a hypothesis could be supported by the dataset.

Hypothesis Tests and P-Values

A hypothesis test uses statistics to answer a yes or no question about a data set, and the result tells you whether to reject a null hypothesis (which often represents the view that conditions have not changed) in favor of an alternative hypothesis (which represents a reason for the observed result different from the null hypothesis). For example, suppose a cupcake

store sells chocolate and vanilla cupcakes. You might suspect that each customer will have a

preference for a particular flavor, and that more customers might prefer one flavor (for example,

chocolate) over the other (vanilla). The null hypothesis (which we label H0) for your test is that

customers will choose chocolate or vanilla in equal numbers (in other words, there is a 50%

probability of the customer choosing vanilla, and a 50% probability that their preference will be

chocolate). The alternative hypothesis (H1) is that there is some unequal preference in the choice of flavor, so that the probability of a particular flavor (say, chocolate) being chosen is not 50%.

These hypotheses can be expressed as:

H0: P = 0.5

H1: P ≠ 0.5

Given a suitably large sample of cupcake sales data, you can determine the actual number of

chocolate cupcakes sold compared to the total sales, and work out how probable that result is if

the null hypothesis is true. For example, suppose the sample data includes 100 sales; 70 of

which were for chocolate cupcakes, and 30 of which were for vanilla cupcakes. If each cupcake

sold has an even 50% probability of being either chocolate or vanilla (as stated in the null

hypothesis), then based on a binomial distribution, the probability of 70 out of 100 cupcakes

sold being chocolate flavored is approximately 0.0023%. Note that the hypotheses are always

about the population, and not about the sample. (The population mean is unknown, but the

sample mean is not, thus we do not need to create hypotheses about the sample.)
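The 0.0023% figure can be checked directly from the binomial PMF, as in the Python sketch below (the two-sided tail sum at the end is an added illustration, not a number quoted above):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k successes in n trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 70 chocolate sales out of 100 under H0 (p = 0.5)
p_exact = binomial_pmf(70, 100, 0.5)
print(p_exact)  # ~2.3e-05, i.e. roughly 0.0023%

# A two-sided p-value sums all outcomes at least this extreme in both tails
p_value = 2 * sum(binomial_pmf(k, 100, 0.5) for k in range(70, 101))
print(p_value)
```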


The probability of observing what we did (or something more extreme) under the null hypothesis is known as the P-value. You compare the P-value to a pre-determined threshold, known as the significance level, that you decide in advance: if the P-value is lower than the significance level, you reject the null hypothesis. In most cases, a value of around 0.05 (or 5%) is chosen as the significance level, and in this case the P-value is much lower than that, so the null hypothesis can be rejected in favor of the alternative hypothesis.

Types of Test

There are numerous types of hypothesis test that you can conduct, depending on the type of

data and the alternative hypothesis you are trying to validate. Many tests are focused on

evaluating the mean of a given dataset and comparing it to an expected value.

Single-Sample T-Tests and Z-Tests

Suppose our cupcake store expects to sell an average of 75 or more cupcakes per day. You could record actual sales figures over a period of time and perform a test to determine whether the mean daily sales figure is less than 75. Depending on the volume of sample data available,

you can perform a z-test (which you should use for normal distributions with a known population

standard deviation, or for data sets with more than 30 independent observations – in which case

the sample standard deviation is close enough to the population standard deviation) or a t-test

(which can be used with a small number of observations and when the population standard

deviation is not known).

The result of the z-test or t-test includes a p-value, which you can use to determine whether or

not to reject the null hypothesis. In this case, the null hypothesis is that the mean sales amount

will be 75 or more, and the alternative hypothesis is that average daily sales will be less than 75.

This can be expressed as:

H0: μ ≥ 75

H1: μ < 75

This is an example of a one-tailed test, in which we test whether or not the population mean is less than a specified value. You could also perform a one-tailed test in the opposite direction, to determine whether or not the population mean is greater than the expected value, or you could perform a two-tailed test to determine whether or not the population mean varies from the expected value in either direction.
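A one-tailed test like this can be sketched in Python. The daily sales figures below are hypothetical, and since n = 31 (> 30), the sketch uses a z-test with the sample standard deviation, taking the p-value from the normal CDF:

```python
import statistics
from math import erf, sqrt

def normal_cdf(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical daily sales figures for 31 days
sales = [72, 74, 69, 71, 75, 70, 73, 68, 74, 72, 71, 69, 73, 70, 72,
         74, 71, 68, 75, 70, 72, 73, 69, 71, 74, 70, 72, 71, 73, 69, 70]

mu0 = 75  # H0: mu >= 75
z = (statistics.mean(sales) - mu0) / (statistics.stdev(sales) / sqrt(len(sales)))
p_value = normal_cdf(z)  # one-tailed: H1 is mu < 75

if p_value < 0.05:
    print("Reject H0: mean daily sales appear to be below 75")
```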


Two-Sample Tests

In addition to single-sample tests, you can perform tests that compare two samples. For

example, suppose you want to test the hypothesis that on average, chocolate cupcakes weigh

more than vanilla cupcakes. To test this hypothesis, you can individually weigh a set of

chocolate cupcakes and a set of vanilla cupcakes, and then conduct a t-test that compares the

mean weight of each set. The resulting p-value will indicate the significance of the difference in

mean weights.

Comparing the mean weights of two different cupcake flavors is an example of an unpaired test.

The individual observations (the measured weights of each cupcake) are independent – you

could even include more vanilla cupcakes than chocolate cupcakes (or vice-versa) without

affecting the outcome of the test. However, some two-sample tests are paired tests in which

there is a dependency between the observations in the two datasets. For example, suppose you

wanted to test the hypothesis that the daily average sales figure of chocolate cupcakes is higher

than that of vanilla cupcakes. In this case, the two sets of observations must be paired so that

the first observation in each sample is the total flavor-specific sales for the first day, the second

observation is the total flavor-specific sales for the second day, and so on.
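An unpaired two-sample comparison like the cupcake-weight example can be sketched with Welch's t statistic (which allows unequal variances). The weights below are hypothetical; converting t to a p-value would normally use a t distribution (e.g. scipy.stats.ttest_ind), omitted here to keep the sketch dependency-free:

```python
import statistics
from math import sqrt

def welch_t(sample_a, sample_b):
    """Unpaired t statistic for two samples with possibly unequal variances."""
    ma, mb = statistics.mean(sample_a), statistics.mean(sample_b)
    va, vb = statistics.variance(sample_a), statistics.variance(sample_b)
    return (ma - mb) / sqrt(va / len(sample_a) + vb / len(sample_b))

# Hypothetical cupcake weights in grams
chocolate = [62.1, 63.4, 61.8, 64.0, 62.9, 63.7, 62.5, 63.1]
vanilla = [60.9, 61.5, 60.2, 61.8, 61.1, 60.7, 61.3, 60.5]

t = welch_t(chocolate, vanilla)
print(round(t, 2))  # a large positive t suggests chocolate cupcakes weigh more
```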

Conclusion

Data Science is an iterative process which is about using data to make decisions that drive

actions. Data Science plays a major role in replacing intuition-based decision making with

analytical-based decisions. This will predominantly help in making value-based and futuristic

decisions. Data Science also helps in transforming raw data of a company into a valuable asset

that also increases the pace of the decision-making process.


References

https://en.wikipedia.org/wiki/Statistics

https://en.wikipedia.org/wiki/Data_science

Free online courses on Data Science:

https://www.edx.org/course/data-science-orientation-microsoft-dat101x-1

https://www.edx.org/course/data-science-essentials-microsoft-dat203-1x-2

Dell EMC believes the information in this publication is accurate as of its publication date. The

information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” DELL EMC MAKES NO

REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE

INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED

WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying and distribution of any Dell EMC software described in this publication requires an

applicable software license.

Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.