Statistics and Data Mining with Perl Data Language


Maggie Xiong

maggie at shutterstock.com

PDL::Stats

Get Used to Variability



Know Your Data

• Descriptive statistics

• Frequency distribution, aka histogram

• E.g., length of words in search queries

• PDL::Stats::Distr

• $data->plot_distr('gaussian')

Central Tendency and Spread

• Central tendency

• Mean, i.e. average, 5.36

• M = (x1 + x2 + x3 + ... + xn) / N

• Median: the 50th percentile, 5

• Mode: the most frequent value, 4

• Spread

• Range, min to max, [1,13]

• Variance, mean of squared deviations, 3.61

• [(x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2] / N
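The measures on this slide can be sketched in core Perl without PDL. The word lengths below are hypothetical and are not the data behind the figures quoted above; the variance divides by N, following the slide's formula.

```perl
use strict;
use warnings;
use List::Util qw(sum min max);

# Hypothetical word lengths; not the data behind the slide's figures.
my @x = (1, 3, 4, 4, 4, 5, 6, 7, 13);
my $n = @x;

# Mean: sum of the scores divided by N.
my $mean = sum(@x) / $n;

# Median: middle value of the sorted scores (average of the two middle
# values when N is even).
my @s = sort { $a <=> $b } @x;
my $median = $n % 2
    ? $s[ ($n - 1) / 2 ]
    : ( $s[ $n / 2 - 1 ] + $s[ $n / 2 ] ) / 2;

# Mode: the most frequent value (the smallest one wins a tie).
my %count;
$count{$_}++ for @x;
my ($mode) = sort { $count{$b} <=> $count{$a} || $a <=> $b } keys %count;

# Range and population variance (divide by N, as in the formula above).
my ($lo, $hi) = ( min(@x), max(@x) );
my $var = sum( map { ($_ - $mean) ** 2 } @x ) / $n;

printf "mean=%.2f median=%s mode=%s range=[%s,%s] var=%.2f\n",
    $mean, $median, $mode, $lo, $hi, $var;
```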

Abstraction of the Frequency Distribution

• Mean

• Variance -> standard deviation

• SD = sqrt( variance )

• Normalized score (z score)

• z = (x - M) / SD

• z score and probability

• E.g., client file sizes
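A minimal sketch of the z-score formula in core Perl. The file sizes are hypothetical, chosen so that the mean is 5 and the population SD is 2.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical file sizes in MB.
my @size = (2, 4, 4, 4, 5, 5, 7, 9);
my $n    = @size;

my $mean = sum(@size) / $n;
my $sd   = sqrt( sum( map { ($_ - $mean) ** 2 } @size ) / $n );   # population SD

# z = (x - M) / SD: distance from the mean in SD units.
my @z = map { ($_ - $mean) / $sd } @size;

printf "M=%.2f SD=%.2f\n", $mean, $sd;
printf "%d MB -> z = %+.2f\n", $size[$_], $z[$_] for 0 .. $#size;
```

With these numbers a 9 MB file sits at z = +2.00, i.e. two standard deviations above the mean.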

Inferential Statistics

• Sample and population

• The amount of chocolate required to finish a task

• Sample 1 mean: pdl(4,2,8,4)->avg; # 4.5

• Sample 2 mean: pdl(4,8,6,6)->avg; # 6

• Sampling distribution of the mean

• Standard error: standard deviation of the sample means

• SE = standard_deviation / sqrt(N)

• pdl([4,2,8,4],[4,8,6,6])->se; # [1.2583, 0.8165]

• How do we know whether the difference is between two samples from the same population or between two populations?
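The two standard errors quoted above can be reproduced in core Perl. This sketch assumes the unbiased (N - 1) standard deviation, which is what matches the slide's values; the helper name `se` is only illustrative.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Standard error of the mean, using the unbiased (N - 1) standard deviation.
sub se {
    my @x    = @_;
    my $n    = @x;
    my $mean = sum(@x) / $n;
    my $var  = sum( map { ($_ - $mean) ** 2 } @x ) / ($n - 1);
    return sqrt($var) / sqrt($n);
}

my $se1 = se(4, 2, 8, 4);
my $se2 = se(4, 8, 6, 6);
printf "SE1=%.4f SE2=%.4f\n", $se1, $se2;   # SE1=1.2583 SE2=0.8165
```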

Hypothesis Testing

• The null hypothesis, H0

• It is difficult to assume that the means are different and try to confirm it.

• Instead, assume that the means are not different and see whether we have evidence to reject that assumption.

• H0: there is no real difference between the means

• Estimate the probability of observing such a difference if the means come from the same population.

• Reject H0 if p < 0.05, i.e., accept that the means are different

• z = (M1 - M2) / SE = (6 - 4.5) / 1.5 = 1, where SE = sqrt(SE1**2 + SE2**2) is the standard error of the difference

• p = 2 * (1 - gsl_cdf_gaussian_P(abs(z),1)) = 0.317

• In practice we use the t-distribution instead of the z-distribution for p values.

• t_test(pdl(4,8,6,6), pdl(4,2,8,4)) # t = 1, df = 6
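The arithmetic on this slide can be checked in core Perl. This is a sketch under two assumptions: `erfc` from the POSIX module (available since Perl 5.22) stands in for the GSL normal CDF, and the t statistic uses the pooled-variance (equal variance) form. The helper names `mean` and `ss` are illustrative only.

```perl
use strict;
use warnings;
use List::Util qw(sum);
use POSIX qw(erfc);    # C99 function, exposed by POSIX since Perl 5.22

sub mean { sum(@_) / @_ }

# Sum of squared deviations from the mean.
sub ss { my $m = mean(@_); sum( map { ($_ - $m) ** 2 } @_ ) }

my @s1 = (4, 2, 8, 4);
my @s2 = (4, 8, 6, 6);
my ($n1, $n2) = ( scalar @s1, scalar @s2 );

# SE of the difference, built from the two unbiased variances.
my $se_diff = sqrt( ss(@s1) / ($n1 - 1) / $n1 + ss(@s2) / ($n2 - 1) / $n2 );
my $z = ( mean(@s2) - mean(@s1) ) / $se_diff;
my $p = erfc( abs($z) / sqrt(2) );    # two-tailed p under the normal curve

# Pooled-variance t statistic with N1 + N2 - 2 degrees of freedom.
my $df     = $n1 + $n2 - 2;
my $pooled = ( ss(@s1) + ss(@s2) ) / $df;
my $t      = ( mean(@s2) - mean(@s1) ) / sqrt( $pooled * (1 / $n1 + 1 / $n2) );

printf "z=%.3f p=%.3f t=%.3f df=%d\n", $z, $p, $t, $df;   # z=1.000 p=0.317 t=1.000 df=6
```

With p = 0.317 well above 0.05, these two samples give no grounds to reject H0.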

A/B Test

• Continuous vs. nominal scale

• Binomial distribution

• Two proportion z test

• SE = sqrt( P(1 - P)(1/N1 + 1/N2) )

• P = (x1 + x2) / (N1 + N2)
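A minimal sketch of the two proportion z test in core Perl. The conversion counts are hypothetical, and the final step z = (p2 - p1) / SE is the standard form of the test rather than something spelled out on the slide.

```perl
use strict;
use warnings;

# Hypothetical A/B counts: conversions (x) out of visitors (N).
my ($x1, $n1) = (120, 1000);    # variant A
my ($x2, $n2) = (150, 1000);    # variant B

# Pooled proportion under H0 (no difference between the variants).
my $P = ($x1 + $x2) / ($n1 + $n2);

# Standard error of the difference between the two proportions.
my $se = sqrt( $P * (1 - $P) * (1 / $n1 + 1 / $n2) );

# z statistic for the observed difference in proportions.
my $z = ( $x2 / $n2 - $x1 / $n1 ) / $se;

printf "P=%.3f SE=%.4f z=%.3f\n", $P, $se, $z;
```

An absolute z above roughly 1.96 corresponds to p < 0.05 in a two-tailed test.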

Relationship between Variables

• Pearson correlation (r)

• Covariance = [(x1-Mx)(y1-My) + ... + (xn-Mx)(yn-My)] / N

• r = COV / (SDx * SDy)

• r lies in [-1, 1]

• length(kw) and kw [search_count, download_count, lightbox_count]

• [-0.09, 0.10, 0.07]

• download_count and lightbox_count: 0.92

• zy = r * zx + ε
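The covariance and correlation formulas translate directly to core Perl. The paired scores below are hypothetical; both the covariance and the SDs divide by N, as in the formulas above.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical paired scores.
my @x = (1, 2, 3, 4, 5);
my @y = (2, 4, 5, 4, 5);
my $n = @x;

my $mx = sum(@x) / $n;
my $my = sum(@y) / $n;

# Covariance: mean of the products of paired deviations.
my $cov = sum( map { ($x[$_] - $mx) * ($y[$_] - $my) } 0 .. $n - 1 ) / $n;

# Population standard deviations (divide by N).
my $sdx = sqrt( sum( map { ($_ - $mx) ** 2 } @x ) / $n );
my $sdy = sqrt( sum( map { ($_ - $my) ** 2 } @y ) / $n );

# Pearson r: covariance scaled by the product of the two SDs.
my $r = $cov / ($sdx * $sdy);

printf "cov=%.2f r=%.4f\n", $cov, $r;   # cov=1.20 r=0.7746
```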

Linear Regression and Sum of Squares

• The linear model: Y = A0 + A1X1 + A2X2 + ... + AnXn

• Estimate values for parameters A0 .. An using observed X and Y scores.

• Given new X's, calculate predicted Y's with the estimated A0 .. An values.

• RMSE: root mean squared error

• Standard deviation around predicted scores.

• About 68% of the time the actual score will fall within 1 RMSE of the predicted score, assuming normally distributed errors.

• PDL::Stats::GLM ordinary least squares regression

• ols and ols_t

my %m = $y->ols( $x );

print "$_\t$m{$_}\n" for sort keys %m;
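For intuition, the single-predictor case (Y = A0 + A1X) has a closed-form least squares solution that fits in a few lines of core Perl. This is a sketch with hypothetical data, not a substitute for `ols`, which handles multiple predictors; the RMSE here divides by N.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical observed scores.
my @x = (1, 2, 3, 4, 5);
my @y = (2, 4, 5, 4, 5);
my $n = @x;

my $mx = sum(@x) / $n;
my $my = sum(@y) / $n;

# Least squares estimates for one predictor:
# A1 = sum((x - Mx)(y - My)) / sum((x - Mx)**2), A0 = My - A1 * Mx.
my $a1 = sum( map { ($x[$_] - $mx) * ($y[$_] - $my) } 0 .. $n - 1 )
       / sum( map { ($_ - $mx) ** 2 } @x );
my $a0 = $my - $a1 * $mx;

# Predicted scores and root mean squared error.
my @pred = map { $a0 + $a1 * $_ } @x;
my $rmse = sqrt( sum( map { ($y[$_] - $pred[$_]) ** 2 } 0 .. $n - 1 ) / $n );

printf "A0=%.2f A1=%.2f RMSE=%.4f\n", $a0, $a1, $rmse;   # A0=2.20 A1=0.60 RMSE=0.6928
```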

Sum of Squared Deviations (SS)

• SStotal = (x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2

• Var = [(x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2] / N

• SD = sqrt(Var)

• SSerror = (y1 - Ypred.1)**2 + (y2 - Ypred.2)**2 + ... + (yn - Ypred.n)**2

• SSmodel = SStotal - SSerror

• R**2 = SSmodel / SStotal
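The decomposition above can be traced with a few numbers. In this sketch the observed scores and the predicted scores are both hypothetical; the predictions are assumed to come from a least squares line fitted to the same observations.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical observed scores and predictions from a fitted line.
my @y    = (2,   4,   5,   4,   5);
my @pred = (2.8, 3.4, 4.0, 4.6, 5.2);
my $n    = @y;

my $mean = sum(@y) / $n;

# SStotal: squared deviations of the observed scores from their mean.
my $ss_total = sum( map { ($_ - $mean) ** 2 } @y );

# SSerror: squared deviations of the observed scores from the predictions.
my $ss_error = sum( map { ($y[$_] - $pred[$_]) ** 2 } 0 .. $n - 1 );

# SSmodel is what the model accounts for; R**2 is its share of SStotal.
my $ss_model = $ss_total - $ss_error;
my $r2       = $ss_model / $ss_total;

printf "SStotal=%.1f SSerror=%.1f SSmodel=%.1f R2=%.2f\n",
    $ss_total, $ss_error, $ss_model, $r2;   # SStotal=6.0 SSerror=2.4 SSmodel=3.6 R2=0.60
```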

K-means Cluster Analysis

• SStotal = (x1 - Mx)**2 + (x2 - Mx)**2 + ... + (xn - Mx)**2

+ (y1 - My)**2 + (y2 - My)**2 + ... + (yn - My)**2

+ ...

K-means Cluster Analysis

• SSerror = (xc1.1 - Mc1.x)**2 + (xc1.2 - Mc1.x)**2 + ... + (xcn.n - Mcn.x)**2

+ (yc1.1 - Mc1.y)**2 + (yc1.2 - Mc1.y)**2 + ... + (ycn.n - Mcn.y)**2

+ ...
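The two sums of squares can be compared for a given cluster assignment. This sketch does not run k-means itself; the points and their cluster labels are hypothetical, and the helper `ss` is only illustrative. It shows how SSerror shrinks relative to SStotal when the clusters are tight.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical 2-D points with a fixed cluster assignment (0 or 1).
my @points  = ( [1, 1], [2, 2], [3, 3], [8, 8], [9, 10], [10, 9] );
my @cluster = ( 0, 0, 0, 1, 1, 1 );

# Sum of squared deviations from the mean, added up over every dimension.
sub ss {
    my @pts   = @_;
    my $total = 0;
    for my $d ( 0 .. $#{ $pts[0] } ) {
        my @v = map { $_->[$d] } @pts;
        my $m = sum(@v) / @v;
        $total += sum( map { ($_ - $m) ** 2 } @v );
    }
    return $total;
}

# SStotal: deviations from the grand mean of all points.
my $ss_total = ss(@points);

# SSerror: the same quantity computed inside each cluster, then summed.
my %members;
push @{ $members{ $cluster[$_] } }, $points[$_] for 0 .. $#points;
my $ss_error = sum( map { ss(@$_) } values %members );

printf "SStotal=%g SSerror=%g\n", $ss_total, $ss_error;   # SStotal=155 SSerror=8
```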

Technology
