Statistics and Data Mining
with Perl Data Language
Maggie Xiong
maggie at shutterstock.com
PDL::Stats
Get Used to Variability
Know Your Data
• Descriptive statistics
• Frequency distribution, aka histogram
  • E.g., length of words in search queries
• PDL::Stats::Distr
  • $data->plot_distr('gaussian')
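A frequency distribution is just a count per value. As a minimal sketch in plain Perl (no PDL required), the following tallies word lengths for a few made-up search queries and prints a text histogram. The queries are hypothetical and only illustrate the idea; they are not the data behind the slides.

```perl
use strict;
use warnings;

# Hypothetical search queries, invented for illustration.
my @queries = ('red rose', 'city skyline at night', 'happy family');

# Count how many words have each length.
my %freq;
for my $query (@queries) {
    $freq{ length $_ }++ for split ' ', $query;
}

# Print a text histogram, one row per word length.
for my $len (sort { $a <=> $b } keys %freq) {
    printf "%2d | %s\n", $len, '*' x $freq{$len};
}
```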
Central Tendency and Spread
• Central tendency
  • Mean, i.e., the average: 5.36
    • M = (x1 + x2 + x3 + ... + xn) / N
  • Median, the 50th percentile: 5
  • Mode, the most frequent value: 4
• Spread
  • Range, min to max: [1, 13]
  • Variance, the mean of squared deviations: 3.61
    • [(x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2] / N
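The formulas above can be sketched in a few lines of core Perl. The data set here is hypothetical (it is not the word-length data summarized on the slide), and the variance divides by N, as in the formula above.

```perl
use strict;
use warnings;
use List::Util qw(sum min max);

my @x = (1, 4, 4, 5, 6, 7, 8);   # hypothetical word lengths
my $n = @x;

# Central tendency
my $mean = sum(@x) / $n;

my @sorted = sort { $a <=> $b } @x;
my $median = $n % 2
    ? $sorted[ ($n - 1) / 2 ]
    : ($sorted[ $n / 2 - 1 ] + $sorted[ $n / 2 ]) / 2;

my %count;
$count{$_}++ for @x;
# Most frequent value; ties are broken in favor of the smaller value.
my ($mode) = sort { $count{$b} <=> $count{$a} || $a <=> $b } keys %count;

# Spread: mean of squared deviations from the mean (divides by N)
my $variance = sum(map { ($_ - $mean) ** 2 } @x) / $n;

printf "mean=%.2f median=%s mode=%s range=[%s,%s] variance=%.2f\n",
    $mean, $median, $mode, min(@x), max(@x), $variance;
```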
Abstraction of the Frequency Distribution
• Mean
• Variance -> standard deviation
  • SD = sqrt(variance)
• Normalized score (z score)
  • z = (x - M) / SD
• z score and probability
  • E.g., client file sizes
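To make the z score concrete, here is a small sketch in core Perl with hypothetical file sizes. It uses the population SD (dividing by N) and flags values with |z| >= 1.96; under a normal distribution roughly 95% of values fall inside that band. The data and the cutoff are illustrative assumptions, not taken from the slides.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @size_mb = (2, 4, 4, 4, 5, 5, 7, 9);   # hypothetical client file sizes
my $n = @size_mb;

my $mean = sum(@size_mb) / $n;
my $sd   = sqrt( sum(map { ($_ - $mean) ** 2 } @size_mb) / $n );

# z = (x - M) / SD for every observation
my @z = map { ($_ - $mean) / $sd } @size_mb;

for my $i (0 .. $#size_mb) {
    printf "%d MB -> z = %+.2f%s\n", $size_mb[$i], $z[$i],
        abs($z[$i]) >= 1.96 ? '  <- unusually large or small' : '';
}
```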
Inferential Statistics
• Sample and population
  • The amount of chocolate required to finish a task
  • Sample 1 mean: pdl(4,2,8,4)->avg; # 4.5
  • Sample 2 mean: pdl(4,8,6,6)->avg; # 6
• Sampling distribution of the mean
  • Standard error: the standard deviation of the sample means
  • SE = standard_deviation / sqrt(N)
  • pdl([4,2,8,4],[4,8,6,6])->se; # [1.2583, 0.8165]
• How do we know whether a difference is between two samples from the same population or between two populations?
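The standard errors quoted above can be checked by hand. The sketch below, in core Perl rather than PDL, computes SE for the two chocolate samples. It assumes the unbiased (N - 1) standard deviation, which is the choice that reproduces the values on the slide.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Standard error of the mean, using the unbiased (N - 1) standard deviation.
sub standard_error {
    my @x    = @_;
    my $n    = @x;
    my $mean = sum(@x) / $n;
    my $var  = sum(map { ($_ - $mean) ** 2 } @x) / ($n - 1);
    return sqrt($var) / sqrt($n);
}

my @se = map { standard_error(@$_) } ([4, 2, 8, 4], [4, 8, 6, 6]);
printf "SE = [%.4f, %.4f]\n", @se;   # SE = [1.2583, 0.8165]
```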
Hypothesis Testing
• The null hypothesis, H0
  • It is difficult to assume that the means are different and then try to confirm it.
  • Instead, assume that the means are not different and see whether we have evidence to reject that assumption.
  • H0: there is no real difference between the means
• Estimate the probability (p) of observing such a difference if the means come from the same population.
• Reject H0 if p < 0.05, i.e., accept that the means are different
  • z = (M1 - M2) / SE = (6 - 4.5) / 1.5 = 1, where SE is the standard error of the difference, sqrt(1.2583**2 + 0.8165**2) = 1.5
  • p = 2 * (1 - gsl_cdf_gaussian_P(abs(z), 1)) = 0.317
• In practice we use the t distribution instead of the z distribution for p values.
  • t_test(pdl(4,8,6,6), pdl(4,2,8,4)) # t = 1, df = 6
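The test statistics for the two chocolate samples can be reproduced with core Perl. This sketch computes z from the standard error of the difference, then a pooled-variance t with its degrees of freedom. The pooled (equal variance) form is an assumption here; it is the form consistent with df = 6 for two samples of four. Turning z or t into a p value needs a distribution function and is left to a statistics library.

```perl
use strict;
use warnings;
use List::Util qw(sum);

sub mean { sum(@_) / @_ }

# Sum of squared deviations from the sample mean.
sub ss {
    my $m = mean(@_);
    return sum(map { ($_ - $m) ** 2 } @_);
}

my @s1 = (4, 2, 8, 4);
my @s2 = (4, 8, 6, 6);
my ($n1, $n2) = (scalar @s1, scalar @s2);

# z: difference of means over the standard error of the difference
my $se_diff = sqrt( ss(@s1) / ($n1 - 1) / $n1 + ss(@s2) / ($n2 - 1) / $n2 );
my $z = (mean(@s2) - mean(@s1)) / $se_diff;

# t: same difference, but with a pooled variance estimate
my $df         = $n1 + $n2 - 2;
my $pooled_var = (ss(@s1) + ss(@s2)) / $df;
my $t = (mean(@s2) - mean(@s1)) / sqrt($pooled_var * (1 / $n1 + 1 / $n2));

printf "z = %.3f, t = %.3f, df = %d\n", $z, $t, $df;
```

With equal sample sizes the two denominators coincide, which is why z and t come out the same here; only the reference distribution used for the p value differs.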
A/B Test
• Continuous vs. nominal scale
• Binomial distribution
• Two-proportion z test
  • P = (x1 + x2) / (N1 + N2)
  • SE = sqrt( P * (1 - P) * (1/N1 + 1/N2) )
  • z = (x1/N1 - x2/N2) / SE
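A two-proportion z test for an A/B experiment might look like the sketch below, in core Perl. The conversion counts are invented for illustration, and the SE takes the square root of the pooled variance term. Comparing |z| with 1.96 corresponds to a two-sided test at the 0.05 level.

```perl
use strict;
use warnings;

# Hypothetical A/B results: conversions (x) out of visitors (N) per variant.
my ($x1, $n1) = (120, 1000);   # variant A
my ($x2, $n2) = (160, 1000);   # variant B

my ($p1, $p2) = ($x1 / $n1, $x2 / $n2);

# Pooled proportion under H0 (both variants share one conversion rate)
my $p_pooled = ($x1 + $x2) / ($n1 + $n2);

# Standard error of the difference between the two proportions
my $se = sqrt( $p_pooled * (1 - $p_pooled) * (1 / $n1 + 1 / $n2) );

my $z = ($p2 - $p1) / $se;

printf "p1 = %.3f, p2 = %.3f, z = %.2f\n", $p1, $p2, $z;
print abs($z) > 1.96
    ? "reject H0 at the 0.05 level\n"
    : "no evidence to reject H0\n";
```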
Relationship between Variables
• Pearson correlation (r)
  • Covariance = [(x1-Mx)(y1-My) + ... + (xn-Mx)(yn-My)] / N
  • r = COV / (SDx * SDy)
  • r ranges over [-1, 1]
• Correlation of length(kw) with kw [search_count, download_count, lightbox_count]
  • [-0.09, 0.10, 0.07]
• download_count and lightbox_count: 0.92
• zy = r * zx + ε
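The correlation formula can be followed step by step in core Perl. The paired scores below are hypothetical; covariance and both SDs divide by N, so the N's cancel in r.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical paired observations.
my @x = (1, 2, 3, 4, 5);
my @y = (2, 4, 5, 4, 5);
my $n = @x;

my ($mean_x, $mean_y) = (sum(@x) / $n, sum(@y) / $n);

# Covariance: mean of the products of paired deviations
my $cov = sum(map { ($x[$_] - $mean_x) * ($y[$_] - $mean_y) } 0 .. $#x) / $n;

my $sd_x = sqrt( sum(map { ($_ - $mean_x) ** 2 } @x) / $n );
my $sd_y = sqrt( sum(map { ($_ - $mean_y) ** 2 } @y) / $n );

# r = COV / (SDx * SDy), always between -1 and 1
my $r = $cov / ($sd_x * $sd_y);

printf "cov = %.2f, r = %.4f\n", $cov, $r;
```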
Linear Regression and Sum of Squares
• The linear model: Y = A0 + A1X1 + A2X2 + ... + AnXn
  • Estimate values for parameters A0 .. An using observed X and Y scores.
  • Given new X’s, calculate predicted Y’s with the estimated A0 .. An values.
• RMSE, root mean squared error
  • Standard deviation around predicted scores.
  • About 68% of the time the actual score will fall within 1 RMSE of the predicted score, assuming normally distributed errors.
• PDL::Stats::GLM ordinary least squares regression
  • ols and ols_t
my %m = $y->ols( $x );
print "$_\t$m{$_}\n" for sort keys %m;
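For a single predictor the least squares estimates have a closed form, which makes the idea easy to trace without PDL. The sketch below fits Y = A0 + A1X on hypothetical data in core Perl, then computes RMSE and a prediction for a new X. Here RMSE divides the squared error by N; other conventions divide by the residual degrees of freedom instead.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical observed scores.
my @x = (1, 2, 3, 4, 5);
my @y = (2, 4, 5, 4, 5);
my $n = @x;

my ($mean_x, $mean_y) = (sum(@x) / $n, sum(@y) / $n);

# Closed-form least squares estimates for one predictor
my $sxy = sum(map { ($x[$_] - $mean_x) * ($y[$_] - $mean_y) } 0 .. $#x);
my $sxx = sum(map { ($_ - $mean_x) ** 2 } @x);
my $slope     = $sxy / $sxx;                     # A1
my $intercept = $mean_y - $slope * $mean_x;      # A0

# Predicted scores and root mean squared error
my @pred = map { $intercept + $slope * $_ } @x;
my $rmse = sqrt( sum(map { ($y[$_] - $pred[$_]) ** 2 } 0 .. $#y) / $n );

printf "Y = %.2f + %.2f * X, RMSE = %.3f\n", $intercept, $slope, $rmse;
printf "predicted Y at X = 6: %.2f\n", $intercept + $slope * 6;
```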
Sum of Squared Deviations (SS)
• SStotal = (x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2
• Var = [(x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2] / N
• SD = sqrt(Var)
• SSerror = (y1 - Ypred.1)**2 + (y2 - Ypred.2)**2 + ... + (yn - Ypred.n)**2
• SSmodel = SStotal - SSerror
• R2 = SSmodel / SStotal
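The sum of squares decomposition can be verified numerically. This core Perl sketch fits a one-predictor regression on hypothetical data and then computes SStotal, SSerror, SSmodel and R2 as defined above. For a single predictor R2 equals the squared Pearson correlation, which gives a handy cross-check.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical observed scores.
my @x = (1, 2, 3, 4, 5);
my @y = (2, 4, 5, 4, 5);
my $n = @x;

my ($mean_x, $mean_y) = (sum(@x) / $n, sum(@y) / $n);

# Least squares fit with one predictor
my $slope = sum(map { ($x[$_] - $mean_x) * ($y[$_] - $mean_y) } 0 .. $#x)
          / sum(map { ($_ - $mean_x) ** 2 } @x);
my $intercept = $mean_y - $slope * $mean_x;
my @pred = map { $intercept + $slope * $_ } @x;

# Total variation in Y, and the part left unexplained by the model
my $ss_total = sum(map { ($_ - $mean_y) ** 2 } @y);
my $ss_error = sum(map { ($y[$_] - $pred[$_]) ** 2 } 0 .. $#y);
my $ss_model = $ss_total - $ss_error;
my $r2       = $ss_model / $ss_total;

printf "SStotal = %.1f, SSerror = %.1f, SSmodel = %.1f, R2 = %.2f\n",
    $ss_total, $ss_error, $ss_model, $r2;
```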
K-means Cluster Analysis
• SStotal = (x1 - Mx)**2 + (x2 - Mx)**2 + ... + (xn - Mx)**2
          + (y1 - My)**2 + (y2 - My)**2 + ... + (yn - My)**2
          + ...
• SSerror = (xc1.1 - Mc1.x)**2 + (xc1.2 - Mc1.x)**2 + ... + (xcn.n - Mcn.x)**2
          + (yc1.1 - Mc1.y)**2 + (yc1.2 - Mc1.y)**2 + ... + (ycn.n - Mcn.y)**2
          + ...
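To show how these two sums of squares behave, here is a minimal k-means sketch in core Perl on six hypothetical 2-D points. It is a naive version of the usual assign-and-update loop with a fixed, hand-picked start (the first and last observations), not the PDL::Stats::Kmeans implementation. The "explained" ratio, 1 - SSerror / SStotal, is borrowed by analogy from R2 in regression.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical 2-D observations forming two loose groups.
my @points = ([1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]);
my $k = 2;

# Squared Euclidean distance between two points.
sub sq_dist {
    my ($p, $q) = @_;
    return sum(map { ($p->[$_] - $q->[$_]) ** 2 } 0 .. $#$p);
}

# Dimension-wise mean of a list of points.
sub centroid {
    my @pts = @_;
    return [ map { my $d = $_; sum(map { $_->[$d] } @pts) / @pts }
             0 .. $#{ $pts[0] } ];
}

# Naive start: the first and last observations serve as initial centroids.
my @centers = ($points[0], $points[-1]);
my @cluster = (-1) x @points;

while (1) {
    # Assignment step: each point joins its nearest centroid.
    my @new = map {
        my $p = $_;
        my ($best) = sort { sq_dist($p, $centers[$a]) <=> sq_dist($p, $centers[$b]) }
                     0 .. $k - 1;
        $best;
    } @points;
    last if "@new" eq "@cluster";   # stop when assignments no longer change
    @cluster = @new;

    # Update step: move each centroid to the mean of its members.
    for my $c (0 .. $k - 1) {
        my @members = map { $points[$_] } grep { $cluster[$_] == $c } 0 .. $#points;
        $centers[$c] = centroid(@members) if @members;
    }
}

# SStotal: deviations from the grand mean; SSerror: deviations from cluster means.
my $grand    = centroid(@points);
my $ss_total = sum(map { sq_dist($_, $grand) } @points);
my $ss_error = sum(map { sq_dist($points[$_], $centers[ $cluster[$_] ]) } 0 .. $#points);

print "clusters: @cluster\n";
printf "SStotal = %.3f, SSerror = %.3f, explained = %.3f\n",
    $ss_total, $ss_error, 1 - $ss_error / $ss_total;
```

A good clustering leaves little variation inside clusters, so SSerror is small relative to SStotal. Real data usually needs several random starts, because the result of this loop depends on the initial centroids.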