DATA ANALYSIS Module Code :CA660 (Application Areas: Bio-, Business, Social, Environment etc.)


STRUCTURE of Investigation/DA

[Diagram: structure of an investigation, with the following labels.]

Level of Measurement
Distributional Assumptions, Probability, Estimation properties
Basis: Size/Type of Data Set/Tools
Parametric / Non-Parametric
Study techniques / Lab. techniques
Estimation/H.T. / H.T.
1, 2, many samples
E.D., Regn., C.T.
Replication, Assays, Counts


Probability & Statistics Primer -overview

Note: Short overview. Other statistical distributions in lectures


Summary Statistics - Descriptive

In the analysis of practical sets of data, it is useful to define a small number of values that summarise the main features present. We derive (i) representative values, (ii) measures of spread and (iii) measures of skewness and other characteristics.

Representative Values

Sometimes called measures of location or measures of central tendency.

1. Random Value

Given a set of data S = { x1, x2, … , xn }, we select a random number, say k, in the range 1 to n and return the value xk. This method of generating a representative value is straightforward, but extreme values can occur and successive values can vary considerably from one another.

2. Arithmetic Mean

For the set S above, the arithmetic mean (or just mean) is

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}.$$

If x1 occurs f1 times, x2 occurs f2 times and so on, we get the formula

$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_n x_n}{f_1 + f_2 + \dots + f_n},$$

written

$$\bar{x} = \frac{\sum_i f_i x_i}{\sum_i f_i}.$$

Example 1. Data are student marks in an examination. Find the average mark for the class.

Mark Range    Mid-Point xi    Number of Students fi    fi xi
0 - 19        10              2                        20
20 - 39       30              6                        180
40 - 59       50              12                       600
60 - 79       70              25                       1750
80 - 99       90              5                        450
Sum           -               50                       3000

The arithmetic mean is $\bar{x}$ = 3000 / 50 = 60 marks.

Note 1: Marks are given as ranges, so care is needed in interpreting each range. All intervals must be of equal rank and there must be no gaps in the classification. We interpret the range 0 - 19 to contain marks greater than 0 and less than or equal to 20. Thus, the mid-point is 10. The other intervals are interpreted accordingly.

Note 2: Pivot. If weights of size fi are suspended from a metre stick at the points xi, then the average is the centre of gravity of the distribution. Consequently, it is very sensitive to outlying values.

Note 3: The population should be homogeneous for the average to be meaningful. For example, if we assume that the typical height of girls in a class is less than that of boys, then the average height of all students is indicative neither of the girls nor of the boys.
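The weighted-mean formula can be checked on Example 1 with a few lines of Python. This is only an illustrative sketch: the names `midpoints`, `frequencies` and `grouped_mean` are invented here and are not part of the module notes.

```python
# Grouped data from Example 1: class mid-points and class frequencies.
midpoints = [10, 30, 50, 70, 90]
frequencies = [2, 6, 12, 25, 5]


def grouped_mean(x, f):
    """Frequency-weighted arithmetic mean: sum(f_i * x_i) / sum(f_i)."""
    return sum(fi * xi for fi, xi in zip(f, x)) / sum(f)


mean_mark = grouped_mean(midpoints, frequencies)
print(mean_mark)  # 3000 / 50 = 60.0
```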


3. The Mode

This is the value that occurs most frequently. By common agreement, it is calculated from the histogram using linear interpolation on the modal class.

The various similar triangles in the diagram generate the common ratios. In our case, the modal class is 60 to 80 with frequency 25 and its neighbouring classes have frequencies 12 and 5, so the differences are 13 and 20 and the mode is

60 + { 13 / (13 + 20) } (20) = 60 + (13 / 33) (20) = 67.8 marks.

4. The Median

The middle point of the distribution. If { x1, x2, … , xn } are the marks of students in a class, arranged in nondecreasing order, then the median is the mark of the (n + 1)/2-th student. We often use the ogive, or cumulative frequency diagram, to calculate it. In our case, 20 students score below 60 and the median position is (50 + 1)/2 = 25.5, so the median is

60 + (5.5 / 25) (20) = 64.4 marks.
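The two interpolation rules can be sketched in Python as follows. The function names and argument order are assumptions made for illustration; the class boundaries and frequencies are those of Example 1.

```python
def interpolated_mode(lower, width, f_modal, f_prev, f_next):
    """Mode by linear interpolation inside the modal class."""
    d1 = f_modal - f_prev  # excess over the class to the left
    d2 = f_modal - f_next  # excess over the class to the right
    return lower + d1 / (d1 + d2) * width


def interpolated_median(lower, width, n, cum_below, f_class):
    """Median by linear interpolation on the cumulative frequency."""
    position = (n + 1) / 2  # rank of the middle student
    return lower + (position - cum_below) / f_class * width


# Modal and median class is 60 to 80: frequency 25, neighbours 12 and 5,
# with 2 + 6 + 12 = 20 students below 60.
mode = interpolated_mode(60, 20, 25, 12, 5)
median = interpolated_median(60, 20, 50, 20, 25)
print(round(mode, 2), round(median, 2))
```

Unrounded, the mode comes out at about 67.88, which the slide quotes as 67.8.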

[Figure: histogram of frequency against marks (0 to 100), used to interpolate the mode, and ogive of cumulative frequency against marks, used to read off the median at cumulative frequency 25.5.]

Measures of Dispersion or Scattering

Example 2. The distribution shown has the same arithmetic mean as Example 1, but the values are more dispersed. This illustrates that an average value alone may not adequately describe statistical distributions.

Marks xj    Frequency fj    fj xj
10          6               60
30          8               240
50          6               300
70          15              1050
90          15              1350
Sums        50              3000

To devise a formula that captures the degree to which a distribution is concentrated about the average, we consider the deviations of the values from the average. If the distribution is concentrated around the mean, then the deviations will be small, while if it is very scattered, then the deviations will be large. The average of the squares of the deviations is called the variance and this is used as a measure of dispersion.

The square root of the variance is the standard deviation. It has the same units of measurement as the original values and is the preferred measure of dispersion in many applications.

[Figure: deviations of the values x1, … , x6 from the mean $\bar{x}$.]

Variance & Standard Deviation

$$\sigma^2 = \mathrm{VAR}[X] = \text{average of the squared deviations} = \frac{\sum_i f_i (x_i - \bar{x})^2}{\sum_i f_i} = \frac{\sum_i f_i x_i^2}{\sum_i f_i} - \bar{x}^2,$$

the last form being called the product moment formula. The standard deviation is $\sigma = \sqrt{\text{Variance}}$.

Example 1                            Example 2
f     x     f x     f x^2            f     x     f x     f x^2
2     10    20      200              6     10    60      600
6     30    180     5400             8     30    240     7200
12    50    600     30000            6     50    300     15000
25    70    1750    122500           15    70    1050    73500
5     90    450     40500            15    90    1350    121500
50          3000    198600           50          3000    217800

Example 1: VAR[X] = 198600 / 50 - (60)^2 = 372 marks^2
Example 2: VAR[X] = 217800 / 50 - (60)^2 = 756 marks^2
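As a check on the product moment formula, the short Python sketch below recomputes both variances from the frequency tables. The helper name `grouped_variance` is hypothetical and chosen only for this illustration.

```python
import math


def grouped_variance(x, f):
    """Product moment formula: sum(f x^2) / sum(f) - mean^2."""
    n = sum(f)
    mean = sum(fi * xi for fi, xi in zip(f, x)) / n
    mean_of_squares = sum(fi * xi ** 2 for fi, xi in zip(f, x)) / n
    return mean_of_squares - mean ** 2


marks = [10, 30, 50, 70, 90]
var_example1 = grouped_variance(marks, [2, 6, 12, 25, 5])
var_example2 = grouped_variance(marks, [6, 8, 6, 15, 15])
sd_example1 = math.sqrt(var_example1)

print(var_example1, var_example2)  # 372.0 756.0
print(round(sd_example1, 3))       # 19.287
```

Both examples share the mean of 60 marks, yet the second variance is roughly twice the first, which is the point the two tables are making.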


Other Summary Statistics

Skewness

An important attribute of a statistical distribution is its degree of symmetry. The word “skew” means a tail, so distributions with a large tail of outlying values on the right-hand side are positively skewed, or skewed to the right. Negative skewness is defined similarly. A simple formula for skewness is

Skewness = ( Mean - Mode ) / Standard Deviation

which for Example 1 is:

Skewness = (60 - 67.8) / 19.287 = - 0.4044.

Coefficient of Variation

This formula was devised to ‘standardise’ the spread relative to the arithmetic mean, so that comparisons can be drawn between different distributions. Not universally used.

Coefficient of Variation = Standard Deviation / Mean.

Semi-Interquartile Range

The Median is the mid or 0.5 point in a distribution. The quartiles Q1, Q2, Q3 correspond to the 0.25, 0.50 and 0.75 points. An alternative measure of dispersion is thus

Semi-Interquartile Range = ( Q3 - Q1 ) / 2.

Geometric Mean

For data that grow geometrically, e.g. economic data with a high inflation effect, another mean is sometimes used. The G.M. is defined as the N-th root of the product of the values, each raised to its frequency, where $N = \sum f$:

$$\text{G.M.} = \sqrt[N]{x_1^{f_1}\, x_2^{f_2} \cdots x_k^{f_k}}$$

Regression

[Example 3.] As a motivating example, suppose we model sales data over time.

SALES   3       5       4       5       6       7
TIME    1990    1991    1992    1993    1994    1995

We want the straight line “Y = m X + c” that best approximates the data. “Best” in this case is the line which minimizes the sum of squares of vertical deviations of points from the line:

$$SS = \sum_i ( Y_i - [ m X_i + c ] )^2$$

Setting the partial derivatives of SS w.r.t. m and c to zero leads to the “Normal Equations”

$$\sum Y = m \sum X + n c, \qquad \sum X Y = m \sum X^2 + c \sum X,$$

where n = number of points.

Let 1990 correspond to Year 0.

X.X     X      X.Y     Y      Y.Y
0       0      0       3      9
1       1      5       5      25
4       2      8       4      16
9       3      15      5      25
16      4      24      6      36
25      5      35      7      49
55      15     87      30     160     (sums)

[Figure: scatter plot of Sales against Time with the fitted line Y = m X + c, showing the vertical deviation of a point Yi from the fitted value m Xi + c.]

Example 3 - Working

The normal equations are:

30 = 15 m + 6 c    =>  150 = 75 m + 30 c
87 = 55 m + 15 c   =>  174 = 110 m + 30 c

Subtracting the first from the second gives 24 = 35 m, so m = 24/35, and 30 = 15 (24 / 35) + 6 c => c = 23/7.

Thus the regression line of Y on X is

Y = (24/35) X + (23/7)

and to plot the line we just need two points, so X = 0 => Y = 23/7 and X = 5 => Y = (24/35) 5 + 23/7 = 47/7.
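The working above can be reproduced exactly with rational arithmetic. The sketch below solves the two normal equations in closed form; the variable names are illustrative assumptions, and `fractions.Fraction` is used only so that the results print as 24/35 and 23/7 rather than as rounded decimals.

```python
from fractions import Fraction

sales = [3, 5, 4, 5, 6, 7]
years = [1990, 1991, 1992, 1993, 1994, 1995]
x = [year - 1990 for year in years]  # 1990 corresponds to Year 0

n = len(x)
sum_x = sum(x)
sum_y = sum(sales)
sum_xx = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, sales))

# Closed-form solution of the two normal equations for slope and intercept.
m = Fraction(n * sum_xy - sum_x * sum_y, n * sum_xx - sum_x ** 2)
c = (Fraction(sum_y) - m * sum_x) / n

print(m, c)       # 24/35 23/7
print(m * 5 + c)  # 47/7, the fitted value at X = 5
```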

It is easy to see that ( $\bar{X}$, $\bar{Y}$ ) satisfies the first normal equation (divide it by n), so that the regression line of Y on X passes through the “Centre of Gravity” of the data. By expanding terms, we get

$$\sum_i ( Y_i - \bar{Y} )^2 = \sum_i ( Y_i - [ m X_i + c ] )^2 + \sum_i ( [ m X_i + c ] - \bar{Y} )^2$$

Total Sum of Squares = Error Sum of Squares + Regression Sum of Squares

SST = SSE + SSR

Distinguish the independent and dependent variables (X and Y respectively).

[Figure: a data point Yi, its fitted value m Xi + c and the mean $\bar{Y}$, plotted against X, illustrating the split of the total deviation into error and regression parts.]

Correlation

The coefficient of determination $r^2$ (which takes values in the range 0 to 1) is a measure of the proportion of the total variation that is associated with the regression process:

$$r^2 = SSR / SST = 1 - SSE / SST.$$

The coefficient of correlation ‘r’ (values in the range -1 to +1) is a more common measure of the degree to which a mathematical relationship exists between X and Y. It can be calculated as:

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\{ n \sum X^2 - (\sum X)^2 \}\{ n \sum Y^2 - (\sum Y)^2 \}}}$$

Example. In our case, r = { 6(87) - (15)(30) } / √[ { 6(55) - (15)^2 } { 6(160) - (30)^2 } ] = 0.907.
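The same sums give the correlation coefficient directly. This is a minimal sketch using only the standard library; the variable names are chosen here for illustration.

```python
import math

x = [0, 1, 2, 3, 4, 5]  # years since 1990
y = [3, 5, 4, 5, 6, 7]  # sales

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(a * a for a in x)
sum_yy = sum(b * b for b in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Computational form of the correlation coefficient.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
)
r_squared = r ** 2  # coefficient of determination

print(round(r, 3))  # 0.907
```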

[Figure: scatter plots illustrating r = -1, r = 0 and r = +1.]

Collinearity

For a correlation coefficient value > 0.9 or < - 0.9, we would take this to mean that there is a mathematical relationship between the variables. This does not imply that a cause-and-effect relationship exists.

E.g. consider a country with a slowly changing population size, where a certain political party retains a relatively stable percentage of the poll in elections. Let

X = Number of people that vote for the party in an election
Y = Number of people that die of a given disease in a year
Z = Population size.

Then the correlation coefficient between X and Y is ~1, indicating a mathematical relationship between them (i.e. X is a function of Z and Y is a function of Z also). It would clearly be silly to suggest that the incidence of disease is caused by the number of people that vote for the given political party. This is known as the problem of collinearity.

Spotting hidden dependencies is non-trivial. Statistical experimentation can only be used to disprove hypotheses, or to lend evidence to support the view that reputed relationships between variables may be valid. Thus, a high correlation coefficient between deaths due to heart failure in a given year and the number of cigarettes consumed twenty years earlier does not establish a cause-and-effect relationship, though it may be useful to guide research.

Overview of Probability Theory

In statistical theory, an experiment is any operation that can be replicated infinitely often and gives rise to a set of elementary outcomes, which are deemed to be equally likely. The sample space S of the experiment is the set of all possible outcomes of the experiment. Any subset E of the sample space is called an event. An event E occurs whenever any of its elements is an outcome of the experiment. The probability of occurrence of E is

P{E} = (Number of elementary outcomes in E) / (Number of elementary outcomes in S)

The complement $\bar{E}$ of an event E is the set of all elements that belong to S but not to E. The union $E_1 \cup E_2$ of two events is the set of all outcomes that belong to E1 or to E2 or to both. The intersection $E_1 \cap E_2$ of two events is the set of all outcomes that belong to both E1 and E2.

Two events are mutually exclusive if occurrence of either precludes occurrence of the other, i.e. their intersection is the empty set $\emptyset$. Two events are independent if occurrence of either is unaffected by occurrence or non-occurrence of the other event.

Theorem of Total Probability.

$$P\{E_1 \cup E_2\} = P\{E_1\} + P\{E_2\} - P\{E_1 \cap E_2\}$$

Proof. Let $n_{1,0}$, $n_{0,2}$, $n_{1,2}$ and $n_{0,0}$ be the numbers of outcomes in E1 only, in E2 only, in both and in neither, so that $n = n_{0,0} + n_{1,0} + n_{0,2} + n_{1,2}$. Then

$$P\{E_1 \cup E_2\} = \frac{n_{1,0} + n_{1,2} + n_{0,2}}{n} = \frac{n_{1,0} + n_{1,2}}{n} + \frac{n_{1,2} + n_{0,2}}{n} - \frac{n_{1,2}}{n} = P\{E_1\} + P\{E_2\} - P\{E_1 \cap E_2\}$$

Corollary. If E1 and E2 are mutually exclusive, $P\{E_1 \cup E_2\} = P\{E_1\} + P\{E_2\}$ (see Axioms and Addition Rule).

[Figure: Venn diagrams of an event E within S, and of E1 and E2 within S with the outcome counts $n_{1,0}$, $n_{1,2}$, $n_{0,2}$ and $n_{0,0}$.]

The probability P{E1 | E2} that E1 occurs, given that E2 has occurred (or must occur), is called the conditional probability of E1. Note: the only possible outcomes of the experiment are now confined to E2 and not to S.

Theorem of Compound Probability (Multiplication Rule).

$$P\{E_1 \cap E_2\} = P\{E_1 \mid E_2\}\, P\{E_2\}$$

Proof. $P\{E_1 \cap E_2\} = n_{1,2} / n = \{ n_{1,2} / (n_{1,2} + n_{0,2}) \} \{ (n_{1,2} + n_{0,2}) / n \} = P\{E_1 \mid E_2\}\, P\{E_2\}$.

Corollary. If E1 and E2 are independent, $P\{E_1 \cap E_2\} = P\{E_1\}\, P\{E_2\}$. This is a special case of the Multiplication Rule.

Note: If E is itself compound, the rule expands further (Chain Rule): $P\{E_7 \cap E_8 \cap E_9\} = P\{E_7 \cap (E_8 \cap E_9)\} = P\{E_7 \mid E_8 \cap E_9\}\, P\{E_8 \mid E_9\}\, P\{E_9\}$.

Counting possible outcomes of an event is crucial to calculating probabilities. A permutation of size r of n different items is an arrangement of r of the items, where the order of arrangement is important. If order is not important, the arrangement is called a combination.

Example. There are 5 × 4 = 20 permutations and 5 × 4 / (2 × 1) = 10 combinations of size 2 of A, B, C, D, E.

Permutations: AB, BA, AC, CA, AD, DA, AE, EA, BC, CB, BD, DB, BE, EB, CD, DC, CE, EC, DE, ED

Combinations: AB, AC, AD, AE, BC, BD, BE, CD, CE, DE
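The two lists can be enumerated mechanically. The sketch below uses Python's standard `itertools` module, with `math.perm` and `math.comb` (available from Python 3.8) as an independent check on the counts; the variable names are illustrative.

```python
import math
from itertools import combinations, permutations

items = "ABCDE"

# Ordered arrangements of size 2 (AB and BA are different).
perms = ["".join(p) for p in permutations(items, 2)]
# Unordered selections of size 2 (AB and BA are the same).
combs = ["".join(c) for c in combinations(items, 2)]

print(len(perms), math.perm(5, 2))  # 20 20
print(len(combs), math.comb(5, 2))  # 10 10
print(combs)
```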

Standard reference books on probability theory give a comprehensive treatment of how these ideas are used to calculate the probability of occurrence of the outcomes of games of chance.

[Figure: Venn diagram of E1 and E2 within S, with outcome counts n1,0, n1,2, n0,2 and n0,0.]

Bayes’ Rule (Theorem): For a series of mutually exclusive and exhaustive events $B_r$, where the union of the $B_r$, $B_1 \cup B_2 \cup B_3 \cup \dots \cup B_r$, covers all possibilities for B,

$$P\{B_r \mid A\} = \frac{P\{A \mid B_r\}\, P\{B_r\}}{\sum_s P\{A \mid B_s\}\, P\{B_s\}}$$

where the denominator is the total probability of A occurring.

Ex. Paternity indices: based on the actual genotypes of mother, child and alleged father. Before collection of any evidence, we have a prior probability of paternity P{C}. So, what is the situation after the genetic evidence ‘E’ is in?

From Bayes’:

$$\frac{P\{\text{man is father} \mid E\}}{P\{\text{man not father} \mid E\}} = \frac{P\{E \mid \text{man is father}\}}{P\{E \mid \text{man not father}\}} \times \frac{P\{\text{man is father}\}}{P\{\text{man not father}\}}$$

This is written in terms of the ratio of posterior probabilities (the LHS), the paternity index (L, say) and the ratio of prior probabilities (both on the RHS). Rearranging and substituting in the above gives the probability of an alleged man with particular genotype ‘C’ being the true father:

$$P\{C \mid E\} = \frac{L\, P\{C\}}{L\, P\{C\} + (1 - P\{C\})}$$

NB: L is a way of ‘weighting’ the genetic evidence; the issue is setting a prior.
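To see how the paternity index and the prior combine, the rearrangement P{C | E} = L P{C} / (L P{C} + 1 - P{C}) can be sketched in Python. The function name and the numbers used (L = 100, prior = 0.5) are hypothetical, chosen only to illustrate the arithmetic; they do not come from the notes.

```python
def posterior_paternity(paternity_index, prior):
    """Posterior probability of paternity from the index L and the prior P{C}."""
    weighted = paternity_index * prior
    return weighted / (weighted + (1.0 - prior))


# Hypothetical inputs: genetic evidence 100 times more likely under paternity,
# combined with an even prior.
posterior = posterior_paternity(100.0, 0.5)
print(round(posterior, 4))  # 0.9901

# Same answer via the odds form: posterior odds = L x prior odds.
posterior_odds = 100.0 * (0.5 / (1.0 - 0.5))
print(round(posterior_odds / (1.0 + posterior_odds), 4))  # 0.9901
```

With a much smaller prior the same index gives a far lower posterior, which is why the choice of prior is the contentious step.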

Statistical Distributions - Characterisation

If a statistical experiment only gives rise to real numbers, the outcome of the experiment is called a random variable. If a random variable X takes values X1, X2, … , Xn with probabilities p1, p2, … , pn then the expected (average) value of X is defined to be

$$E[X] = \sum_{j=1}^{n} p_j X_j$$

and its variance is

$$\mathrm{VAR}[X] = E[X^2] - E[X]^2 = \sum_{j=1}^{n} p_j X_j^2 - E[X]^2.$$

Example. Let X be a random variable measuring the distance in kilometres travelled by children to a school and suppose that the following data applies.

Prob. pj    Distance Xj    pj Xj    pj Xj^2
0.15        2.0            0.30     0.60
0.40        4.0            1.60     6.40
0.20        6.0            1.20     7.20
0.15        8.0            1.20     9.60
0.10        10.0           1.00     10.00
1.00        -              5.30     33.80

Then the mean and variance are

E[X] = 5.30 kilometres
VAR[X] = 33.80 - 5.30^2 = 5.71 km^2

Similar concepts apply to continuous distributions. The distribution function is defined by F(t) = P{ X ≤ t } and its derivative is the frequency function f(t) = d F(t) / dt, so that

$$F(t) = \int_{-\infty}^{t} f(x)\, dx.$$
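The expectation and variance of the school-distance example can be recomputed with a short Python sketch. The list names are illustrative assumptions; the probabilities and distances are those of the table.

```python
probabilities = [0.15, 0.40, 0.20, 0.15, 0.10]
distances_km = [2.0, 4.0, 6.0, 8.0, 10.0]

# E[X] = sum of p_j * X_j
expected = sum(p * x for p, x in zip(probabilities, distances_km))
# E[X^2] = sum of p_j * X_j^2
expected_square = sum(p * x ** 2 for p, x in zip(probabilities, distances_km))
# VAR[X] = E[X^2] - E[X]^2
variance = expected_square - expected ** 2

print(round(expected, 2), round(expected_square, 2), round(variance, 2))
```

Rounded to two decimals this should reproduce the 5.30, 33.80 and 5.71 of the table.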

Sums and Differences of Random Variables

Define the covariance of two random variables to be

COVAR[X, Y] = E[ (X - E[X]) (Y - E[Y]) ] = E[X Y] - E[X] E[Y].

If X and Y are independent, COVAR[X, Y] = 0.

Lemma
E[X ± Y] = E[X] ± E[Y]
VAR[X ± Y] = VAR[X] + VAR[Y] ± 2 COVAR[X, Y]
E[k X] = k E[X] and VAR[k X] = k^2 VAR[X] for a constant k.

Example. A company records the journey time X of a lorry from a depot to customers and the unloading times Y, as shown.

          X = 1    2     3     4     Totals
Y = 1     7        5     4     4     20
    2     2        6     8     3     19
    3     1        2     5     3     11
Totals    10       13    17    10    50

E[X] = { 1(10) + 2(13) + 3(17) + 4(10) } / 50 = 2.54
E[X^2] = { 1^2(10) + 2^2(13) + 3^2(17) + 4^2(10) } / 50 = 7.5
VAR[X] = 7.5 - (2.54)^2 = 1.0484

E[Y] = { 1(20) + 2(19) + 3(11) } / 50 = 1.82
E[Y^2] = { 1^2(20) + 2^2(19) + 3^2(11) } / 50 = 3.9
VAR[Y] = 3.9 - (1.82)^2 = 0.5876

E[X + Y] = { 2(7) + 3(5) + 4(4) + 5(4) + 3(2) + 4(6) + 5(8) + 6(3) + 4(1) + 5(2) + 6(5) + 7(3) } / 50 = 4.36
E[(X + Y)^2] = { 2^2(7) + 3^2(5) + 4^2(4) + 5^2(4) + 3^2(2) + 4^2(6) + 5^2(8) + 6^2(3) + 4^2(1) + 5^2(2) + 6^2(5) + 7^2(3) } / 50 = 21.04
VAR[X + Y] = 21.04 - (4.36)^2 = 2.0304

E[X Y] = { 1(7) + 2(5) + 3(4) + 4(4) + 2(2) + 4(6) + 6(8) + 8(3) + 3(1) + 6(2) + 9(5) + 12(3) } / 50 = 4.82
COVAR[X, Y] = 4.82 - (2.54)(1.82) = 0.1972
VAR[X] + VAR[Y] + 2 COVAR[X, Y] = 1.0484 + 0.5876 + 2 (0.1972) = 2.0304
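The lorry example can be verified from the joint frequency table. The sketch below is one way to do it; the dictionary layout (rows keyed by Y, columns ordered by X = 1 to 4) and the helper `expect` are assumptions made for this illustration.

```python
# Joint frequencies: key is the unloading time Y, list entries are X = 1..4.
counts = {1: [7, 5, 4, 4], 2: [2, 6, 8, 3], 3: [1, 2, 5, 3]}
total = sum(sum(row) for row in counts.values())  # 50 journeys


def expect(g):
    """Average of g(x, y) over the joint frequency table."""
    return sum(
        c * g(x, y)
        for y, row in counts.items()
        for x, c in enumerate(row, start=1)
    ) / total


var_x = expect(lambda x, y: x * x) - expect(lambda x, y: x) ** 2
var_y = expect(lambda x, y: y * y) - expect(lambda x, y: y) ** 2
covar = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)
var_sum = expect(lambda x, y: (x + y) ** 2) - expect(lambda x, y: x + y) ** 2

# VAR[X + Y] should equal VAR[X] + VAR[Y] + 2 COVAR[X, Y].
print(round(var_sum, 4), round(var_x + var_y + 2 * covar, 4))
```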

Standard Statistical Distributions

Most elementary statistical books provide a survey of commonly used statistical distributions.

Importantly, we can characterise them by their expectation and variance (as for random variables) and by the parameters on which these are based; (see lecture notes for those we refer to).

So, e.g. for a Binomial distribution, the parameters are p, the probability of ‘success’ in an individual trial, and n, the number of trials. The probability of success remains constant; otherwise, another distribution applies.

Use of the correct distribution is core to statistical inference, i.e. estimating what is happening in the population on the basis of a (correctly drawn, probabilistic) sample. The sample is then representative of the population.

Fundamental to statistical inference is the Normal (or Gaussian) distribution, with parameters the mean $\mu$ (formally, the expectation of the distribution) and the standard deviation $\sigma$ (SD) or variance $\sigma^2$. For small samples, or when $\sigma^2$ is not known but must be estimated from a sample, a slightly more conservative distribution, the Student’s T or just ‘t’ distribution, applies. This introduces the degrees of freedom concept.

Student’s t Distribution

A random variable X has a t distribution with n degrees of freedom ( $t_n$ ).

The t distribution is symmetrical about the origin, with
E[X] = 0
VAR[X] = n / (n - 2) (for n > 2).

For small values of n, the $t_n$ distribution is very flat. As n is increased the density assumes a bell shape. For values of n ≥ 25, the $t_n$ distribution is practically indistinguishable from the Standard Normal curve.

O If X and Y are independent random variables, X has a standard normal distribution and Y has a $\chi^2_n$ distribution, then $X / \sqrt{Y / n}$ has a $t_n$ distribution.

O If x1, x2, … , xn is a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$, and if we define $s^2 = \frac{1}{n - 1} \sum ( x_i - \bar{x} )^2$, then $( \bar{x} - \mu ) / ( s / \sqrt{n} )$ has a $t_{n-1}$ distribution.

Estimated sample variance: see calculators, tables etc.

+ Many other standard distributions

Sampling Theory

To draw a random sample from a distribution, assign numbers 1, 2, … to the elements of the distribution and use random number tables or a generated set to decide which elements are included in the sample. If the same element cannot be selected more than once, we say that the sample is drawn without replacement; otherwise, the sample is said to be drawn with replacement.

The usual convention in sampling is that lower case letters designate the sample characteristics, with capital letters used for the (finite) parent population and Greek letters for the infinite. Thus if sample size = n, its elements are designated x1, x2, …, xn, its mean is $\bar{x}$ and its modified variance is $s^2 = \sum (x_i - \bar{x})^2 / (n - 1)$.

Corresponding parent population characteristics = N, $\bar{X}$ and $S^2$ (or $\mu$ and $\sigma^2$).

Suppose we repeatedly draw random samples of size n (with replacement) from a distribution with mean $\mu$ and variance $\sigma^2$. Let $\bar{x}_1, \bar{x}_2, \dots$ be the collection of sample means and let

$$x_i' = \frac{\bar{x}_i - \mu}{\sigma / \sqrt{n}} \qquad (i = 1, 2, \dots)$$

The collection x1’, x2’, … is called the sampling distribution of means (usually denoted U or Z).

Central Limit Theorem. In the limit, as sample size n tends to infinity, the sampling distribution of means has a Standard Normal distribution. This is the basis for Statistical Inference.

Attribute and Proportionate Sampling

If sample elements are a measurement of some characteristic, then we have attribute sampling. However, if all sample elements are 1 or 0 (success/failure, agree/do-not-agree), we have proportionate sampling. For proportionate sampling, the sample average $\bar{x}$ and the sample proportion p are synonymous (just as for the mean $\mu$ and proportion P for the parent population). From our results on the Binomial distribution, the sample variance is p (1 - p) and the variance of the parent distribution is P (1 - P) in the proportionate case.

The ‘sampling distribution’ of means concept generalizes to give the sampling distribution of any statistic. We say that a sample characteristic is an unbiased estimator of the parent population characteristic if the expectation of the corresponding sampling distribution is equal to the parent characteristic.

Lemma. The sample average (proportion) is an unbiased estimator of the parent average (proportion):

E[$\bar{x}$] = $\mu$, so E[p] = P.

The quantity $\sqrt{(N - n) / (N - 1)}$ is called the finite population correction (fpc). If the parent population is infinite or we have sampling with replacement, the fpc = 1.

Lemma. E[s] = S × fpc, for the estimated sample S.D. with fpc.

Confidence Intervals

From the statistical tables for a Standard Normal (Gaussian) distribution, we note that

Area Under Density Function    From     To
0.90                           -1.64    1.64
0.95                           -1.96    1.96
0.99                           -2.58    2.58

From the central limit theorem, if $\bar{x}$ and $s^2$ are the mean and variance of a random sample of size n (with n greater than 25) drawn from a large parent population of size N, then the following statement about the unknown parent mean $\mu$ applies:

$$\text{Prob}\left\{ -1.64 \le \frac{\bar{x} - \mu}{s / \sqrt{n}} \le 1.64 \right\} \approx 0.90$$

i.e.

$$\text{Prob}\left\{ \bar{x} - 1.64\, s / \sqrt{n} \le \mu \le \bar{x} + 1.64\, s / \sqrt{n} \right\} \approx 0.90$$

The range $\bar{x} \pm 1.64\, s / \sqrt{n}$ is called a 90% confidence interval for the parent mean $\mu$.

Example [Attribute Sampling]. A random sample of size 25 has $\bar{x}$ = 15 and s = 2. Then a 95% confidence interval for $\mu$ is

15 ± 1.96 (2 / 5), i.e. 14.22 to 15.78.

Example [Proportionate Sampling]. A random sample of size n = 1000 has p = 0.40 ⇒ 1.96 √( p (1 - p) / (n - 1) ) = 0.03. A 95% confidence interval for P is 0.40 ± 0.03, i.e. 0.37 to 0.43.

[Figure: N(0, 1) density with central area 0.95 between -1.96 and +1.96.]
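The two confidence interval examples can be reproduced with a short Python sketch. The function names are hypothetical; the multiplier 1.96 and the n - 1 divisor in the proportionate case follow the slide.

```python
import math


def mean_interval(mean, sd, n, z=1.96):
    """Normal-theory interval: mean +/- z * sd / sqrt(n)."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


def proportion_interval(p, n, z=1.96):
    """Interval for a proportion: p +/- z * sqrt(p (1 - p) / (n - 1))."""
    half_width = z * math.sqrt(p * (1 - p) / (n - 1))
    return p - half_width, p + half_width


low, high = mean_interval(15, 2, 25)
print(round(low, 2), round(high, 2))      # 14.22 15.78

p_low, p_high = proportion_interval(0.40, 1000)
print(round(p_low, 2), round(p_high, 2))  # 0.37 0.43
```

Passing `z=1.64` or `z=2.58` gives the 90% and 99% intervals from the table of areas.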

Small Sampling Theory

For reference purposes, it is useful to regard the expression

$\bar{x} \pm 1.96\, s / \sqrt{n}$

as the “default formula” for a confidence interval and to modify it for particular circumstances.

O If we are dealing with proportionate sampling, the sample proportion is the sample mean and the standard error (s.e.) term $s / \sqrt{n}$ simplifies as follows: $\bar{x} \to p$ and $s / \sqrt{n} \to \sqrt{p (1 - p) / (n - 1)}$. (For large n, n - 1 may also be replaced by n.)
O A 90% confidence interval will bring about the swap 1.96 -> 1.64.
O For sample size n less than 25, the Normal distribution must be replaced by Student’s $t_{n-1}$ distribution.
O For sampling without replacement from a finite population, a fpc term must be used.

The width of the confidence interval band increases with the confidence level.

Example. A random sample of size n = 10, drawn from a large parent population, has mean $\bar{x}$ = 12 and standard deviation s = 2. Using the $t_9$ distribution, a 99% confidence interval for the parent mean is

$\bar{x} \pm 3.25\, s / \sqrt{n}$, i.e. 12 ± 3.25 (2)/√10, i.e. 9.94 to 14.06

and a 95% confidence interval for the parent mean is

$\bar{x} \pm 2.262\, s / \sqrt{n}$, i.e. 12 ± 2.262 (2)/√10, i.e. 10.57 to 13.43.

Note: For n = 1000, 1.96 √( p (1 - p) / n ) ≈ 0.03 for values of p between 0.3 and 0.7. This gives rise to the statement that public opinion polls have an “inherent error of 3%”. It simplifies calculations in the case of public opinion polls for large political parties.

Tests of Hypothesis

[Motivational Example]. It is claimed that the average grade of all 12 year olds in a country in a particular aptitude test is 60%. A random sample of n = 49 students gives a mean $\bar{x}$ = 55% with standard deviation s = 2%. Is the sample finding consistent with the claim?

The original claim is regarded as a null hypothesis (H0), which is tentatively accepted as TRUE:

H0 : $\mu$ = 60.

If the null hypothesis is true, the test statistic

$$TS = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

is a random variable with a Normal (0, 1), i.e. Standardised Normal Z(0, 1) (or U(0, 1)), distribution.

Thus

(55 - 60) / (2 / √49) = - 35 / 2 = - 17.5

is a random value from Z(0, 1). But this lies outside the 95% confidence interval (it falls in the rejection region), so either

(i) The null hypothesis is incorrect
or (ii) An event with a probability of at most 0.05 has occurred.

Consequently, we reject the null hypothesis, knowing that a probability of 0.05 exists that we are in error. Technically: we reject the null hypothesis at the 0.05 level of significance. The alternative to rejecting H0 is to declare the test to be inconclusive. This means that there is some tentative evidence to support the view that H0 is approximately correct.

[Figure: Z(0, 1) density with central area 0.95 between -1.96 and 1.96 and rejection regions in both tails.]
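The motivational example amounts to one division and one comparison, sketched below in Python. The function name `z_statistic` is an assumption for illustration; 1.96 is the two-tailed 0.05 critical value quoted in the notes.

```python
import math


def z_statistic(sample_mean, claimed_mean, sd, n):
    """Test statistic (x-bar - mu) / (s / sqrt(n))."""
    return (sample_mean - claimed_mean) / (sd / math.sqrt(n))


z = z_statistic(55, 60, 2, 49)
reject_h0 = abs(z) > 1.96  # two-tailed test at the 0.05 level

print(round(z, 2), reject_h0)  # -17.5 True
```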

Modifications

Based on the properties of the Normal, Student ‘t’ and other distributions, we can generalise these ideas. If the sample size n < 25, a $t_{n-1}$ distribution should be used; the level of significance of the test may also be varied, or the test applied to a proportionate sampling environment.

Example. 40% of a random sample of 1000 people in a country indicate satisfaction with government policy. Test at the 0.01 level of significance if this is consistent with the claim that 45% of the people support government policy.

Here H0: P = 0.45, p = 0.40 and n = 1000, so √( p (1 - p) / n ) = 0.015 and the test statistic = (0.40 - 0.45) / 0.015 = - 3.33. The 99% critical value = 2.58, so H0 is rejected at the 0.01 level of significance.

One-Tailed Tests

If the null hypothesis is of the form H0 : P ≥ 0.45, then arbitrarily large values of p are acceptable, so that the rejection region for the test statistic lies in the left-hand tail only.

Example. 40% of a random sample of 1000 people in a country indicate satisfaction with government policy. Test at the 0.05 level of significance if this is consistent with the claim that at least 45% of the people support government policy.

Here the critical value is -1.64, so the null hypothesis H0: P ≥ 0.45 is rejected at the 0.05 level of significance.

[Figure: N(0, 1) density with the rejection region in the left-hand tail below -1.64 and area 0.95 to its right.]
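Both versions of the opinion-poll test can be sketched together in Python. The function name is hypothetical, and the standard error is computed from the sample proportion as on the slide. Without rounding the standard error to 0.015, the statistic comes out at about -3.23 rather than -3.33; the conclusions are the same.

```python
import math


def proportion_z(sample_p, claimed_p, n):
    """Test statistic using the sample-based standard error sqrt(p (1 - p) / n)."""
    standard_error = math.sqrt(sample_p * (1 - sample_p) / n)
    return (sample_p - claimed_p) / standard_error


z = proportion_z(0.40, 0.45, 1000)

# Two-tailed test of H0: P = 0.45 at the 0.01 level (critical value 2.58).
reject_two_tailed = abs(z) > 2.58
# One-tailed test of H0: P >= 0.45 at the 0.05 level (critical value -1.64).
reject_one_tailed = z < -1.64

print(round(z, 2), reject_two_tailed, reject_one_tailed)
```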

Testing Differences between Means

Suppose that x1, x2, … , xm is a random sample with mean $\bar{x}$ and standard deviation $s_1$, drawn from a distribution with mean $\mu_1$, and y1, y2, … , yn is a random sample with mean $\bar{y}$ and standard deviation $s_2$, drawn from a distribution with mean $\mu_2$.

Suppose that we wish to test the null hypothesis that both samples are drawn from the same parent population, i.e.

H0: $\mu_1 = \mu_2$.

The pooled estimate of the parent variance is

$$s_*^2 = s_p^2 = \frac{(m - 1)\, s_1^2 + (n - 1)\, s_2^2}{m + n - 2}$$

and the variance of ( $\bar{x} - \bar{y}$ ) is the variance of the difference of two independent random variables, i.e.

$$s_{\text{diff}}^2 = \frac{s_p^2}{m} + \frac{s_p^2}{n}.$$

This allows us to construct the test statistic $(\bar{x} - \bar{y}) / s_{\text{diff}}$, which under H0 has a $t_{m+n-2}$ distribution.

Example. A random sample of size m = 25 has mean $\bar{x}$ = 2.5 and standard deviation $s_1$ = 2, while a second sample of size n = 41 has mean $\bar{y}$ = 2.8 and standard deviation $s_2$ = 1. Test at the 0.05 level of significance if the means of the parent populations are identical.

Here H0 : $\mu_1 = \mu_2$, $\bar{x} - \bar{y}$ = - 0.3 and $s_p^2$ = { 24(4) + 40(1) } / 64 = 2.125, so the test statistic is

- 0.3 / √( 2.125 / 25 + 2.125 / 41 ) = - 0.811.

The 0.05 critical value for Z(0, 1) is 1.96, so the test is inconclusive.
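The pooled-variance calculation can be followed step by step in Python. This is a sketch under the assumptions of the example; the function name and its return convention (statistic and pooled variance together) are invented for illustration.

```python
import math


def pooled_t(mean_x, s1, m, mean_y, s2, n):
    """Two-sample statistic with a pooled variance estimate."""
    pooled_var = ((m - 1) * s1 ** 2 + (n - 1) * s2 ** 2) / (m + n - 2)
    s_diff = math.sqrt(pooled_var / m + pooled_var / n)
    return (mean_x - mean_y) / s_diff, pooled_var


t_stat, pooled_var = pooled_t(2.5, 2, 25, 2.8, 1, 41)
inconclusive = abs(t_stat) <= 1.96  # 0.05 critical value used on the slide

print(pooled_var, round(t_stat, 3), inconclusive)  # 2.125 -0.811 True
```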

Paired Tests

If the sample values ( xi , yi ) are paired, such as the marks of students in two examinations, then let di = xi - yi be their differences and treat these values as the elements of a sample to generate a test statistic for the hypothesis

H0: $\mu_1 = \mu_2$.

The test statistic $\bar{d} / ( s_d / \sqrt{n} )$ has a $t_{n-1}$ distribution if H0 is true.

Example. In a random sample of 100 students in a national examination, their examination mark in English is subtracted from their continuous assessment mark, giving a mean of 5 and a standard deviation of 2. Test at the 0.01 level of significance if the true mean mark for both components is the same.

Here n = 100, $\bar{d}$ = 5 and $s_d / \sqrt{n}$ = 2/10 = 0.2, so the test statistic is 5 / 0.2 = 25. The 0.01 critical value for a Z(0, 1) distribution is 2.58, so H0 is rejected at the 0.01 level of significance.

Tests for the Variance.

For normally distributed random variables, given H0: $\sigma^2$ = k, a constant, then $(n - 1)\, s^2 / k$ has a $\chi^2_{n-1}$ distribution.

Example. A random sample of size 30 drawn from a normal distribution has variance $s^2$ = 5. Test at the 0.05 level of significance if this is consistent with H0 : $\sigma^2$ = 2.

Test statistic = (29) 5 / 2 = 72.5, while the 0.05 critical value for $\chi^2_{29}$ is 45.72, so H0 is rejected at the 0.05 level of significance.

Chi-Square Test of Goodness of Fit

The test can be used to test the hypothesis H0 that a set of observations is consistent with a given probability distribution. We are given a set of categories with observed ( Oj ) and expected ( Ej ) numbers of observations (frequencies) in each category. Under H0, the test statistic

$$\sum_j (O_j - E_j)^2 / E_j$$

has a $\chi^2_{n-1}$ distribution, with n the number of categories.

Example. A pseudo random number generator is used to generate 40 random numbers in the range 1 - 100. Test, at the 0.05 level of significance, if the results are consistent with the hypothesis that the outcomes are randomly distributed.

Range              1 - 25    26 - 50    51 - 75    76 - 100    Total
Observed Number    6         12         14         8           40
Expected Number    10        10         10         10          40

Test statistic = (6-10)^2/10 + (12-10)^2/10 + (14-10)^2/10 + (8-10)^2/10 = 4. The 0.05 critical value of $\chi^2_3$ is 7.81, so the test is inconclusive.
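The goodness-of-fit statistic is a single sum over the categories, as the Python sketch below shows. The function name is an assumption; the critical value 7.81 is the one quoted above rather than something computed here.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_j - E_j)^2 / E_j over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [6, 12, 14, 8]
expected = [10, 10, 10, 10]

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1
inconclusive = statistic <= 7.81  # 0.05 critical value for 3 degrees of freedom

print(round(statistic, 2), degrees_of_freedom, inconclusive)  # 4.0 3 True
```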

Chi-Square Contingency Test

To test that two random variables are statistically independent, a set of observations can be tabled, with m rows corresponding to categories for one random variable and n columns for the other. Under H0, the expected number of observations for the cell in row i and column j is the appropriate (row total × column total) / (grand total). Under H0, the test statistic

$$\sum_i \sum_j (O_{ij} - E_{ij})^2 / E_{ij}$$

has a $\chi^2_{(m-1)(n-1)}$ distribution.

Chi-Square Contingency Test - Example

In the following table, the figures in brackets are the expected values.

Results     Maths        History      Geography    Totals
Honours     100 (50)     70 (67)      30 (83)         200
Pass        130 (225)    320 (300)    450 (375)       900
Fail        70 (25)      10 (33)      20 (42)         100
Totals      300          400          500            1200

The test statistic is

$\sum_i \sum_j (O_{ij} - E_{ij})^2 / E_{ij}$ = (100-50)^2/50 + (70-67)^2/67 + (30-83)^2/83 + (130-225)^2/225 + (320-300)^2/300 + (450-375)^2/375 + (70-25)^2/25 + (10-33)^2/33 + (20-42)^2/42 = 248.976

The 0.05 critical value for $\chi^2_{2 \times 2} = \chi^2_4$ is 9.49, so H0 is rejected at the 0.05 level of significance.
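A short sketch of the contingency calculation, deriving the expected counts from the row and column totals. Because the expected values here are not rounded to whole numbers, the statistic comes out at roughly 249.34 rather than the 248.976 obtained with the rounded bracketed values; the conclusion is unchanged. The critical value is the tabulated figure quoted in the text.

```python
# Observed counts: rows = Honours, Pass, Fail; columns = Maths, History, Geography.
observed = [
    [100, 70, 30],
    [130, 320, 450],
    [70, 10, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

CRITICAL_VALUE = 9.49  # tabulated 0.05 value for 4 d.f., as quoted in the text
print(f"statistic = {stat:.3f}, d.f. = {df}, reject H0: {stat > CRITICAL_VALUE}")
```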

Note: In general, chi-squared tests tend to be very conservative vis-a-vis other tests of hypothesis, i.e. they tend to give inconclusive results.

The meaning of the term “degrees of freedom”. In simplified terms, as the chi-square distribution is that of the sum of, say, k squares of independent standard normal random variables, it is defined in a k-dimensional space. When we impose a constraint, for example that the sums of observed and expected observations in a column are equal, or estimate a parameter of the parent distribution, we reduce the dimensionality of the space by 1. In the case of the chi-square contingency table, with m rows and n columns, the expected values in the final row and column are predetermined, so the number of degrees of freedom of the test statistic is (m-1)(n-1).


Analysis of Variance/Experimental Design-Many samples, Means and Variances

• Analysis of Variance (AOV or ANOVA) was originally devised for agricultural statistics on crop yields etc. Typically, the land was laid out in row and column format, giving small plots of a fixed size. The yield $y_{i,j}$ within each plot was recorded.

One Way classification

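Model: $y_{i,j} = \mu + \alpha_i + \varepsilon_{i,j}$, with $\varepsilon_{i,j} \sim N(0, \sigma^2)$ in the limit as the sample size becomes large, where

$\mu$ = overall mean
$\alpha_i$ = effect of the $i$th factor
$\varepsilon_{i,j}$ = error term.

Hypothesis: H0: $\alpha_1 = \alpha_2 = \dots = \alpha_m$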

[Figure: plots with yields $y_{i,j}$ grouped by factor level $i$ = 1, 2, 3 (shown: $y_{1,1}, \dots, y_{1,5}$; $y_{2,1}, \dots, y_{2,3}$; $y_{3,1}, \dots, y_{3,3}$).]


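Factor   Observations                                        Totals                     Means
1        $y_{1,1}$  $y_{1,2}$  $y_{1,3}$  …  $y_{1,n_1}$     $T_1 = \sum_j y_{1,j}$     $\bar{y}_{1.} = T_1 / n_1$
2        $y_{2,1}$  $y_{2,2}$  $y_{2,3}$  …  $y_{2,n_2}$     $T_2 = \sum_j y_{2,j}$     $\bar{y}_{2.} = T_2 / n_2$
m        $y_{m,1}$  $y_{m,2}$  $y_{m,3}$  …  $y_{m,n_m}$     $T_m = \sum_j y_{m,j}$     $\bar{y}_{m.} = T_m / n_m$

Overall mean $\bar{y} = \sum_i \sum_j y_{i,j} / n$, where $n = \sum_i n_i$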

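Decomposition (Partition) of Sums of Squares:

$\sum_i \sum_j (y_{i,j} - \bar{y})^2 = \sum_i n_i (\bar{y}_{i.} - \bar{y})^2 + \sum_i \sum_j (y_{i,j} - \bar{y}_{i.})^2$

Total Variation ($Q$) = Between Factors ($Q_1$) + Residual Variation ($Q_E$)

Under H0: $Q / \sigma^2 \to \chi^2_{n-1}$, $Q_1 / \sigma^2 \to \chi^2_{m-1}$, $Q_E / \sigma^2 \to \chi^2_{n-m}$, so that each of $Q/(n-1)$, $Q_1/(m-1)$ and $Q_E/(n-m)$ estimates $\sigma^2$, and

$\dfrac{Q_1 / (m-1)}{Q_E / (n-m)} \to F_{m-1,\, n-m}$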

AOV Table:

Variation   D.F.     Sums of Squares                                       Mean Squares              F
Between     m - 1    $Q_1 = \sum_i n_i (\bar{y}_{i.} - \bar{y})^2$         $MS_1 = Q_1/(m-1)$        $MS_1 / MS_E$
Residual    n - m    $Q_E = \sum_i \sum_j (y_{i,j} - \bar{y}_{i.})^2$      $MS_E = Q_E/(n-m)$
Total       n - 1    $Q = \sum_i \sum_j (y_{i,j} - \bar{y})^2$             $Q/(n-1)$
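The one-way AOV table can be sketched in a few lines. The data below are hypothetical values chosen only for illustration (three factor levels with 5, 3 and 3 observations), and the function name `one_way_anova` is an assumption of this sketch.

```python
def one_way_anova(groups):
    """Return the one-way AOV quantities for a list of samples (one per factor level)."""
    n = sum(len(g) for g in groups)   # total number of observations
    m = len(groups)                   # number of factor levels
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    q1 = sum(len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, means))
    qe = sum((y - gm) ** 2 for g, gm in zip(groups, means) for y in g)
    q = sum((y - grand_mean) ** 2 for g in groups for y in g)

    ms1, mse = q1 / (m - 1), qe / (n - m)
    return {"Q1": q1, "QE": qe, "Q": q, "df_between": m - 1,
            "df_residual": n - m, "MS1": ms1, "MSE": mse, "F": ms1 / mse}


# Hypothetical yields for three factor levels (not taken from the slides).
groups = [[20, 18, 21, 23, 20], [19, 18, 17], [23, 21, 22]]
table = one_way_anova(groups)
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```

The printed F value would then be compared with the tabulated $F_{m-1,\, n-m}$ critical value at the chosen significance level.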


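Two-Way Classification

                          Factor I                                            Means
Factor II    $y_{1,1}$   $y_{1,2}$   $y_{1,3}$   …   $y_{1,n}$       $\bar{y}_{1.}$
                :           :           :               :               :
             $y_{m,1}$   $y_{m,2}$   $y_{m,3}$   …   $y_{m,n}$       $\bar{y}_{m.}$
Means        $\bar{y}_{.1}$   $\bar{y}_{.2}$   $\bar{y}_{.3}$   …   $\bar{y}_{.n}$       $\bar{y}_{..}$ (written simply as $\bar{y}$)

Partition SSQ:

$\sum_i \sum_j (y_{i,j} - \bar{y})^2 = n \sum_i (\bar{y}_{i.} - \bar{y})^2 + m \sum_j (\bar{y}_{.j} - \bar{y})^2 + \sum_i \sum_j (y_{i,j} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y})^2$

Total Variation = Between Rows + Between Columns + Residual Variation

Model: $y_{i,j} = \mu + \alpha_i + \beta_j + \varepsilon_{i,j}$ with $\varepsilon_{i,j} \sim N(0, \sigma^2)$

H0: all $\alpha_i$ are equal. H0: all $\beta_j$ are equal.

AOV Table:

Variation          D.F.          Sums of Squares                                                               Mean Squares                   F
Between Rows       m - 1         $Q_1 = n \sum_i (\bar{y}_{i.} - \bar{y})^2$                                   $MS_1 = Q_1/(m-1)$             $MS_1 / MS_E$
Between Columns    n - 1         $Q_2 = m \sum_j (\bar{y}_{.j} - \bar{y})^2$                                   $MS_2 = Q_2/(n-1)$             $MS_2 / MS_E$
Residual           (m-1)(n-1)    $Q_E = \sum_i \sum_j (y_{i,j} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y})^2$     $MS_E = Q_E/[(m-1)(n-1)]$
Total              mn - 1        $Q = \sum_i \sum_j (y_{i,j} - \bar{y})^2$                                     $Q/(mn-1)$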


Two-Way Example

Factor I          1       2       3       4       5      Totals   Means
Factor II   1    20      18      21      23      20       102     20.4
            2    19      18      17      18      18        90     18.0
            3    23      21      22      23      20       109     21.8
            4    17      16      18      16      17        84     16.8
Totals           79      73      78      80      75       385
Means         19.75   18.25   19.50   20.00   18.75               19.25

ANOVA outline

Variation   d.f.      SSQ      MSQ      F
Rows          3     76.95    25.65    18.86**
Columns       4      8.50     2.13     1.57
Residual     12     16.30
Total        19    101.75
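The sums of squares in the ANOVA outline can be reproduced with a short sketch that applies the two-way partition to the data table. Variable names are illustrative. Without intermediate rounding the F ratios come out at about 18.88 and 1.56; the tabulated 18.86 and 1.57 correspond to rounding the mean squares to two decimals first.

```python
# Rows = Factor II levels 1-4, columns = Factor I levels 1-5 (data from the example).
data = [
    [20, 18, 21, 23, 20],
    [19, 18, 17, 18, 18],
    [23, 21, 22, 23, 20],
    [17, 16, 18, 16, 17],
]
m, n = len(data), len(data[0])

grand_mean = sum(sum(row) for row in data) / (m * n)
row_means = [sum(row) / n for row in data]
col_means = [sum(col) / m for col in zip(*data)]

q_rows = n * sum((rm - grand_mean) ** 2 for rm in row_means)
q_cols = m * sum((cm - grand_mean) ** 2 for cm in col_means)
q_resid = sum(
    (data[i][j] - row_means[i] - col_means[j] + grand_mean) ** 2
    for i in range(m)
    for j in range(n)
)
q_total = sum((y - grand_mean) ** 2 for row in data for y in row)

ms_resid = q_resid / ((m - 1) * (n - 1))
f_rows = (q_rows / (m - 1)) / ms_resid
f_cols = (q_cols / (n - 1)) / ms_resid

print(f"Rows     SSQ = {q_rows:.2f}, F = {f_rows:.2f}")
print(f"Columns  SSQ = {q_cols:.2f}, F = {f_cols:.2f}")
print(f"Residual SSQ = {q_resid:.2f}")
print(f"Total    SSQ = {q_total:.2f}")
```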

FYI: software such as R, SAS, SPSS and MATLAB is designed for analysing such data. SPSS, for example, uses a spreadsheet layout, with variables in columns and individual observations in rows. Thus the ANOVA data above would be written as a set of columns (shown here as rows), with the row index repeating in blocks of 5 and the column index cycling 1 to 5, e.g.

Var. value           20 18 21 23 20 19 18 17 18 18 23 21 22 23 20 17 16 18 16 17
Factor II (row)       1  1  1  1  1  2  2  2  2  2  3  3  3  3  3  4  4  4  4  4
Factor I (column)     1  2  3  4  5  1  2  3  4  5  1  2  3  4  5  1  2  3  4  5
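The conversion from the two-way table to this one-observation-per-row layout can be sketched as below. This is an illustration in plain Python, not the procedure of any particular package; the tuple layout (value, row level, column level) is an assumption of the sketch.

```python
# Two-way table from the example: one inner list per row (Factor II level).
data = [
    [20, 18, 21, 23, 20],
    [19, 18, 17, 18, 18],
    [23, 21, 22, 23, 20],
    [17, 16, 18, 16, 17],
]

# One record per observation: (value, row level, column level), levels counted from 1.
long_format = [
    (value, row_level, col_level)
    for row_level, row in enumerate(data, start=1)
    for col_level, value in enumerate(row, start=1)
]

for record in long_format[:6]:
    print(record)
```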


DA APPLICATIONS CONTEXT e.g. BIO

• GENETICS : 5 branches; aim = ‘Laws’ of Chemistry, Physics, Maths. for Biology

• GENOMICS : Study of Genomes (complete set of DNA carried by Gamete) by integration of 5 branches of Genetics with ‘Informatics and Automated systems’

• PURPOSE of GENOME RESEARCH : Info. on Structure, Function, Evolution of all Genomes – past and present

• Techniques of Genomics from molecular, quantitative, population genetics: Concepts and Terminology from Mendelian genetics and cytogenetics


CONTEXT: GENETICS - BRANCHES

• Classical Mendelian – Gene and Locus, Allele, Segregation, Gamete, Dominance, Mutation

• Cytogenetics – Cell, Chromosome, Meiosis and Mitosis, Crossover and Linkage

• Molecular – DNA sequencing, Gene Regulation and Transcription, Translation and Genetic Code Mutations

• Population – Allelic/Genotypic Frequencies, Equilibrium, Selection, Drift, Migration, Mutation

• Quantitative – Heritability/Additive, Non-additive Genetic Effects, Genetic by Environment Interaction, Plant and Animal Breeding


CONTEXT+ : GENOMICS -LINKAGES

[Diagram: the five branches of genetics (Mendelian, Cytogenetics, Molecular, Population, Quantitative) are linked to GENOMICS, together with Genetic markers, DNA Sequences, Linkage/Physical Maps, Gene Location and QTL Mapping.]


GENOMICS – some KEY QUESTIONS

• HOW do Genes determine total phenotype?

• HOW MANY functional genes necessary and sufficient in a given system?

• WHAT are necessary Physical/Chemical aspects of gene structure?

• IS gene location in Genome specific?

• WHAT DNA sequences/structures are needed for gene-specific functions?

• HOW MANY different functional genes in whole biosphere?

• WHAT MEASURES of essential DNA sameness in different species?


‘DATA’ : STATISTICAL GENOMICS

Some UNUSUAL/SPECIAL FEATURES

• Size – databases very large e.g. molecular marker and DNA/protein sequence data; unreconciled, Legacy

• Mixtures of variables - discrete/continuous e.g. combination of genotypes of genetic markers (D) and values of quantitative traits (C)

• Empirical Distributions needed for some Test Statistics e.g. QTL analysis, H.T. of locus order

• Intensive Computation e.g. Linkage Analysis, QTL and computationally greedy algorithms in locus ordering, derivation of empirical distributions, sequence match etc.

• Likelihood Analysis - Linear Models typically insufficient alone


DA APPLICATIONS CONTEXT e.g. BUSINESS/ FINANCE

• http://big.computing.dcu.ie/; http://sci-sym.dcu.ie

• Data-rich environments – under-utilisation of resources

• RAW DATA into useful information and knowledge

• Similar underpinning: (‘Laws’) – based on analysis

• Purpose – Informed decision-making

• Techniques – quantitative. Concepts & Nature – Pervasive, Dynamic, ‘Health’ subject to Internal/External environments. Key elements - Systems and people

• Forecasting/Prediction/Trigger


CONTEXT+ : FACTORS

[Diagram: factors bearing on the HEALTH of ENTERPRISE (governmental, corporate, educational, non-profit): Supply Chain; Capital; Knowledge & Systems; Labour; Globalisation, technology; Adaptability.]


FRAMEWORK

• Status: Huge array of information systems & product software.

Challenges: include development, delivery, adoption, and implementation of IT solutions into usable and effective systems that mimic/support organisational processes. ‘KS alignment with work practice.’ (Toffler & Drucker – 80’s: organisations of 20th Century -> knowledge-based. Greater autonomy, revised management structures).

Opportunities: KM popularity grew through 90’s, spawned ideas of ‘KM models’, ‘KM strategy’, concepts of ‘organisational learning’, ‘knowledge/practice networks’, ‘knowledge discovery’, ‘intellectual capital’.

• Objectives: To Plan, develop, implement, operate, optimise, cost information /communication systems and interpret use.

• Starting point : understanding ICT opportunities requires both technological and organisational perspective + understanding of benefits associated with data capture and analysis.


Data Mining & KM

• The Knowledge Discovery Process

• Classification e.g. clusters, trees

• Exploratory Data Analysis

• Models (including Bayesian Networks), Graphical or other.

• Frequent Pattern Mining and special groups/subgroups

Key Features:

• ‘Learning models’ from data can be an important part of building an intelligent decision support system.

• Sophistication of analyses – computationally expensive data mining methods, complexity of algorithms, interpretation and application of models.


Hot Topics in BI

• Business Process Management and Modelling

• Supply Chain Management and Logistics

• Innovation and ICT

• Analytical Information Systems, Databases and Data Warehousing

• Knowledge Management and Discovery

• Social Networks and Knowledge Communities

• Performance Indicators & Measurement systems/Information Quality

• Data Analytics, Integration and Interpretation

• Cost-benefit and Impact Analysis

• Reference Models and Modelling

• Process Simulation and Optimization

• Security and Privacy

• IT and IS Architectures/Management

• Info. Sys. development, Tools and Software Engineering


Example Questions

• What are the characteristics of internet purchases for a given age-group? How can this be used to develop further E-business?

• What are key risk factors for profit/loss on a product on the basis of historical data and demographic variables?

• Can we segment customers into/identify groups of similar customers on the basis of their characteristics and purchase behaviour?

• Which products are typically bought together in one transaction by customers?

• What are financial projections, given market volatility and knock-on for recent shock?

• What data should an in-house information system collect? What design principles are involved for a large database?

• What is involved in modelling and IT-supported optimisation of key business processes?
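As one illustration of the "bought together" question above, a minimal sketch of pair counting over transactions is given below. The transactions are hypothetical, and simple co-occurrence counting is only a first step towards the frequent pattern mining methods mentioned earlier, not a full algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each is the set of products in one purchase.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

# Count how often each unordered pair of products appears in the same transaction.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Sorting each basket before forming pairs keeps every pair in a consistent order, so ("bread", "milk") and ("milk", "bread") are counted as the same pair.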