SUMMARY STATISTICS July 2014

MEASURES OF CENTRAL TENDENCY

• We depend on volumes of data to make various strategic decisions in business.

• Dealing with large volumes of data comes with various challenges.

• To the production foreman, detail is of the essence.

• To top management, it is better to summarise the data, because their prime interest is overall profitability.

• Two items are always of utmost importance:

– Measure of central tendency (about the middle of the distribution)

– Measure of dispersion (about the centre)

Mean, Mode, Median and Geometric Mean

• The Mean

– It is the sum of the data divided by the number of items constituting the data:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$

– If we have the following set of measurements:

2, 4, 2, 3, 3, 5, 2 (so n = 7), the mean is calculated as follows:

$$\bar{x} = \frac{2+4+2+3+3+5+2}{7} = \frac{21}{7} = 3$$
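
The calculation above can be checked in a few lines of Python. This is a minimal sketch; the variable names are chosen here for illustration and are not part of the slides.

```python
from statistics import mean

# The seven measurements from the worked example.
values = [2, 4, 2, 3, 3, 5, 2]

# Mean by the definition: sum of the values divided by their count.
manual_mean = sum(values) / len(values)

# The standard-library helper applies the same definition.
library_mean = mean(values)

print(manual_mean, library_mean)
```

Both routes divide the total of 21 by the 7 measurements.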

The mode

• It is the most frequently occurring value in a set of measurements.

• If we have the following set of measurements:

2, 4, 2, 3, 3, 5, 2, the mode is found as follows:

1. Arrange the values in order of magnitude:

2, 2, 2, 3, 3, 4, 5

2. Determine the value that occurs most frequently: 2, which occurs three times.

In the above, the mode is 2.
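
The two steps above can be sketched in Python as follows. The tied list at the end is a hypothetical illustration, not data from the slides; it shows what happens when two values share the highest frequency.

```python
from collections import Counter
from statistics import multimode

values = [2, 4, 2, 3, 3, 5, 2]

# Step 1: arrange the values in order of magnitude.
ordered = sorted(values)

# Step 2: count how often each value occurs and take the most frequent one.
counts = Counter(ordered)
mode_value, mode_count = counts.most_common(1)[0]

# multimode returns every value tied for the highest frequency.
# The list below is a hypothetical example with two modes.
tied_modes = multimode([1, 1, 2, 2, 3])

print(ordered, mode_value, mode_count, tied_modes)
```

Sorting is not strictly needed for counting, but it mirrors the manual procedure.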

• There is no doubt that the mode is important but:

– It may not be unique: a set of values can have two or more values tied for the highest frequency.

– It cannot be expressed algebraically, hence very few statistical operations are developed around it.

The median

• It is the middle value when the values are arranged in ascending order of magnitude and the set contains an odd number of values,

• or the arithmetic average of the middle two values if the set contains an even number of values.

Example

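• Calculate the median of the values below:

2, 4, 2, 3, 3, 5, 2

Solution:

Arranged in ascending order: 2, 2, 2, 3, 3, 4, 5

There are 7 values (an odd number), so the median is the 4th value:

Median = 3

Averaging the middle two values is only needed when the number of values is even. For example, for 2, 2, 3, 3, 4, 5 the median is

$$\frac{3+3}{2} = \frac{6}{2} = 3$$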
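
A minimal Python sketch of the median rule, covering both the odd and the even case. The function name `median_of` and the four-value list in the usage line are illustrative assumptions, not part of the slides.

```python
def median_of(values):
    """Return the median: the middle value, or the mean of the two middle values."""
    if not values:
        raise ValueError("median_of() needs at least one value")
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: single middle value
        return ordered[mid]
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median_of([2, 4, 2, 3, 3, 5, 2])   # the worked example
even_case = median_of([1, 2, 3, 4])           # hypothetical even-sized set
print(odd_case, even_case)
```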

• The median divides the set of values into two halves: one containing values below the median and the other containing values above the median.

• We could also have quartiles (division into 4), deciles (division into tenths) and percentiles (division into hundredths).

• The disadvantage of the median is that it involves laborious arrangement of figures in their order of magnitude.
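
Quartiles, deciles and percentiles can be sketched with the Python standard library, which also does the sorting. Note the assumption: `statistics.quantiles` uses its default "exclusive" interpolation method here, and other textbooks or packages use slightly different conventions, so cut points can differ for small data sets.

```python
from statistics import quantiles

values = [2, 4, 2, 3, 3, 5, 2]

# n=4 gives the three cut points that divide the data into quarters.
quartiles = quantiles(values, n=4)

# n=10 and n=100 give the cut points for tenths and hundredths.
deciles = quantiles(values, n=10)
percentiles = quantiles(values, n=100)

# The middle quartile is the median.
print(quartiles, len(deciles), len(percentiles))
```

Dividing into k parts always yields k - 1 cut points, which is why there are 3 quartiles, 9 deciles and 99 percentiles.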

The Mean of Grouped Data

• Raw data may be presented in the form of a frequency table, as in the example below about the daily receipts of a shopping mall over 500 days.

Daily receipts (₵)    Number of days
0<100                 10
100<200               30
200<300               50
300<400               80
400<500               100
500<600               85
600<700               75
700<800               40
800<900               25
900<1000              5

The formula:

If the r class intervals are numbered 1, …, r, and m_i and f_i are the midpoint of, and the number of measurements in, the ith interval respectively, the mean is

$$\bar{x} = \frac{\sum_{i=1}^{r} f_i m_i}{\sum_{i=1}^{r} f_i}$$

Thus, if n = ∑ f_i is the total number of measurements, the group mean is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{r} f_i m_i$$

Daily receipts (₵)    Number of days (f_i)    Midpoint (m_i)    f_i m_i
0<100                 10                      50                500
100<200               30                      150               4,500
200<300               50                      250               12,500
300<400               80                      350               28,000
400<500               100                     450               45,000
500<600               85                      550               46,750
600<700               75                      650               48,750
700<800               40                      750               30,000
800<900               25                      850               21,250
900<1000              5                       950               4,750
Total                 500                                       242,000

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{r} f_i m_i = \frac{242{,}000}{500} = 484$$

Mean daily receipts = GHC 484.00
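
A minimal sketch that recomputes the grouped mean from the class boundaries and frequencies in the table, which is a useful check on the hand arithmetic. The tuple layout and variable names are assumptions made for this illustration.

```python
# Each class: (lower bound, upper bound, number of days), from the frequency table.
classes = [
    (0, 100, 10), (100, 200, 30), (200, 300, 50), (300, 400, 80),
    (400, 500, 100), (500, 600, 85), (600, 700, 75), (700, 800, 40),
    (800, 900, 25), (900, 1000, 5),
]

# n is the total number of measurements (sum of the frequencies).
total_days = sum(freq for _, _, freq in classes)

# Sum of frequency times class midpoint over all classes.
weighted_sum = sum(freq * (low + high) / 2 for low, high, freq in classes)

grouped_mean = weighted_sum / total_days
print(total_days, weighted_sum, grouped_mean)
```

Because every measurement in a class is represented by its midpoint, the result is an approximation to the mean of the raw receipts.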

Dispersion

• A measure of centrality alone does not provide an adequate summary of a set of values.

• Consider the two sets of values below:

(a) -2, -1, 0, 0, 1, 2

(b) 0, 0, 0, 0, 0, 0

In both sets, the mean, mode and median are all 0. The difference in character lies not in their centrality but in their variation about the central value.

In the measurement of dispersion, there are various statistics:

– Standard deviation and variance, which are of utmost importance

– The range (difference between the highest and lowest measurements)

– The inter-quartile range (difference between the upper and lower quartiles)
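
The contrast between sets (a) and (b) can be sketched in Python. The helper `value_range` is a name introduced here for illustration, and the standard deviation uses the divide-by-n (population) form via `statistics.pstdev`.

```python
from statistics import mean, median, pstdev

set_a = [-2, -1, 0, 0, 1, 2]
set_b = [0, 0, 0, 0, 0, 0]


def value_range(values):
    """Range: highest measurement minus lowest measurement."""
    return max(values) - min(values)


for label, data in (("a", set_a), ("b", set_b)):
    # Centre: identical for both sets.
    centre = (mean(data), median(data))
    # Spread: this is where the two sets differ.
    spread = (value_range(data), pstdev(data))
    print(label, centre, spread)
```

The centre values agree for both sets, while the range and standard deviation are non-zero only for set (a).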

Standard Deviation and Variance

• Finding the standard deviation (σ) of a set of values given as:

2, 4, 2, 3, 3, 5, 2

– Find the mean of the distribution

(2+4+2+3+3+5+2)/7 = 21/7 = 3

– Find the deviation of each measure from the mean

-1, 1, -1, 0, 0, 2, -1

– Square the deviation of each measure from the mean

1, 1, 1, 0, 0, 4, 1

– Sum the squared deviations

1+1+1+0+0+4+1 = 8

– Divide the sum of the squared deviations by the total number of measurements to give the variance: σ² = 8/7 = 1.142857

– Take the square root of the variance to return to the original units of measurement: σ = √(8/7) = √1.142857 = 1.069045

Therefore, σ = 1.069045

NOTE: S.D. = σ and variance = σ²
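
The steps above translate almost line for line into Python. This is a sketch of the divide-by-n calculation used in the worked example; the variable names are illustrative.

```python
from math import sqrt

values = [2, 4, 2, 3, 3, 5, 2]
n = len(values)

# Step 1: mean of the distribution.
mean_value = sum(values) / n

# Step 2: deviation of each measure from the mean.
deviations = [x - mean_value for x in values]

# Step 3: square each deviation.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: sum the squared deviations.
sum_of_squares = sum(squared_deviations)

# Step 5: divide by the number of measurements to get the variance.
variance = sum_of_squares / n

# Step 6: the square root of the variance is the standard deviation.
std_dev = sqrt(variance)

print(deviations, sum_of_squares, variance, std_dev)
```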

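Formula

• The standard deviation is given by the formula:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \quad \text{or} \quad \sigma = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n} - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2}$$

where

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Large sample:

$$S = \sqrt{\frac{\sum f (X - \bar{X})^2}{\sum f}}$$

Sample:

$$S = \sqrt{\frac{\sum f (X - \bar{X})^2}{n - 1}}$$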
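
A short sketch comparing the two equivalent divide-by-n forms and the n - 1 sample form on the seven values used earlier. It assumes ungrouped data (every frequency f equal to 1) and uses `statistics.pstdev` and `statistics.stdev` from the Python standard library as cross-checks.

```python
from math import isclose, sqrt
from statistics import pstdev, stdev

values = [2, 4, 2, 3, 3, 5, 2]
n = len(values)
mean_value = sum(values) / n

# Definitional form: root of the mean squared deviation.
definitional = sqrt(sum((x - mean_value) ** 2 for x in values) / n)

# Computational form: mean of the squares minus the square of the mean.
computational = sqrt(sum(x * x for x in values) / n - (sum(values) / n) ** 2)

# Sample form: divide by n - 1 instead of n.
sample = sqrt(sum((x - mean_value) ** 2 for x in values) / (n - 1))

print(definitional, computational, sample)
print(pstdev(values), stdev(values))
```

The n - 1 divisor always gives a slightly larger value than the n divisor, and the gap shrinks as the sample grows.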

REGRESSION AND CORRELATION ANALYSIS

• Statistics involves the analysis of variables:

– Dependent variable

– Independent variable

• Regression analysis explains the form of the dependence of one variable on another.

• Correlation analysis measures the degree of dependence of one variable on the other.

• Both regression and correlation analysis study the association between a set of variables.

• The power of these analyses is prediction: estimating the value of one variable given the value of another.

• E.g. we can predict output levels given the man-hours, or predict sales volume given the amount of money spent on promotion.

Linear regression

Y = mX + c

This is a simple linear equation where X and Y are variables and m and c are constants.

Y = 4X + 6

Y = 3.5X + 7.2

Y = 13.8X + 76.1

are all examples of linear equations that can be represented graphically as a straight line. The objective, however, is to determine the line of best fit on a scatter diagram, which is found through partial differentiation.

It is also possible to have non-linear relationships whose graphical representation is not in the form of a straight line.

Least Squares Regression

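• In least squares regression analysis, we are given two equations to solve simultaneously:

$$\sum y_i = nc + m\sum x_i \qquad (1)$$

$$\sum x_i y_i = c\sum x_i + m\sum x_i^2 \qquad (2)$$

Solving (1) and (2) for m and c gives the values that minimise the sum of squared deviations:

$$m = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2}$$

$$c = \frac{(\sum y_i)(\sum x_i^2) - (\sum x_i)(\sum x_i y_i)}{n\sum x_i^2 - (\sum x_i)^2} = \frac{1}{n}\left(\sum y_i - m\sum x_i\right)$$

OR, in deviation form, y = (∑xy / ∑x²) x, where x = X − X̄ and y = Y − Ȳ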
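
The two simultaneous equations come from the partial differentiation step. A short sketch of that derivation follows, assuming the best fit is defined as the line that minimises the sum of squared vertical deviations; the symbol S is introduced here for illustration only.

```latex
\begin{align*}
S(m, c) &= \sum_{i=1}^{n} \left(y_i - m x_i - c\right)^2 \\
\frac{\partial S}{\partial c} &= -2 \sum_{i=1}^{n} \left(y_i - m x_i - c\right) = 0
  \;\Longrightarrow\; \sum y_i = nc + m \sum x_i \\
\frac{\partial S}{\partial m} &= -2 \sum_{i=1}^{n} x_i \left(y_i - m x_i - c\right) = 0
  \;\Longrightarrow\; \sum x_i y_i = c \sum x_i + m \sum x_i^2
\end{align*}
```

Setting both partial derivatives to zero gives exactly the pair of equations that is then solved for m and c.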

Example

• The output of A&B Co is given in Table 1. You are to provide the best fit using the least squares approach.

Table 1

Week    Total Output (X), independent variable    Total Cost (Y), dependent variable
1       2                                         11.2
2       3                                         15.6
3       5                                         20.3
4       4                                         20.8
5       1                                         7.8
6       3                                         10.6
7       2                                         12.3
8       4                                         21.5
9       5                                         22
10      6                                         27.6

SOLUTION

Week     Total Output (X)    Total Cost (Y)    XY       X²     Y²
1        2                   11.2              22.4     4      125.44
2        3                   15.6              46.8     9      243.36
3        5                   20.3              101.5    25     412.09
4        4                   20.8              83.2     16     432.64
5        1                   7.8               7.8      1      60.84
6        3                   10.6              31.8     9      112.36
7        2                   12.3              24.6     4      151.29
8        4                   21.5              86       16     462.25
9        5                   22                110      25     484
10       6                   27.6              165.6    36     761.76
TOTAL    35                  169.7             679.7    145    3,246.03
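
The column totals can be reproduced in Python. The Y² total is not needed for the slope and intercept; it is the kind of quantity a correlation coefficient uses. The last lines are a sketch that assumes Pearson's product-moment coefficient r as the measure of the degree of dependence, which the slides do not state explicitly.

```python
from math import sqrt

# Weekly data for A&B Co from Table 1.
output = [2, 3, 5, 4, 1, 3, 2, 4, 5, 6]                                # X
cost = [11.2, 15.6, 20.3, 20.8, 7.8, 10.6, 12.3, 21.5, 22, 27.6]      # Y

n = len(output)
sum_x = sum(output)
sum_y = sum(cost)
sum_xy = sum(x * y for x, y in zip(output, cost))
sum_x2 = sum(x * x for x in output)
sum_y2 = sum(y * y for y in cost)

# Assumed measure: Pearson's r, computed from the column totals.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2, r)
```

A value of r close to 1 would indicate that cost rises almost linearly with output.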

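• Given the deviation form y = (∑xy / ∑x²) x, where x = X − X̄ and y = Y − Ȳ:

X̄ = 35/10 = 3.5 and Ȳ = 169.7/10 = 16.97

∑xy = ∑XY − nX̄Ȳ = 679.7 − (10 × 3.5 × 16.97) = 85.75

∑x² = ∑X² − nX̄² = 145 − (10 × 3.5²) = 22.5

m = 85.75/22.5 = 3.811

Or, using the totals directly:

m = [(10 × 679.7) − (35 × 169.7)] / [(10 × 145) − (35)²] = 857.5/225 = 3.811

c = (1/10)[169.7 − (3.811 × 35)] = 3.63

Y = 3.811X + 3.63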
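
The fit can be reproduced with a short Python function. This is a sketch of the totals-based formulas for m and c, cross-checked against the deviation form; the function name `least_squares` and the output level of 7 used for the prediction are hypothetical choices made for illustration.

```python
from math import isclose


def least_squares(xs, ys):
    """Return slope m and intercept c of the least squares line Y = mX + c."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    c = (sum_y - m * sum_x) / n
    return m, c


output = [2, 3, 5, 4, 1, 3, 2, 4, 5, 6]                                # X
cost = [11.2, 15.6, 20.3, 20.8, 7.8, 10.6, 12.3, 21.5, 22, 27.6]      # Y

m, c = least_squares(output, cost)

# Cross-check the slope using deviations from the means.
mean_x = sum(output) / len(output)
mean_y = sum(cost) / len(cost)
m_deviation = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(output, cost))
    / sum((x - mean_x) ** 2 for x in output)
)

# Hypothetical prediction: total cost at an output level of 7.
predicted_cost = m * 7 + c

print(round(m, 3), round(c, 2), predicted_cost)
```

An output of 7 lies just outside the observed range of 1 to 6, so the prediction is an extrapolation and should be read with caution.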