
IUT de Saint-Etienne – Département TC – J.F. Ferraris – Math – S2 – Stat2Var – TExCorr – Rev2016

SALES AND MARKETING Department

MATHEMATICS

2nd Semester

________ Bivariate statistics ________

SOLUTIONS of tutorials and exercises

Online document: http://jff-dut-tc.weebly.com section DUT Maths S2


Exercise 1. (Tutorial for lesson page 5)

Are tobacco behavior and gender independent, at a 10% significance level?

Here are the results of a survey made on a sample of 51 men and 66 women:

G: variable "gender" (Gm: men; Gw: women)

B: variable "behavior in relation to tobacco" (Bn: never smoked; Bs: smoke; Bss: stopped smoking)

Observed frequencies:

| | Gm | Gw | total |
|---|---|---|---|
| Bn | 12 | 23 | 35 |
| Bs | 31 | 26 | 57 |
| Bss | 8 | 17 | 25 |
| total | 51 | 66 | 117 |

Theoretical frequencies according to H0:

| | Gm | Gw | total |
|---|---|---|---|
| Bn | 15.26 | 19.74 | 35 |
| Bs | 24.85 | 32.15 | 57 |
| Bss | 10.90 | 14.10 | 25 |
| total | 51 | 66 | 117 |

Detailed Chi-squares and total:

| | Gm | Gw |
|---|---|---|
| Bn | 0.69507 | 0.53710 |
| Bs | 1.52417 | 1.17777 |
| Bss | 0.77038 | 0.59529 |

Total: χ²calc = 5.300

1) Place the subtotals and the general total in the first table, and in the second one, identically.

2) Fill the second table (6 central th values) following proportional calculations.

3) Table #3: calculate the six Chi-square, then add them to get the value χ²calc.

4) Test writing:

Null hypothesis: H0: gender and tobacco behavior are independent.

Observed χ²: value of the variable χ² between the observed and the theoretical samples: χ²calc = 5.3

Rejection area:

Significance level: α = 10 %

Number of dof: (r – 1)(k – 1) = (3 – 1)(2 – 1) = 2

Limit value of the variable χ² before rejection: χ²lim = 4.61

Comparison and decision:

As χ²calc > χ²lim, H0 can be rejected at a 10% significance level.

In other words, we can say with less than 10% risk of being wrong that men and women behave differently with tobacco. However, we could not reject our null hypothesis at a 5% significance level: χ²lim is then 5.99, which χ²calc does not reach, so claiming dependence carries more than a 5% risk of being wrong.
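The three tables above follow a mechanical recipe, which can be sketched in Python. This is only an illustrative sketch: the function name `chi_square_independence` is an assumption of this example, and the limit value 4.61 is simply copied from the χ² table rather than computed.

```python
def chi_square_independence(observed):
    """Return (chi2, dof, expected) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Theoretical count under H0: row total * column total / general total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    # Sum of the partial chi-squares (obs - th)^2 / th
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected


# Exercise 1 data: rows Bn, Bs, Bss; columns Gm, Gw
observed = [[12, 23], [31, 26], [8, 17]]
chi2, dof, expected = chi_square_independence(observed)
chi2_lim = 4.61  # read from the chi-square table: alpha = 10 %, 2 dof
print(round(chi2, 3), dof, "reject H0" if chi2 > chi2_lim else "cannot reject H0")
```

The same function applies unchanged to the larger tables of the next exercises; only the limit value read from the χ² table changes.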


Exercise 2.

Two candidates compete for a presidential election: NS and FH. In a little town, there are 500 voters: 100 are retired people, 50 are unemployed and 350 are employees. There, the vote results are:

| voters | FH | NS | blank/abstention |
|---|---|---|---|
| unemployed | 24 | 16 | 10 |
| employees | 122 | 148 | 80 |
| retired | 36 | 27 | 37 |

1) Decide, with a 1% significance level, whether people’s opinion depends on their social group or not.

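* H0: "The type of vote is independent of the social group"

* Let’s perform the necessary calculations in order to get χ²calc:

Observations:

| | FH | NS | blank/abstention | total |
|---|---|---|---|---|
| unemployed | 24 | 16 | 10 | 50 |
| employees | 122 | 148 | 80 | 350 |
| retired | 36 | 27 | 37 | 100 |
| total | 182 | 191 | 127 | 500 |

In theory (independence):

| | FH | NS | blank/abstention | total |
|---|---|---|---|---|
| unemployed | 18.2 | 19.1 | 12.7 | 50 |
| employees | 127.4 | 133.7 | 88.9 | 350 |
| retired | 36.4 | 38.2 | 25.4 | 100 |
| total | 182 | 191 | 127 | 500 |

Chi-square:

| | FH | NS | blank/abstention |
|---|---|---|---|
| unemployed | 1.848 | 0.503 | 0.574 |
| employees | 0.229 | 1.529 | 0.891 |
| retired | 0.004 | 3.284 | 5.298 |

Chi²calc = 14.16

* Rejection area: with α = 1 % and 4 degrees of freedom: Chi²lim = 13.28

* Decision: as Chi²calc > Chi²lim, we can reject H0 (that is, claim that people’s opinion depends on their social group) with a 1% chance of being wrong.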

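2) What can we say if we do not include blank votes and abstentions?

Let’s redo the analysis, excluding blank votes and abstentions:

* Observations:

| | FH | NS | total |
|---|---|---|---|
| unemployed | 24 | 16 | 40 |
| employees | 122 | 148 | 270 |
| retired | 36 | 27 | 63 |
| total | 182 | 191 | 373 |

In theory (independence):

| | FH | NS | total |
|---|---|---|---|
| unemployed | 19.52 | 20.48 | 40 |
| employees | 131.7 | 138.3 | 270 |
| retired | 30.74 | 32.26 | 63 |
| total | 182 | 191 | 373 |

Chi-square:

| | FH | NS |
|---|---|---|
| unemployed | 1.03 | 0.981 |
| employees | 0.72 | 0.687 |
| retired | 0.9 | 0.858 |

Chi²calc = 5.175

* With 2 dof: Chi²lim = 5.991 with α = 5 % and Chi²lim = 4.605 with α = 10 %.

We can assert that people’s opinion depends on their social group with a 10 % chance of being wrong, but we could not assert it if we accepted only a 5 % chance of being wrong.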

Exercise 3.

The table shows attendance in two stores A and B: how many people made at least one purchase. These clients have been sorted by age group (10 to 15 years, and so on).

| age \ store | A | B |
|---|---|---|
| 10 - 15 | 46 | 24 |
| 15 - 20 | 29 | 35 |
| 20 - 40 | 14 | 17 |
| > 40 | 12 | 18 |

1) Say, with a 5% significance level, whether the chosen store depends on the age of a client.

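* Observed frequencies (obs):

| age \ store | A | B | total |
|---|---|---|---|
| 10 to 15 | 46 | 24 | 70 |
| 15 to 20 | 29 | 35 | 64 |
| 20 to 40 | 14 | 17 | 31 |
| 40 + | 12 | 18 | 30 |
| total | 101 | 94 | 195 |

Theoretical frequencies (th):

| age \ store | A | B | total |
|---|---|---|---|
| 10 to 15 | 36.26 | 33.74 | 70 |
| 15 to 20 | 33.15 | 30.85 | 64 |
| 20 to 40 | 16.06 | 14.94 | 31 |
| 40 + | 15.54 | 14.46 | 30 |
| total | 101 | 94 | 195 |

Partial χ²:

| age \ store | A | B | total |
|---|---|---|---|
| 10 to 15 | 2.6185 | 2.8135 | 5.4320 |
| 15 to 20 | 0.5192 | 0.5579 | 1.0771 |
| 20 to 40 | 0.2634 | 0.2830 | 0.5464 |
| 40 + | 0.8058 | 0.8658 | 1.6716 |
| total | 4.2069 | 4.5202 | 8.727 |

* With 3 dof and a 5% level, the table gives χ²lim = 7.815.

* Thus, this limit value has been exceeded. With a 5 % significance level, we can reject the hypothesis that the choice of the store and the age group are independent.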


2) What age group mostly contributes to the previous result? Explain.

The age group “10 to 15 years old” contributes most to the total χ² (5.432 out of 8.727). People over 15 show much the same purchasing behavior, whereas the first age group shows a very different frequency distribution (first row of the observed table) compared to the other customers.

3) Give the meaning of the “5% significance level” in your first answer.

We assert the dependence between age and chosen store with a 5 % risk of being wrong.

4) According to your Chi² table, can you be more accurate about the risk taken in this statement (your first answer)?

To reject H0 at a 2% level, χ²calc would have to exceed 9.837, which it does not. So the χ² table (form) doesn’t allow us to say more than “the risk is between 2% and 5%”.

Exercise 4.

In a survey, 100 people were asked about their age and their attendance at theaters (cinema). We name X the variable "age" and Y the variable "number of annual cinema shows". The survey result is the following table of counts (fr.: citations):

| Y \ X | [15 ; 25[ | [25 ; 50[ | ≥ 50 |
|---|---|---|---|
| none | 4 | 6 | 13 |
| 1 to 11 | 10 | 16 | 15 |
| 12 to 23 | 13 | 8 | 4 |
| ≥ 24 | 6 | 3 | 2 |

1) By a χ² independence test, with a 2% significance level, decide whether or not there is a link between the age and the level of attendance at the cinema.

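| Y \ X | [15 ; 25[ obs | [15 ; 25[ th | [15 ; 25[ χ² | [25 ; 50[ obs | [25 ; 50[ th | [25 ; 50[ χ² | 50 and more obs | 50 and more th | 50 and more χ² | total obs | total th | total χ² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| none | 4 | 7.59 | 1.698 | 6 | 7.59 | 0.333 | 13 | 7.82 | 3.431 | 23 | 23 | 5.462 |
| 1 to 11 | 10 | 13.53 | 0.921 | 16 | 13.53 | 0.451 | 15 | 13.94 | 0.081 | 41 | 41 | 1.453 |
| 12 to 23 | 13 | 8.25 | 2.735 | 8 | 8.25 | 0.008 | 4 | 8.5 | 2.382 | 25 | 25 | 5.125 |
| ≥ 24 | 6 | 3.63 | 1.547 | 3 | 3.63 | 0.109 | 2 | 3.74 | 0.81 | 11 | 11 | 2.466 |
| total | 33 | 33 | 6.901 | 33 | 33 | 0.901 | 34 | 34 | 6.704 | 100 | 100 | 14.51 |

With 6 dof and α = 2%, the χ² table gives Chi²lim = 15.03.

Our Chi²calc (14.51) doesn’t exceed it. So, at a 2% significance level, we can’t reject the idea that age and level of attendance at the cinema are independent.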

2) Using your form table, discuss the level of confidence you can assign to the assertion: “they are dependent”.

Our Chi²calc (14.51) lies between the Chi²lim values of levels 5% and 2%. Thus, we can assert dependence with more than 95% confidence, but with less than 98% confidence.

3) Identify the most important partial Chi² values and give the meaning of these high values.

The biggest partial Chi² has been obtained with the “50 years old and more” whose attendance is “none”: the observed frequency (13) is much higher than the expected one (7.82).

The partial Chi² of the “50 years old and more” whose attendance is “between 12 and 23 times a year” is big too: the observed frequency (4) is much lower than the theoretical one (8.5).

The partial Chi² of the “15 to 25 years old” whose attendance is “between 12 and 23 times a year” is big too: the observed frequency (13) is much higher than the theoretical one (8.25).



Exercise 5.

Let’s take a close look at the evolution of a company’s turnover through time.

| turnover (M€) | quarter 1 | quarter 2 | quarter 3 | quarter 4 |
|---|---|---|---|---|
| 2009 | 28 | 45 | 49 | 36 |
| 2010 | 30 | 44 | 48 | 40 |
| 2011 | 28 | 46 | 52 | 37 |
| 2012 | 31 | 42 | 54 | 39 |

Though there are big seasonal variations, due to its particular activity, is it possible to find out a global trend over several years?

Let’s decide to calculate and display the moving means over 5 consecutive quarters:

(do it as a group job: divide the set of calculations with your neighbors and share your results)

| quarters | 1-5 | 2-6 | 3-7 | 4-8 | 5-9 | 6-10 | 7-11 | 8-12 | 9-13 | 10-14 | 11-15 | 12-16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| Y | 37.6 | 40.8 | 41.4 | 39.6 | 38 | 41.2 | 42.8 | 40.6 | 38.8 | 41.6 | 43.2 | 40.6 |

calculations:

The values of X (on the graph) correspond to the way the quarters can be numbered: 1st quarter 2009 → x = 1 ; 2nd quarter 2009 → x = 2 ; and so on. The values of X in the above table can then be deduced: they are the integers from 3 to 14: 1st value = mean of 1,2,3,4,5 = 3 ; 2nd value = mean of 2,3,4,5,6 = 4 ; and so on until the 12th value, which is the mean of 12,13,14,15,16, that is 14.

The Y values in the above table are the average turnovers of the company during each considered group of five quarters. e.g.: 1st value of Y = mean of 28,45,49,36,30 = 37.6 ; 2nd value of Y = mean of 45,49,36,30,44 = 40.8 ; and so on.

[Graph: the twelve moving-mean points (X, Y)]
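The sliding-window calculation described above can be checked with a few lines of Python. This is a minimal sketch; the helper name `moving_means` is an assumption of this example, not part of the lesson.

```python
def moving_means(values, window=5):
    """Means of every run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Quarterly turnover (M EUR), 2009 quarter 1 to 2012 quarter 4
turnover = [28, 45, 49, 36, 30, 44, 48, 40, 28, 46, 52, 37, 31, 42, 54, 39]
quarters = list(range(1, len(turnover) + 1))  # x = 1 .. 16

x_means = moving_means(quarters)  # centres of the windows: 3 .. 14
y_means = moving_means(turnover)  # smoothed turnover

for x, y in zip(x_means, y_means):
    print(int(x), round(y, 1))
```

Applying the same helper to the quarter numbers and to the turnovers keeps each smoothed value aligned with the centre of its window.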



Exercise 6.

Let’s return to one of the examples introduced page 3 (lessons doc): effect of the amount of fertilizer on the harvested production.

| plot # | fertilizer X (kg.ha-1) | harvest Y (q.ha-1) |
|---|---|---|
| 1 | 150 | 46 |
| 2 | 80 | 37 |
| 3 | 120 | 46 |
| 4 | 220 | 51 |
| 5 | 100 | 43 |

1) For each half-cloud, determine the mean points coordinates.

Half-clouds have to be defined: since there are 5 pairs of results, let’s put 3 points on the left and 2 points on the right (the opposite split would have been allowed too), always separating them by their X values:

1st half-cloud: (80, 37), (100, 43), (120, 46); mean point: G1(100, 42)

2nd half-cloud: (150, 46), (220, 51); mean point: G2(185, 48.5)

2) Determine the expression of the Mayer’s line (G1G2).

slope: $a = \dfrac{48.5 - 42}{185 - 100} = \dfrac{6.5}{85} \approx 0.07647$

y = 0.07647 x + b can be written with the coordinates of G1 (for instance): 42 = 0.07647×100 + b, which gives us b = 34.35.

Expression of the Mayer’s line: y = 0.07647 x + 34.35

3) On a graph, plot the initial table and draw this line.
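As a cross-check of the manual calculation, here is a small Python sketch of Mayer's method on the fertilizer data. The function name `mayer_line` and its `left_size` parameter are assumptions of this example; the split of 3 points then 2 is the one chosen above.

```python
def mayer_line(points, left_size):
    """Slope and intercept of the line through the mean points of two half-clouds."""
    ordered = sorted(points)  # half-clouds are always separated by X values
    left, right = ordered[:left_size], ordered[left_size:]

    def mean_point(cloud):
        n = len(cloud)
        return sum(x for x, _ in cloud) / n, sum(y for _, y in cloud) / n

    (x1, y1), (x2, y2) = mean_point(left), mean_point(right)
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1  # the line passes through G1
    return slope, intercept


# (fertilizer in kg/ha, harvest in q/ha) for the five plots
plots = [(150, 46), (80, 37), (120, 46), (220, 51), (100, 43)]
a, b = mayer_line(plots, left_size=3)
print(round(a, 5), round(b, 2))
```

Calling `mayer_line(plots, left_size=2)` gives the other allowed split (2 points then 3), which generally leads to a slightly different line.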

Exercise 7.

Determine the expression of the Mayer’s line for the case given in exercise 5.

The 16 values are split into two groups of 8: 2009 and 2010 on one side, 2011 and 2012 on the other.

$x_{G_1} = \dfrac{1 + 2 + \dots + 8}{8} = 4.5$ ; $y_{G_1} = \dfrac{28 + 45 + \dots + 40}{8} = 40$

$x_{G_2} = \dfrac{9 + 10 + \dots + 16}{8} = 12.5$ ; $y_{G_2} = \dfrac{28 + 46 + \dots + 39}{8} = 41.125$

slope: $a = \dfrac{41.125 - 40}{12.5 - 4.5} = \dfrac{1.125}{8} = 0.140625$

y = 0.140625 x + b can be written with the coordinates of G1 (for instance): 40 = 0.140625×4.5 + b, which gives us b = 39.367.

Expression of the Mayer’s line: y = 0.140625 x + 39.367



Exercise 8.

Calculate or display on your calculator: the means and standard deviations; the covariance.

1) Taking the data of exercise 6 (fertilizer/harvest)

$\bar{x} = 134$ kg.ha-1 and $\bar{y} = 44.6$ q.ha-1 ; $\sigma(X) = 48.826$ kg.ha-1 and $\sigma(Y) = 4.5869$ q.ha-1 (Stat mode).

$\mathrm{Cov}(X, Y) = \dfrac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y} = \dfrac{30900}{5} - 134 \times 44.6 = 203.6$

2) Taking the data of exercise 4 (age/# of cinema shows): choose 60 as average age for the class 50 and more; choose 36 as average number of shows for the class 24 and more.

$\bar{x} = 39.375$ years and $\bar{y} = 10.795$ shows ; $\sigma(X) = 16.422$ years and $\sigma(Y) = 10.833$ shows (Stat mode).

$\mathrm{Cov}(X, Y) = \dfrac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y} = \dfrac{36890}{100} - 39.375 \times 10.795 = -56.15$
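Both covariances can be reproduced with the formula "mean of the products minus product of the means". The sketch below is illustrative: the function name `covariance` is an assumption, and so are the class centres 20 and 37.5 (ages) and 0, 6 and 17.5 (shows), taken as class midpoints; only the centres 60 and 36 are imposed by the question.

```python
def covariance(xs, ys, weights=None):
    """Population covariance: mean of x*y minus product of the means."""
    if weights is None:
        weights = [1] * len(xs)
    n = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / n
    mean_y = sum(w * y for w, y in zip(weights, ys)) / n
    mean_xy = sum(w * x * y for w, x, y in zip(weights, xs, ys)) / n
    return mean_xy - mean_x * mean_y


# 1) fertilizer / harvest: five individual pairs
cov_fertilizer = covariance([150, 80, 120, 220, 100], [46, 37, 46, 51, 43])

# 2) age / cinema shows: contingency table with class centres
ages = [20, 37.5, 60]        # assumed midpoints; 60 is given in the question
shows = [0, 6, 17.5, 36]     # assumed midpoints; 36 is given in the question
counts = [[4, 6, 13], [10, 16, 15], [13, 8, 4], [6, 3, 2]]  # rows: shows, columns: ages

xs, ys, ws = [], [], []
for show, row in zip(shows, counts):
    for age, count in zip(ages, row):
        xs.append(age)
        ys.append(show)
        ws.append(count)
cov_cinema = covariance(xs, ys, ws)

print(round(cov_fertilizer, 1), round(cov_cinema, 2))
```

The `weights` argument plays the role of the frequency list entered in the calculator.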


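Exercise 9.

Let’s consider the following time series: a company’s annual expenses in advertising.

| X : year | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Y : expense (k€) | 41 | 60 | 55 | 66 | 87 | 61 | 90 | 95 | 82 | 120 | 125 | 118 |

The corresponding scatter plot is represented:

[Graph: scatter plot of the expense Y against the year X]

Determine the expression of the Y on X fitting line, following the least square method; then, draw it.

(D) : y = 7.0629x + 37.42, where x is the rank of the year (2002 → x = 1, …, 2013 → x = 12)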
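The coefficients of (D) can be recovered from the usual least square formulas, slope = Cov(X, Y) / V(X) and intercept = ȳ − slope × x̄. The sketch below assumes that the years are coded by their rank (2002 → 1), which is consistent with the value x0 = 14 used for 2015 in the point estimates of this document; the function name `least_squares_line` is an assumption of this example.

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the Y on X least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x  # the line passes through the mean point
    return slope, intercept


expenses = [41, 60, 55, 66, 87, 61, 90, 95, 82, 120, 125, 118]  # k EUR, 2002..2013
ranks = list(range(1, len(expenses) + 1))  # assumed coding: 2002 -> 1, ..., 2013 -> 12
a, b = least_squares_line(ranks, expenses)
print(round(a, 4), round(b, 2))
```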


Exercise 10.

The following table indicates the sales price (€) of an equipment and the number of sold items, for 4 years.

| year rank | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| sales price (€) X | 300 | 210 | 270 | 375 |
| # of sold items Y | 198 | 240 | 222 | 160 |

1) Build the scatter plot with an orthogonal frame. The axes intersection must be the point (210, 160);

scales: 1 cm for €15 on the abscissas axis, 1 cm for 10 items on the ordinates axis.

2) Determine the coordinates of G, mean point of the cloud.

G(288.75 ; 205)

3) a. Determine the expression of the Y on X fitting line, following the least square method.

The coefficients will be expressed with 6 significant figures.

y = -0.498274 x + 348.876

b. Draw this regression line on the graph.

4) Which year saw the highest turnover? For which amount?

The turnover is X×Y. Its four values, for years 1 to 4, are: 59400, 50400, 59940 and 60000. The highest was in year # 4 (€60000).

going further:

5) Now, we assume that, each year, the number of sold items y and the sales price x are related this way: y = – 0.498 x + 349. We denote S(x) the turnover achieved by selling y items, €x each.

a. Express S(x) with respect to x.

S(x) = xy = -0.498 x² + 349 x

b. Find the variations of the function S defined in [210 ; 375].

S’(x) = -0.996 x + 349 > 0 iff x < 350.4. S is increasing in [210 ; 350.4] and decreasing in [350.4 ; 375].

c. Deduce the sales price we would have to set for a fifth year if we want a maximum turnover. How many items will be sold (round to one unit)? For what turnover?

We have to set the sales price at €350.4. # of sold items: y = – 0.498×350.4 + 349 = 174.5.

Considering 174 items, x = €350.4/unit and turnover = €60969.6;

considering 175 items, x = €350.4/unit and turnover = €61320.
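The optimum of question 5 can be checked numerically. In this sketch the function name `turnover` is an assumption; the coefficients -0.498 and 349 are the rounded ones given in the question.

```python
SLOPE, INTERCEPT = -0.498, 349  # y = SLOPE * x + INTERCEPT (question 5)


def turnover(price):
    """S(x) = x * y, with y items sold at the given unit price."""
    return price * (SLOPE * price + INTERCEPT)


# S'(x) = 2 * SLOPE * x + INTERCEPT vanishes at the top of the parabola
best_price = -INTERCEPT / (2 * SLOPE)
items = SLOPE * best_price + INTERCEPT

print(round(best_price, 1), round(items, 1), round(turnover(best_price)))
```

Since the coefficient of x² is negative, S increases up to `best_price` and decreases afterwards, so this price gives a maximum.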


Exercise 11.

500 people, having passed their driving license exam, are sorted in the table below.

They are distributed with respect to the number X of times they took the exam before passing it and to the number Y of hours of driving lessons before their first attempt.

| X \ Y | [0 ; 15[ | [15 ; 25[ | [25 ; 40[ |
|---|---|---|---|
| 1 | 23 | 92 | 80 |
| 2 | 77 | 84 | 33 |
| 3 | 42 | 35 | 13 |
| 4 | 12 | 6 | 3 |

1) Define a marginal frequency. Then, give an example from the table.

A marginal frequency is the total number of individuals associated with one value of one of the variables.

e.g.: 195 people (marginal frequency) passed their exam on their first try (value: X = 1).

2) Describe, shortly, the way to enter the data set in your calculator.

We usually enter the frequencies in List3, so 12 values here; List1 and List2 are used for entering the corresponding X and Y values.

3) Calculate the covariance of the pair (X, Y) and give a concrete comment about this value.

$\mathrm{Cov}(X, Y) = \dfrac{16815}{500} - 1.874 \times 19.375 = -2.679$, negative. Globally, the more hours of driving lessons one takes, the fewer attempts one needs to pass the exam.

4) Among those who took between 15 and 25 hours of driving lessons, what is the rate of those who passed

their exam on the third attempt? 35/217 = 16.13 %

5) Among those who passed their exam on the third attempt, what is the rate of those who took between 15

and 25 hours of driving lessons? 35/90 = 38.89 %
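Questions 4 and 5 use the same cell (35 people) but divide it by two different totals. A short Python sketch makes the difference explicit; the variable names are assumptions of this example.

```python
# Rows: number of attempts X; columns: hours of lessons [0;15[, [15;25[, [25;40[
counts = {1: [23, 92, 80], 2: [77, 84, 33], 3: [42, 35, 13], 4: [12, 6, 3]}
hour_classes = ["[0;15[", "[15;25[", "[25;40["]

# Totals of each row and of each column of the table
row_totals = {x: sum(row) for x, row in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]

j = hour_classes.index("[15;25[")
# Question 4: among the column [15;25[, share of X = 3
rate_third_given_hours = counts[3][j] / col_totals[j]
# Question 5: among the row X = 3, share of the column [15;25[
rate_hours_given_third = counts[3][j] / row_totals[3]

print(f"{rate_third_given_hours:.2%}", f"{rate_hours_given_third:.2%}")
```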

Exercise 12.

A sales agent wishes to analyse his (or her) activity and efficiency. For each appointment with a prospect, the length (X, in minutes) of the presentation of the product and the sold quantity (Y) have been noted. The twelve values inside the table give the number of appointments that correspond to each pair (X, Y).

| X \ Y | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| [0 ; 10[ | 3 | 2 | 2 | 0 |
| [10 ; 20[ | 0 | 4 | 8 | 7 |
| [20 ; 30[ | 1 | 5 | 12 | 3 |

1) Give the meaning of the frequency "8" found inside the table.

During each of 8 appointments with prospects, the sales agent made a 10 to 20 min-long presentation and then sold 2 units.

2) Calculate, manually, the average time spent per appointment.

Marginal frequencies of the three values of X: 7, 19 and 21. The corresponding lengths are 5, 15 and 25 (in minutes). Total number of appointments: 47.

The average time is then (7×5 + 19×15 + 21×25)/47 = 17.98 minutes per appointment (about 18 minutes).

3) Give the covariance of the pair (X, Y).

$\mathrm{Cov}(X, Y) = \dfrac{1595}{47} - 17.9787 \times 1.80851 = 1.421$



Exercise 13.

Data about the fuel consumption of a motorcycle have been collected (consumption: Y, in L/100km; speed: X, in km/h):

| X | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
|---|---|---|---|---|---|---|---|---|---|
| Y | 15.2 | 11.6 | 9.3 | 7.8 | 7 | 6.6 | 6.9 | 8 | 9.6 |

The scatter plot clearly shows that a linear regression would be inappropriate to describe how the consumption evolves with the speed. Thus, we propose a change of variable.

1) Let’s define the variable T by: T = (X – 60)².

Complete the following table:

| T | 2500 | 1600 | 900 | 400 | 100 | 0 | 100 | 400 | 900 |
|---|---|---|---|---|---|---|---|---|---|
| Y | 15.2 | 11.6 | 9.3 | 7.8 | 7 | 6.6 | 6.9 | 8 | 9.6 |

2) Perform a linear regression of Y on T.

Cov(T, Y) = 81280/9 – 766.66667×9.111111 = 2045.926 ; r = 2045.926/(780.3133×2.62782) = 0.997759

r is very close to 1: a linear fitting between T and Y is appropriate.

Least square regression line: y = 0.00336 t + 6.535

3) Thus, deduce the expression of the regression curve, for the initial scatter plot.

Regression curve of the pair (X, Y) : y = 0.00336 (x – 60)² + 6.535
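The change of variable can be replayed step by step in Python: build T, fit Y on T by least squares, then substitute T = (x – 60)² back. This is a minimal sketch; the helper names `mean`, `fit` and `predict` are assumptions of this example.

```python
from math import sqrt

speeds = [10, 20, 30, 40, 50, 60, 70, 80, 90]             # X, km/h
consumption = [15.2, 11.6, 9.3, 7.8, 7, 6.6, 6.9, 8, 9.6]  # Y, L/100km
t_values = [(x - 60) ** 2 for x in speeds]                 # T = (X - 60)^2


def mean(values):
    return sum(values) / len(values)


def fit(ts, ys):
    """Least-squares slope, intercept and correlation coefficient of Y on T."""
    mt, my = mean(ts), mean(ys)
    cov = mean([t * y for t, y in zip(ts, ys)]) - mt * my
    var_t = mean([t * t for t in ts]) - mt ** 2
    var_y = mean([y * y for y in ys]) - my ** 2
    slope = cov / var_t
    return slope, my - slope * mt, cov / sqrt(var_t * var_y)


a, b, r = fit(t_values, consumption)


def predict(speed):
    """Regression curve on the initial variables: y = a (x - 60)^2 + b."""
    return a * (speed - 60) ** 2 + b


print(round(a, 5), round(b, 3), round(r, 4))
```

The same pattern works for any other change of variable: only the line that builds `t_values` and the formula in `predict` change.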

Exercise 14. quadratic fitting

A company recorded its profits Y with respect to X, the produced and sold quantity:

| X (tons) | 2 | 3 | 5 | 7 | 11 |
|---|---|---|---|---|---|
| Y (k€) | 38 | 55 | 72 | 69 | 24 |
| T | -16 | -9 | -1 | -1 | -25 |

1) Thanks to your calculator, give the linear correlation coefficient between X and Y. Comment.

Cov(X, Y) = 1348/5 – 5.6×51.6 = -19.36 ; r = -19.36/(3.2×18.315) = -0.3303

This is far from -1: the linear correlation between X and Y is very weak.

2) Let’s define the variable T = -(X - 6)².

a. Complete the table.

b. Calculate Cov(T, Y) and then the linear correlation coefficient between both variables.

Cov(T, Y) = -1844/5 - (-10.4)×51.6 = 167.84 ; r = 167.84/(9.2×18.315) = 0.9961

c. Is a linear fitting of Y on T appropriate?

r is very close to 1: a linear fitting between T and Y is appropriate.

d. Determine the expression of the Y on T fitting line, following the least square method.

y = 1.983 t + 72.22

e. Deduce an expression of the regression of Y on X.

y = -1.983(x - 6)² + 72.22


Exercise 15. quadratic fitting

A market study was conducted on a new type of product. The table below gives, for several proposed sales prices, the number of people willing to pay that price.

| unit price (€) X | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|
| number of people Y | 66 | 47 | 34 | 25 | 18 | 14 |

Table of results (T = X(X - 20); Y’: number of people given by the model; CA = XY and CA’ = XY’: turnovers):

| X | Y | T | Y’ | CA | CA’ |
|---|---|---|---|---|---|
| 2 | 66 | -36 | 62.97 | 132 | 125.9 |
| 3 | 47 | -51 | 48.88 | 141 | 146.6 |
| 4 | 34 | -64 | 36.66 | 136 | 146.7 |
| 5 | 25 | -75 | 26.33 | 125 | 131.7 |
| 6 | 18 | -84 | 17.88 | 108 | 107.3 |
| 7 | 14 | -91 | 11.3 | 98 | 79.13 |

1) Calculate the covariance of the variables X and Y, then comment on its sign.

$\mathrm{Cov}(X, Y) = \dfrac{740}{6} - 4.5 \times 34 = -29.67$, negative: Y values tend to increase as X decreases.

2) We set T = X(X - 20)

a. Calculate the linear correlation coefficient between both variables T and Y.

$\mathrm{Cov}(T, Y) = \dfrac{-11610}{6} - (-66.8333) \times 34 = 337.33$ ; $r = \dfrac{337.33}{18.95096 \times 17.93507} = 0.992487$

b. Comment on its value.

This coefficient (0.992487) is very close to 1: the linear correlation between T and Y is excellent.

c. Determine the expression of the Y on T fitting line, following the least square method.

y’ = 0.9393 t + 96.78

d. Deduce an expanded expression of the regression of Y with respect to X.

y’ = 0.9393 (x² - 20x) + 96.78 = 0.9393 x² - 18.79 x + 96.78

3) Here we examine the expected turnover (unit selling price × number of sales), if the numbers of citations obtained in the survey are considered to be the numbers of units sold.

a. Calculate the turnovers that can be extracted from the initial table.

See the table of results above (turnover = CA = XY)

b. Calculate, for the same values of X, the turnovers CA’ that can be obtained thanks to the formula found in question 2)d.

See the table of results above (turnover = CA’ = XY’)

c. What unit selling price should we fix, so that the best turnover would be reached?

According to the model, it seems that CA’ would be maximum when X is between €3 and €4.

Let’s be a little more accurate:

| X | 3 | 3.1 | 3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CA’ | 146.6 | 147.4 | 148.1 | 148.5 | 148.8 | 148.8 | 148.7 | 148.4 | 148 | 147.4 | 146.6 |

We will recommend a selling price of about €3.5 for an optimized turnover.


Exercise 16. inverse fitting

A perfumery, on analysing its turnover, connects the sales quantities (Y) to the prices (X) of various perfume brands and models. The results are gathered in the following table:

| X, bottle’s price (€) | 15 | 25 | 30 | 40 | 45 | 60 | 75 | 90 |
|---|---|---|---|---|---|---|---|---|
| Y, # of sold bottles | 202 | 117 | 107 | 82 | 78 | 60 | 55 | 48 |

Answer the questions beginning with "calculate" by using your calculator’s results.

[Figure: calculator’s results]

1) a. Calculate the covariance of X and Y; comment on its sign.

$\mathrm{Cov}(X, Y) = \dfrac{28000}{8} - 47.5 \times 93.625 = -947.2$, negative. Y is globally a decreasing function of X.

b. Calculate the linear correlation coefficient of X and Y; comment on its value.

$r_{XY} = \dfrac{-947.2}{24.109 \times 46.843} = -0.8387$, not very close to -1. The linear correlation between X and Y is not excellent (the point cloud may be noisy or following a curve).

2) In order to have a more precise idea of how X and Y are related, we set the variable change: $T = \dfrac{850}{X}$

a. After having calculated the list of values of T, in a third list (calculator), justify that the linear correlation is excellent between T and Y.

The values of T have been shown above. The calculations on the pair (T, Y) lead to r = 0.9971, very close to 1. Their linear correlation is excellent.

b. Give the expression of the Y on T regression line, according to the least square method.

y’ = 3.215 t + 15.62

c. What is the least square criterion?

The sum of the squared residuals must be minimum (which makes the fitting line unique).

d. Deduce from question 2)b a modeled expression of Y with respect to X.

$y' = at + b = a \times \dfrac{850}{x} + b = \dfrac{2733}{x} + 15.62$

e. According to this model, how many bottles priced at €150 would the perfumery expect to sell?

If x = 150, the estimate of y is: $\dfrac{2733}{150} + 15.62 \approx 33.84 \approx 34$: it can expect to sell 34 bottles.



Exercise 17.

Calculate the point estimates, in the given situations.

1) Returning to exercise 9, give an estimate of the expense in 2015.

y’ = 7.0629x + 37.42 ; x0 = 14 ; hence y’0 = k€ 136.3

2) Returning to exercise 6, give an estimate of the quantity of fertilizer that would offer a harvest of 60 q/ha.

y’ = 0.07647x + 34.35 ; y’0 = 60 q/ha ; hence x’0 = 335.4 kg/ha

3) Returning to exercise 13, give an estimate of the fuel consumption when the speed is 100 km/h.

y’ = 0.00336 (x – 60)² + 6.535 ; x0 = 100 ; hence y’0 = 11.91 L/100km


Exercise 18.

Let’s return to exercise 9. We want to estimate the expense, for the year 2015, by a 95% confidence interval.

1) a. Get the values of Y’, from the values of X and the expression of the fitting line;

b. Get the values of Z, by dividing Y by Y’;

c. Then, give the mean and standard deviation of Z.

$\bar{z} = 1.000971$ ; $\sigma_Z = 0.125286$

2) Give the point estimate of the expense in 2015.

see exercise 17-1: y’0 = k€ 136.3

3) Give the coefficient u corresponding to the confidence level.

u = 1.96

4) Then, give the confidence interval.

[136.3(1.000971 – 1.96×0.125286) ; 136.3(1.000971 + 1.96×0.125286)]

= [103.0 ; 169.9]


Exercise 19.

With the data of exercise 6, estimate by a 99% confidence interval the harvest obtained with 300 kg/ha of fertilizer.

1) a. Get the values of Y’, from the values of X and the expression of the fitting line;

b. Get the values of Z, by dividing Y by Y’;

c. Then, give the mean and standard deviation of Z.

$\bar{z} = 0.9991106$ ; $\sigma_Z = 0.0472554$

2) Give a point estimate of the harvest.

y’ = 0.07647x + 34.35 ; x0 = 300 kg/ha ; hence y’0 = 57.29 q/ha

3) Give the coefficient u corresponding to the confidence level.

u = 2.58

4) Then, give the confidence interval.

[57.29(0.9991 – 2.58×0.047255) ; 57.29(0.9991 + 2.58×0.047255)] = [50.25 ; 64.22]



Exercise 20.

For each person in a sample, a survey noted the age class (X) and the visual acuity (Y, 1/10 = 0.1):

| Y \ X | [5 ; 35[ | [35 ; 45[ | [45 ; 55[ | [55 ; 65[ |
|---|---|---|---|---|
| 0.3 | 1 | 5 | 10 | 20 |
| 0.6 | 8 | 12 | 25 | 18 |
| 0.9 | 55 | 30 | 14 | 6 |

Estimate the visual acuity of an 80-year-old person, by a 99% confidence interval.

Variable Y’, variable Z, results on Z:

$\bar{z} = 0.999266$ ; $\sigma_Z = 0.298378$

Point estimate:

y’ = -0.008422x + 1.038 ; x0 = 80 ; hence y’0 = 0.3642

Coefficient u: u = 2.58

Confidence interval:

[0.3642(0.999266 – 2.58×0.298378) ; 0.3642(0.999266 + 2.58×0.298378)]

= [0.08358 ; 0.6444]

Exercise 21.

In a country, two variables are compared: the consumer force index and the turnover of its car industry:

| consumer force (index) X | 3.26 | 3.85 | 3.44 | 3.08 | 3.6 |
|---|---|---|---|---|---|
| car industry turnover (G€) Y | 9.3 | 9.56 | 9.36 | 9.24 | 9.47 |

1) Give the expression of the Y on X Mayer’s line.

There are two ways to cut this data set as X increases (3 points then 2, or 2 points then 3).

case #1: G1(3.26 ; 9.3) and G2(3.725 ; 9.515) ; y = 0.4624 x + 7.793

case #2: G1(3.17 ; 9.27) and G2(3.63 ; 9.463) ; y = 0.4203 x + 7.938

2) By means of a point estimate, give a value of the consumer force that would correspond to a G€ 10 car industry turnover.

case #1: y = 10 iff x ≈ 4.77

case #2: y = 10 iff x ≈ 4.91

3) Is a strong correlation between two variables a sign of a cause and effect relationship between them?

Not necessarily. This numerical relationship may just be a coincidence.

Exercise 22. least square + confidence interval

Monthly revenues of a commercial website are listed below, from January to December 2015:

in k€ : 3 5 4 8 10 9 13 12 17 18 18 21

1) In a few words, describe the least square method.

This method consists in finding the line that minimizes the sum of the squared residuals (the vertical gaps between the points and the line).

2) Thanks to the global trend of the evolution of the monthly revenue, give the 95% confidence interval of the predictable revenue in December 2016. (number the months from 1 for January 2015)

| month, X | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| revenue, Y | 3 | 5 | 4 | 8 | 10 | 9 | 13 | 12 | 17 | 18 | 18 | 21 |
| Y’ | 2.5 | 4.136 | 5.773 | 7.409 | 9.045 | 10.68 | 12.32 | 13.95 | 15.59 | 17.23 | 18.86 | 20.5 |
| Z | 1.2 | 1.209 | 0.693 | 1.08 | 1.106 | 0.843 | 1.055 | 0.86 | 1.09 | 1.045 | 0.954 | 1.024 |

Expression of the Y on X regression line: y’ = 1.636 x + 0.8636

Point estimate of the revenue in December 2016 (x = 24): y’0 = k€ 40.14

Variable Z : $\bar{z} = 1.0132222$ and $\sigma_Z = 0.14538387$

Coefficient u for a 95 % confidence level: u = 1.96

Confidence interval: [29.23 ; 52.10]

3) Give the probability that, in December 2016, the revenue would be less than k€ 29.23.

There is a 95% chance that this revenue lies inside the interval. Moreover, the concept of confidence interval involves a symmetric probability distribution (the normal law); thus, there is a 2.5% chance that the revenue would be below the interval, and a 2.5% chance that it would be above it. Answer: 2.5%.

4) Build the scatter plot (scale: 2 cm for one month), draw the regression line and finally represent the confidence interval.

[Graph: revenue Y (k€) against month X, with the regression line and the confidence interval]
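The confidence-interval recipe used in this exercise (fit the line, form the ratios Z = Y / Y’, then scale the point estimate by z̄ ± u·σ_Z) can be sketched in Python as follows. The function name `ratio_confidence_interval` is an assumption of this example, and σ_Z is taken as the population standard deviation (`pstdev`), which is what reproduces the value 0.14538 quoted above.

```python
from statistics import mean, pstdev


def ratio_confidence_interval(xs, ys, x0, u):
    """Point estimate at x0 and interval y0 * (z_mean -/+ u * z_sd), with Z = Y / Y'."""
    mx, my = mean(xs), mean(ys)
    # Least-squares line of Y on X
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    fitted = [slope * x + intercept for x in xs]      # values of Y'
    ratios = [y / f for y, f in zip(ys, fitted)]      # values of Z
    z_mean, z_sd = mean(ratios), pstdev(ratios)
    y0 = slope * x0 + intercept
    return y0, z_mean, z_sd, (y0 * (z_mean - u * z_sd), y0 * (z_mean + u * z_sd))


revenue = [3, 5, 4, 8, 10, 9, 13, 12, 17, 18, 18, 21]  # k EUR, Jan..Dec 2015
months = list(range(1, 13))

# December 2016 is month 24; u = 1.96 for a 95 % confidence level
y0, z_mean, z_sd, (low, high) = ratio_confidence_interval(months, revenue, x0=24, u=1.96)
print(round(y0, 2), round(low, 2), round(high, 2))
```

Changing `u` to 2.58 gives the 99% version of the interval used in other exercises.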


Exercise 23. Mayer + confidence interval

The given table includes eight of the major cities of a country. The variable X gives, in thousands, the number of city residents; the variable Y gives, in thousands, the number of students in this city.

| city | X | Y |
|---|---|---|
| A | 850 | 58 |
| B | 623 | 37 |
| C | 587 | 38 |
| D | 360 | 20 |
| E | 312 | 16 |
| F | 275 | 15 |
| G | 262 | 12 |
| H | 244 | 12 |

1) Build the scatter plot from this data series. see below

2) Give the coordinates of the mean point of the cloud. G(439.1 ; 26)

3) a. Using Mayer’s method, determine manually the expression of the Y on X regression line.

G1(273.3 ; 13.75) and G2(605 ; 38.25) ; slope: a = 0.07385

With G1: b = y – ax = -6.430 ; expression: y = 0.07385 x - 6.43

b. Draw this line. Does G belong to it? G always belongs to it.

c. Give "Mayer’s principle". The sum of the residuals must be zero.

4) We will use here another fitting line, whose expression is: y’ = 0.07x - 6.

a. With this line, give the 95% confidence interval of the predictable number of students in a town that has two million inhabitants.

| X | 850 | 623 | 587 | 360 | 312 | 275 | 262 | 244 |
|---|---|---|---|---|---|---|---|---|
| Y | 58 | 37 | 38 | 20 | 16 | 15 | 12 | 12 |
| Y’ | 53.5 | 37.61 | 35.09 | 19.2 | 15.84 | 13.25 | 12.34 | 11.08 |
| Z | 1.084 | 0.984 | 1.083 | 1.042 | 1.01 | 1.132 | 0.972 | 1.083 |

Expression of the fitting line: y’ = 0.07 x - 6

Point estimate of the number of students (x = 2000): y’0 = 134

Variable Z : $\bar{z} \approx 1.0488$ and $\sigma_Z = 0.052588$

Coefficient u associated to a 95 % confidence level: u = 1.96

Confidence interval: [126.7 ; 154.3]

b. What can we say about the chances that the number of students would exceed 155,000 in such a town?

The chance is a bit less than 2.5 %.

[Graph: scatter plot of Y, # students (thousands), against X, # residents (thousands)]


Exercise 24. logarithmic fitting + confidence interval

Service life of some identical office equipment has been studied. In the following table, ti represents the duration of use, expressed in thousands of hours, and R(ti) the rate of equipment still in use at the time ti (e.g.: after 1,000 hours, ti = 1, there is still 90 % of the equipment in use, R(ti) = 0.90).

| ti | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| R(ti) | 0.9 | 0.66 | 0.53 | 0.4 | 0.32 | 0.25 | 0.19 | 0.14 | 0.1 |

1) We set yi = ln[R(ti)] where ln is the natural logarithm. Fill the following table, then build the scatter plot, using the points Mi (ti, yi), in an orthogonal frame.

| ti | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| yi | -0.105 | -0.416 | -0.635 | -0.916 | -1.139 | -1.386 | -1.661 | -1.966 | -2.303 |

2) Would a linear fitting be relevant for the previous scatter plot? Calculate the linear correlation coefficient between T and Y.

These points are almost collinear; a linear fitting appears to be relevant.

3) Using the least square method, determine an expression of the Y on T regression line. Deduce from this expression that there are two positive real numbers k and λ such that: $R(t) = k\,e^{-\lambda t}$.

y’ = -0.26604 t + 0.1605. y = ln R(t) implies $R(t) = e^{y} = e^{-0.26604\,t + 0.1605} = e^{0.1605} \times e^{-0.26604\,t} = 1.174\,e^{-0.26604\,t}$.

4) In this question, we'll take k = 1.174 and λ = 0.266.

a. Determine the predictable rate of equipment still in use after 10,000 hours.

After 10,000 hours, t = 10 ; hence $R(10) = 1.174\,e^{-2.66} \approx 0.0821$, i.e. 8.2 % rounded.

b. After how long is there exactly 50 % of the equipment still in use?

$R(t) = 0.5$ iff $1.174\,e^{-0.266\,t} = 0.5$ iff $e^{-0.266\,t} = 0.5/1.174$ iff $-0.266\,t = \ln(0.5/1.174)$ iff $t = \ln(0.5/1.174)/(-0.266) = 3.209$. Answer: after 3,209 hours.

5) Give a 99% confidence interval of the rate of equipment still in use after 10,000 hours of service.

| T | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Y | -0.105 | -0.416 | -0.635 | -0.916 | -1.139 | -1.386 | -1.661 | -1.966 | -2.303 |
| Y’ | -0.106 | -0.372 | -0.638 | -0.904 | -1.170 | -1.436 | -1.702 | -1.968 | -2.234 |
| Z | 0.998 | 1.118 | 0.996 | 1.014 | 0.974 | 0.966 | 0.976 | 0.999 | 1.031 |

Expression of the Y on T regression line: y’ = -0.26604 t + 0.1605

Point estimate (t = 10) : y’0 = -2.5

Variable Z : $\bar{z} \approx 1.008$ and $\sigma_Z = 0.043476$

Coefficient u associated to a 99 % confidence level: u = 2.58

Confidence interval on y : [-2.8003 ; -2.2395] and the interval on R is: [0.0608 ; 0.1065].
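The logarithmic fitting of questions 1 to 4 can be replayed in Python: take the logarithm of the rates, fit a line by least squares, then convert the intercept and slope back into k and λ. This is a minimal sketch; the names `rate` and `half_life` are assumptions of this example, and the results are computed with unrounded coefficients, so they may differ slightly in the last digit from the rounded values k = 1.174 and λ = 0.266.

```python
from math import exp, log

hours = [1, 2, 3, 4, 5, 6, 7, 8, 9]                        # t, thousands of hours
rates = [0.9, 0.66, 0.53, 0.4, 0.32, 0.25, 0.19, 0.14, 0.1]  # R(t)
ys = [log(r) for r in rates]                               # y = ln R(t)

# Least-squares line of Y on T
n = len(hours)
mean_t = sum(hours) / n
mean_y = sum(ys) / n
slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(hours, ys))
         / sum((t - mean_t) ** 2 for t in hours))
intercept = mean_y - slope * mean_t

# Back to the exponential model R(t) = k * exp(-lam * t)
k = exp(intercept)
lam = -slope


def rate(t):
    """Predicted rate of equipment still in use after t thousand hours."""
    return k * exp(-lam * t)


half_life = log(k / 0.5) / lam  # solves k * exp(-lam * t) = 0.5

print(round(slope, 5), round(k, 3), round(rate(10), 4), round(half_life, 3))
```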


Exercise 25.

100 children have been classified by age (X) and size (Y):

| X \ Y | [95 ; 105[ | [105 ; 125[ | [125 ; 135[ |
|---|---|---|---|
| [3 ; 5[ | 15 | 10 | 0 |
| [5 ; 7[ | 8 | 32 | 5 |
| [7 ; 9[ | 2 | 13 | 15 |

1) Enter this table in your calculator.

2) Give the means and standard deviations of X and Y, calculate their covariance.

$\bar{x} = 6.1$ years, $V(X) = \dfrac{3940}{100} - 6.1^2 = 2.19$, $\sigma(X) = 1.480$ year ;

$\bar{y} = 114.25$ cm, $V(Y) = \dfrac{1315375}{100} - 114.25^2 = 100.6875$, $\sigma(Y) = 10.03$ cm.

$\mathrm{Cov}(X, Y) = \dfrac{70540}{100} - 6.1 \times 114.25 = 8.475$.

3) Calculate their linear correlation coefficient. Comment on this value.

$r = \dfrac{8.475}{1.480 \times 10.03} = 0.5709$, a weak linear correlation (the cloud may be noisy and curved).

4) Nevertheless, does the table allow us to see some trend?

We see that from one age to another, the sizes corresponding to the greatest number of individuals are not the same. But these largest frequencies do not represent, in their row, an overwhelming majority, which reflects a high variability of sizes for children of the same age. Modeling the growth of a child by a straight line, or even by a well-defined curve, is therefore difficult.

5) Assuming that the relationship between age and size is linear until the age of 12, give the 95% confidence interval of the size of a 12-year-old child.

| X | 4 | 6 | 8 | 4 | 6 | 8 | 4 | 6 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| Y | 100 | 100 | 100 | 115 | 115 | 115 | 130 | 130 | 130 |
| n | 15 | 8 | 2 | 10 | 32 | 13 | 0 | 5 | 15 |
| Y’ | 106.12 | 113.86 | 121.6 | 106.12 | 113.86 | 121.6 | 106.12 | 113.86 | 121.6 |
| Z | 0.94233 | 0.87827 | 0.82237 | 1.08368 | 1.01001 | 0.94572 | 1.22503 | 1.14175 | 1.06908 |

Expression of the Y on X regression line: y’ = 3.87 x + 90.64

Point estimate of the size of a 12-year-old child (x = 12): y’0 = 137.08 cm

Variable Z : $\bar{z} \approx 1.0131$ and $\sigma_Z = 0.121881$

Coefficient u corresponding to a 95 % confidence level: u = 1.96

Confidence interval on y : [106.1 ; 171.6].
