Statistical Methods for Engineering

SCHOOL OF ENGINEERING

DIPLOMA IN INDUSTRIAL & OPERATIONS MANAGEMENT

DIPLOMA IN SUPPLY CHAIN MANAGEMENT

DIPLOMA IN CIVIL AVIATION

P01 – SUMMARIZE AND PRESENT

E214 : STATISTICAL METHODS FOR ENGINEERING

Copyright © 2009 School of Engineering, Republic Polytechnic, Singapore

All rights reserved. No part of this document may be reproduced, stored in a retrieval

system, or transmitted, in any form or by any means, electronic, mechanical,

photocopying, recording or otherwise, without the prior permission of the School of

Engineering, Republic Polytechnic, Singapore.


Summarize and Present

It is reported that male polytechnic students have a higher failure rate in the National Physical Fitness Award (Napfa) test compared with JC students. This is despite the fact that many polytechnic students play sports or exercise at least once a week. To help male students pass the Napfa test, your school has introduced an exercise program to help improve their fitness level. Attached are some data from the exercise program:

Data.xls

You are asked to summarize the data and present them in a form that is useful and easy to understand at a glance. How would you describe and measure the different sets of data? What display graphs or tools can you use?


School of Engineering

E214 – Statistical Methods for Engineering

P01 – Summarize and Present


What is Statistics?

• Statistics provides a basis for assessing and drawing a conclusion.

• Statistics plays a critical role in the improvement of the quality of any product or service. It enables engineers to understand phenomena subject to variation and to effectively predict or control them.

• Basic idea behind all statistical methods of data analysis is to make inferences about a population by studying a relatively small sample chosen from it.

• Everything dealing with the collection, processing, analysis, and interpretation of numerical data belongs to the domain of statistics.


Descriptive Vs Inferential Statistics

Descriptive

• Enable understanding of important features or provide insight of data through the use of values and graphical presentations

• Purpose is to organize and summarize the data collected in some meaningful forms or measures that are easily understood

• Examples:

Charts, graphs, plots, measures of mean, median, frequency, standard deviation.

Inferential

Consists of:

• Making claims about population from data collected in sample

• Performing estimations about population characteristics and making predictions

• Determining relationships among variables

• Examples:

Hypothesis Testing, ANOVA, correlation analysis


Stem-and-Leaf Plot

• Simple way to summarize a data set

• Compact way to represent data, and provides some indication of its shape

• A stem-and-leaf plot displays all the sample values, but the order in which the items were sampled cannot be determined

• Example of a Stem-and-Leaf Plot

24 24 26 26 26 27 27 27 27 28 29 30 30

31 33 35 36 36 37 37 43 45 45 46 48 49

50 50 51 53 53 55 56 57 58 59 59 60 60

Stem | Leaf
2 | 44666777789
3 | 001356677
4 | 355689
5 | 00133567899
6 | 00
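The construction of the plot can be sketched in a few lines of Python. This is only an illustration: the function name stem_and_leaf is an assumption of this sketch, and it handles two-digit values only (stem = tens digit, leaf = units digit).

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem) and collect units digits (leaves)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return {stem: "".join(str(d) for d in leaves)
            for stem, leaves in sorted(groups.items())}

# The 39 sample values from the slide
data = [24, 24, 26, 26, 26, 27, 27, 27, 27, 28, 29, 30, 30,
        31, 33, 35, 36, 36, 37, 37, 43, 45, 45, 46, 48, 49,
        50, 50, 51, 53, 53, 55, 56, 57, 58, 59, 59, 60, 60]

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {leaves}")
```

Every value is still visible in the output, but the original sampling order is lost because the values are sorted first.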


Box Plot

• Box Plot presents the median, first and third quartiles, and outliers. It is used to compare samples.

• The box plot has two whiskers and two parts of the box, each representing one quarter of the data.

• ‘Whiskers’ extend from the top and bottom of the box and end at the most extreme data point that is not an outlier

• Interquartile range (IQR) is the difference between the third quartile and first quartile

• Outliers are points that lie more than 1.5 × IQR below the first quartile or more than 1.5 × IQR above the third quartile
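As a rough sketch of how the quantities behind a box plot can be computed, the snippet below applies the 1.5 × IQR rule. The sample values are made up for illustration, and statistics.quantiles uses its own interpolation convention for quartiles, which may differ slightly from a hand calculation.

```python
import statistics

def box_plot_summary(values):
    """Return quartiles, IQR, whisker ends and outliers using the 1.5 x IQR rule."""
    ordered = sorted(values)
    # Assumption: quartiles follow the default convention of statistics.quantiles
    q1, q2, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr,
            "whiskers": (inside[0], inside[-1]), "outliers": outliers}

sample = [2, 3, 4, 5, 6, 7, 8, 9, 10, 40]  # made-up values, not from Data.xls
summary = box_plot_summary(sample)
print(summary)
```

The whiskers end at the most extreme values that are not outliers, so the value 40 is reported separately.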


Anatomy of a Box Plot

[Figure: box plot annotated with its parts: outliers (marked X), third quartile, median, first quartile, the largest data point within 1.5 IQR of the third quartile, and the smallest data point within 1.5 IQR of the first quartile]

(Taken from Navidi W., Statistics for Engineers and Scientists)


Distribution Shape and Box Plot

[Figure: box plots with Q1, Q2 and Q3 marked for three distribution shapes: left-skewed, symmetric and right-skewed]


Histogram

• Most common form of graphical representation of frequency distribution

• Useful in displaying shape, location and variability of the data

• Emphasizes irregularities and unusual features

• Sometimes it can be enough to draw a histogram in order to solve an engineering problem

[Figure: histogram of Frequency of Pupils (vertical axis, 0 to 45) against Marks (horizontal axis, classes 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90)]


Cumulative Frequency

A point on the horizontal axis of the cumulative frequency graph represents a possible data value.

The corresponding height of the plot gives the number of data values that are less than or equal to it.

A cumulative frequency plot is called an ogive.

[Figure: cumulative frequency graph (ogive) of Cumulative Frequency (vertical axis, 0 to 180) against Marks (horizontal axis, 10 to 80)]
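The link between a histogram and a cumulative frequency graph can be sketched as follows. The marks are invented for illustration (they are not the data behind the graphs above); the class intervals are the ones shown on the histogram.

```python
from itertools import accumulate

# Hypothetical marks, invented for this sketch
marks = [12, 25, 27, 33, 38, 41, 45, 47, 52, 58, 66, 71]

# Class intervals 11-20, 21-30, ..., 81-90
classes = [(lo, lo + 9) for lo in range(11, 90, 10)]

# Histogram heights: number of marks falling in each class
frequencies = [sum(lo <= m <= hi for m in marks) for lo, hi in classes]

# Ogive heights: running total of the frequencies
cumulative = list(accumulate(frequencies))

for (lo, hi), f, c in zip(classes, frequencies, cumulative):
    print(f"{lo}-{hi}: frequency {f}, cumulative {c}")
```

The last cumulative value always equals the number of observations.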


Pareto Diagram

• Orders each type of failure or defect according to its frequency

• Very useful in the analysis of defect data in manufacturing systems. Helps engineers identify important defects and their causes

• When a process is identified as a candidate for improvement, the first step is to collect data on the frequency of each type of failure and then present the data on a Pareto Diagram

• Always arrange categories in descending order of frequency of occurrence, that is, the most frequently occurring is on the left, followed by the next most frequently occurring

• The horizontal scale of a Pareto Diagram is usually categorical classifications


Pareto Diagram

The Pareto Diagram highlights the relatively few types of defects that are responsible for most of the observed defects.

Pareto diagram is an important part of a quality improvement program as it forces attention to the most critical defects.

Pareto diagram graphically depicts Pareto’s empirical law that any assortment of events consists of a few major and many minor elements. Typically, two or three elements will account for more than half the total frequency.

It is much easier to reduce or eliminate frequently occurring defects than rare ones.
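A minimal sketch of the ordering behind a Pareto Diagram is given below. The defect categories and counts are hypothetical, chosen only to show the descending order and the cumulative percentage.

```python
# Hypothetical defect counts, invented for illustration
defects = {"Scratch": 12, "Misalignment": 45, "Crack": 5,
           "Discolouration": 30, "Other": 8}

# Arrange categories in descending order of frequency
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

total = sum(defects.values())
running = 0
table = []  # rows of (category, count, cumulative percentage)
for name, count in ordered:
    running += count
    table.append((name, count, round(100 * running / total, 1)))

for row in table:
    print(row)
```

In this made-up example the two most frequent categories already account for 75% of all defects, which is the pattern the Pareto Diagram is meant to expose.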


Mean

• Defined as the sum of the observations divided by the sample size

• To emphasize that it is based on a set of observations, it is often referred to as the sample mean

• It indicates the center of the data

• Affected by extreme values (outliers)

[Figure: two dot plots on number lines. For the first (axis 0 to 10), Mean = 5; for the second (axis extended to 14), Mean = 6]


Median

• The median of a sample is the middlemost value after the data is arranged from smallest to largest

• It is not affected by extreme values (outliers)

• Eliminates the effect of extreme (very large or very small) values

• If n numbers are ordered from smallest to largest:

– If n is odd, the median is the number in position (n+1)/2

– If n is even, the median is the average of the numbers in positions (n/2) and (n/2 + 1)

[Figure: the same two dot plots on number lines. For the first (axis 0 to 10), Median = 5; for the second (axis extended to 14), Median = 5]


Mode

• Most frequently occurring value in a sample

• There may be no mode, or there may be several modes

• It is not affected by extreme values

[Figure: two dot plots on number lines. For the first (axis 0 to 14), Mode = 9; for the second (axis 0 to 6), No Mode]
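The effect of an extreme value on the three measures of central tendency can be checked with Python's statistics module. The sample below is made up for illustration and is not the data plotted on the slides.

```python
import statistics

sample = [1, 3, 5, 5, 6]         # made-up observations
with_outlier = sample + [40]     # same sample plus one extreme value

for label, values in (("original", sample), ("with outlier", with_outlier)):
    print(label,
          "mean =", statistics.mean(values),
          "median =", statistics.median(values),
          "modes =", statistics.multimode(values))
```

The mean moves from 4 to 10 when the outlier is added, while the median and mode stay at 5. Note that statistics.multimode returns every value when all values occur equally often, which corresponds to the "No Mode" case on the slide.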


Time Series Plot

• Also known as Line Graph or Run Chart

• Displays data in a time sequence for a given period of time

• Used to monitor whether there is a systematic change of the data over time (trend)


Scatter Diagram

• Different sets of data are plotted on different axes

• Shows whether a relationship exists between 2 sets of data, i.e. how much one is affected by the other


Shape of a Distribution

• Describes how data is distributed

• Measures of shape

– Symmetric or Skewed

Left-Skewed: Mean < Median < Mode

Symmetric: Mean = Median = Mode

Right-Skewed: Mode < Median < Mean


Quartiles and Percentiles

• The quartiles are the 25th, 50th and 75th percentiles

– First quartile Q1 = 25th percentile

– Second quartile Q2 = 50th percentile

– Third quartile Q3 = 75th percentile

• Second quartile is equal to the median

• Interquartile range = third quartile - first quartile

• Example: Percentiles are often used to interpret scores on standardized tests.

If a student is informed that her score is on the 70th percentile, this means that 70 percent of students who took the test received lower scores.


Determining Quartiles and Percentiles

To calculate the sample 100p-th percentile:

1. Order the n observations from smallest to largest

2. Determine the product np

If np is not an integer, round it up to the next integer and take the ordered value in that position.

If np is an integer, say k, calculate the mean of the k-th and the (k+1)-st ordered observations.

Example:

If n is 80, in order to find Q1 (p = 0.25), first obtain np

np = (80)(0.25) = 20

Since np is an integer, Q1 is obtained by taking the average of the 20th and 21st ordered observations.
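The two-case rule above can be written as a short function. This is a sketch only: the name sample_percentile is an assumption, and p is converted to an exact fraction so that a product such as 80 × 0.25 is recognised as an integer.

```python
import math
from fractions import Fraction

def sample_percentile(values, p):
    """Return the sample 100p-th percentile (0 < p < 1) using the np rule."""
    p = Fraction(str(p))          # exact arithmetic avoids floating-point surprises
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    ordered = sorted(values)
    np_ = len(ordered) * p
    if np_.denominator == 1:      # np is an integer k: average k-th and (k+1)-st
        k = int(np_)
        return (ordered[k - 1] + ordered[k]) / 2
    k = math.ceil(np_)            # otherwise round up and take that ordered value
    return ordered[k - 1]

observations = list(range(1, 81))            # 80 made-up observations: 1, 2, ..., 80
q1 = sample_percentile(observations, 0.25)   # average of the 20th and 21st values
q3 = sample_percentile(observations, 0.75)   # average of the 60th and 61st values
print(q1, q3, q3 - q1)
```

Other textbooks and software packages use slightly different interpolation conventions, so their quartiles may not match this rule exactly.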


Position of Quartiles and Percentiles

[Figure: cumulative frequency graph on which the 25th, 50th and 75th percentiles are marked, locating Q1, Q2 and Q3]


Dispersion

• Measures the spread of the values around the central tendency

• 2 common measures: range and standard deviation

• Standard deviation is an important measure of the variation in the data. You will learn more about it!


Today’s Problem


Conclusion

• Different graphical representations have different advantages

• Stem & Leaf Plot is a compact way to represent data, and provides some indication of its shape

• Box Plot presents the median, first and third quartiles, and outliers. It is used to compare samples

• Histogram is a common form of graphical representation of frequency distribution used for displaying shape, location and variability of the data

• A suitable graphical representation should be chosen depending on what you want to display


Learning Outcomes

• Differentiate between Descriptive and Inferential Statistics

• Select the Appropriate Data Display Tools

– Frequency of Occurrence (Pie Chart, Pareto Chart)

– Distribution of Data (Stem and Leaf Plot, Box Plot, Histogram Plot)

– Trends over Time (Time Series Plot)

– Association (Scatter Diagram)

• Summary Measurements

– Distribution

– Measure of Central Tendency (Mean, Median, Mode)

– Dispersion (Range, Standard Deviation)

– Quartiles and Percentiles





P02 – Describe it with Venn








Describe it with Venn

The Land Transport Authority of Singapore (LTA) wants to find out whether Republic Polytechnic students are adequately served by the public transport system. They are planning to conduct a survey to find out the proportion of students who take public transport to school, as well as the number who get to school on time using public transport. The mode of transport, whether bus, train or both, should be indicated in the survey. How would you advise LTA to conduct this survey? When the survey is completed, how can the responses be analyzed using a Venn diagram to determine if the public transport system serving Republic Polytechnic is satisfactory?


E214 – Statistical Methods for

Engineering

P02 – Describe it with Venn


Sample Space and Events

• In statistics, the set of all possible outcomes of an experiment is called a sample space.

• Sample spaces are usually denoted by the letter S.

• In statistics, any subset of a sample space is called an event.

• A subset is any part of a set, including the whole set, and a set called the empty set (denoted by Ø).

• The empty set has no elements at all.


Sample Space and Events

An Example:

• Roll a die and observe the number obtained.

• In this example, rolling the die is the experiment.

• The only possible outcomes are 1, 2, 3, 4, 5 or 6.

• Consider the event that the die comes up an even number:

– The sample space for the experiment is S = {1,2,3,4,5,6}

– Coming up an even number corresponds to the event Even = {2,4,6}


Mutually Exclusive Events

• Mutually exclusive events have no elements in common.

• For example, it is impossible that a coin can come up both heads and tails.

• Such events are said to be mutually exclusive.

• The events A and B are said to be mutually exclusive if they have no outcomes in common.

• E.g. rain ('A') or no rain ('B') at 12pm are mutually exclusive events

[Figure: Venn diagram of sample space S with non-overlapping events A and B]


Union

• If A and B are any two sets in a sample space S, their union, denoted by A ∪ B, is the subset of S that contains all elements that are either in A, in B, or in both.

• In words, A ∪ B means “A and/or B”.

• E.g. number of students clearing either PP ('A') or CE ('B') or both.

[Figure: Venn diagram of sample space S with the union of A and B shaded]


Intersection

• If A and B are any two sets in a sample space S, their intersection, denoted by A ∩ B, is the subset of S that contains all elements that are in both A and B.

• In words, A ∩ B means “A and B”.

• In the previous example, A ∩ B indicates the number of students clearing both CE and PP.

[Figure: Venn diagram of sample space S with the intersection of A and B shaded]


Complement

• The complement of an event A, denoted by A^c, is the subset of S that contains all the elements of S that are not in A.

• In words, A^c means “not A”.

• E.g. A^c for event A, which is taking the bus, means all responses other than 'Bus', i.e. 'Train' or 'Both' or 'Others'

[Figure: Venn diagram of sample space S showing event A and its complement Ā]


Probability Theorems

Given a finite sample space S and an event A in S, we define P(A), the probability of A, to be a value of an additive set function that satisfies the following three conditions:

• Axiom 1: 0 ≤ P(A) ≤ 1 for each event A in S.

• Axiom 2: P(S) = 1.

• Axiom 3: If A and B are mutually exclusive events in S, then P(A ∪ B) = P(A) + P(B).

(An Axiom is any starting assumption from which other statements are logically derived. It requires no proof.)

• Probability functions must be additive.

– If A1, A2, A3, …, An are mutually exclusive events in a sample space S, then

P(A1 ∪ A2 ∪ A3 ∪ … ∪ An) = P(A1) + P(A2) + P(A3) + … + P(An)

• For any event A, P(A^c) = 1 – P(A)


Addition Rule

When A and B are non-mutually exclusive events in S,

P(A ∪ B) = P(A) + P(B) – P(A ∩ B)

When A and B are mutually exclusive events in S,

P(A ∪ B) = P(A) + P(B), since P(A ∩ B) = 0

[Figure: Venn diagram of sample space S with overlapping events A and B, showing A ∩ B and A ∪ B]


Mutually Exclusive vs Independent

[Figure: Venn diagram with non-overlapping events A and B]

If A and B are mutually exclusive events, then A ∩ B = Ø, so the probability of A occurring given that B has occurred is P(A|B) = 0, and also P(B|A) = 0.

Two events are considered to be independent if the occurrence of one is not affected by the occurrence or nonoccurrence of the other.

The Multiplication Rule below applies if and only if A and B are independent:

P(A ∩ B) = P(A).P(B)

Hence, P(A|B) = P(A) and P(B|A) = P(B)


Independent Events

• If A and B are independent, then the following pairs are also independent:

– A and B^c, A^c and B, and A^c and B^c

• An Example

A die is thrown twice. Find the probability of obtaining a 4 on the first throw and an odd number on the second throw.

Let A be the event 'a 4 is obtained on the 1st throw' => P(A) = 1/6

Let B be the event 'an odd number is obtained on the 2nd throw' => P(B) = 3/6 = 1/2, since B = {1, 3, 5}

Since the result of the 2nd throw is clearly not affected by the result of the 1st throw, A and B are independent events.

Hence, P(A ∩ B) = P(A).P(B) = 1/6 . 1/2 = 1/12
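The result can be checked by listing all 36 equally likely outcomes of two throws. The helper name prob is an assumption of this sketch.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs (1st throw, 2nd throw)
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a true/false test on an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def is_a(o):
    return o[0] == 4          # a 4 on the 1st throw

def is_b(o):
    return o[1] % 2 == 1      # an odd number on the 2nd throw

p_a = prob(is_a)
p_b = prob(is_b)
p_a_and_b = prob(lambda o: is_a(o) and is_b(o))
print(p_a, p_b, p_a_and_b)
```

Counting outcomes directly gives the same value as the Multiplication Rule, which is what independence means.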


Today's Problem

Survey Questions:

Question 1: What is your primary mode of transport to RP?

Response: Bus, Train, Both, Others

Question 2: Under normal circumstances, do you usually arrive in RP on time?

Response: Yes, No


Proposed Solution

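[Figure: Venn diagram of sample space S with events A, B, C and D. The region counts shown are 9, 2, 1, 2, 0, 2, 5 and 4, a total of 25 respondents]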

Events

A : By Bus

B : By Train

C : By Bus and Train

A^c ∩ B^c ∩ C^c : By Others

D : On Time

D^c : Late

A ∩ D^c : Late by Bus

B ∩ D : Punctual by Train

(A ∩ D) ∪ (B ∩ D) ∪ (C ∩ D) : Punctual by Public Transport


Analysis

• 16/25 or 64% of respondents take public transport to RP and arrive on time.

• It may be more meaningful to estimate the probability of students being on time given that they take public transport (16/23 ≈ 70%). This is known as conditional probability.

• A follow-up survey can be conducted to determine the reasons for arriving late by bus

• A bigger sample size is needed to better represent the population that takes the bus to RP, and respondents should be selected randomly to avoid bias, e.g. respondents should have addresses in many parts of the island
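The two percentages quoted in the analysis can be reproduced from the counts. The variable names are assumptions of this sketch; the counts 25, 23 and 16 are taken from the analysis above.

```python
from fractions import Fraction

total_respondents = 25     # all survey respondents
public_transport = 23      # respondents who take public transport
public_and_on_time = 16    # public transport users who arrive on time

# Joint probability: takes public transport AND arrives on time
p_joint = Fraction(public_and_on_time, total_respondents)

# Conditional probability: on time GIVEN public transport
p_public = Fraction(public_transport, total_respondents)
p_on_time_given_public = p_joint / p_public

print(float(p_joint), float(p_on_time_given_public))
```

Dividing by P(public transport) restricts attention to the 23 public transport users, which is why the conditional probability is larger than the joint probability.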


Learning Outcomes

• Concept of:

– Sample Space

– Events

– Mutually Exclusive Events

– Independent Events

– Unions, Intersections, and Complements

• Venn diagrams

• Additive Rules

• Multiplicative Rules





P03 – Dependent or Independent








Dependent or Independent

Having developed a microcontroller based actuator that automatically lowers window shades in strong sunlight, Nathan knows he has to put the system through the rigors of testing before he can unveil it. His ego was bruised when home tests conducted by his school mates revealed that the shades did not lower with every incidence of strong sunlight. Knowing that a system running with a reliability of 95 percent is the industry tolerance for microcontroller based actuators, Nathan is determined to ascertain the reliability of his invention. He commissioned a laboratory to test the reliability of the system over the course of 100 incidences of strong sunlight. Attached are the schematic of the system and the results of the tests conducted by the laboratory:

Schematic of the System and Test Data

Based on the data collected, help Nathan analyse the reliability of the system.


P03:

Dependent or Independent

E214 – STATISTICAL METHODS IN ENGINEERING


Dependent Events

• Two events are said to be dependent if the occurrence or outcome of the first event affects the probability of occurrence of the second event.

Probability of both events occurring, P(A ∩ B) = P(A).P(B|A) = P(B).P(A|B)

• Example

There are 2 red balls and 3 blue balls in a bag. If two balls are drawn at random without replacement, find the probability that both balls are red.

P(both balls red) = P(1st ball is red).P(2nd ball is red given that the 1st ball is red) = 2/5 x 1/4 = 1/10

The probability that the 2nd ball is red is clearly dependent on the result of the 1st draw.
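As a check, the same answer can be obtained by listing every ordered pair of balls that could be drawn. This is a sketch; the single-letter labels are just a convenient encoding.

```python
from fractions import Fraction
from itertools import permutations

bag = ["R", "R", "B", "B", "B"]   # 2 red balls and 3 blue balls

# All ordered draws of two different balls (drawing without replacement)
draws = list(permutations(bag, 2))
both_red = sum(1 for first, second in draws if first == "R" and second == "R")
p_enumerated = Fraction(both_red, len(draws))

# The same probability from P(A ∩ B) = P(A).P(B|A)
p_rule = Fraction(2, 5) * Fraction(1, 4)

print(p_enumerated, p_rule)
```

itertools.permutations treats the two red balls as distinct objects, so each of the 20 ordered draws is equally likely.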


Conditional Probability

• The probability of event A occurring given that event B has already occurred is written as P(A|B)

[Figure: Venn diagram showing events A and B in sample space S]

P(A|B) = P(A∩B)/P(B) …….. (1)

P(B|A) = P(B∩A)/P(A) …….. (2)

Since P(A∩B) = P(B∩A), equating (1) and (2),

P(A|B).P(B) = P(B|A).P(A)

P(A|B) = [P(B|A).P(A)]/P(B) ------- BAYES' RULE

P(A|B) is the probability of event A occurring, given that event B has already occurred.

Note that, in general, P(A|B) ≠ P(B|A).


Conditional Probability

Example

• Given that a heart is picked at random from a pack of 52 playing cards, find the probability that it is a picture card

• P(picture card | heart card)

= P(picture card ∩ heart card) / P(heart card)

= (3/52) / (13/52)

= 3/13


Bayes’ Theorem: An Example

Andy, Ben and Carrie pack biscuits in a factory. From the batch allotted to them, Andy packs 55%, Ben 30% and Carrie 15%. The probability that Andy breaks some biscuits in a packet is 0.7, for Ben it is 0.2, and for Carrie the probability is 0.1. What is the probability that a randomly selected packet with broken biscuits was packed by Andy?

Solution:

Let A be the event ‘the packet was packed by Andy’, B be the event ‘the packet was packed by Ben’, C be the event ‘the packet was packed by Carrie’, and D be the event ‘the packet contains broken biscuits’.

Given P(A) = 0.55, P(B) = 0.3, P(C) = 0.15, P(D|A) = 0.7, P(D|B) = 0.2, P(D|C) = 0.1

We require P(A|D). Using Bayes’ Rule,

P(A|D) = P(D|A).P(A) / P(D)

P(D) = P(D|A).P(A) + P(D|B).P(B) + P(D|C).P(C)

= (0.7)(0.55) + (0.2)(0.3) + (0.1)(0.15) = 0.46

P(A|D) = (0.7)(0.55) / 0.46 = 0.837
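The same calculation can be laid out in Python, which also gives the corresponding probabilities for Ben and Carrie. The dictionary names are assumptions of this sketch; the numbers come from the example.

```python
# Share of the batch packed by each worker: P(A), P(B), P(C)
prior = {"Andy": 0.55, "Ben": 0.30, "Carrie": 0.15}

# Probability of broken biscuits given the packer: P(D|A), P(D|B), P(D|C)
p_broken_given = {"Andy": 0.7, "Ben": 0.2, "Carrie": 0.1}

# Total probability of a broken packet, P(D)
p_broken = sum(prior[w] * p_broken_given[w] for w in prior)

# Bayes' Rule for each packer: P(packer | D)
posterior = {w: prior[w] * p_broken_given[w] / p_broken for w in prior}

print(round(p_broken, 2))
for worker, p in posterior.items():
    print(worker, round(p, 3))
```

The three posterior probabilities sum to 1, because a broken packet must have been packed by one of the three workers.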


Probability Tree

• The probability of the final outcome is given by the sum of the products of the probabilities corresponding to each branch of the tree.

• Probability Tree can be used for both dependent and independent events

• Using the probability tree to solve the preceding example:

Packer (probability) | Broken | Not Broken | Probability of the Broken branch
A (0.55) | 0.7 | 0.3 | (0.55)(0.7)
B (0.3) | 0.2 | 0.8 | (0.30)(0.2)
C (0.15) | 0.1 | 0.9 | (0.15)(0.1)

P(A|D) = (0.55)(0.7) / [(0.55)(0.7) + (0.3)(0.2) + (0.15)(0.1)] = 0.837


Sequence of Dependent Events

Example

A bag contains eight green counters and three black counters. Two counters are drawn, one after the other, without replacement. Find the probability of drawing one green and one black counter, in any order.

[Figure: probability tree for the 1st draw and 2nd draw, with branches G (green) and B (black)]

1st Draw | 2nd Draw
P(G1) = 8/11 | P(G2|G1) = 7/10, P(B2|G1) = 3/10
P(B1) = 3/11 | P(G2|B1) = 8/10, P(B2|B1) = 2/10

P(Drawing 1G & 1B)

= (8/11)(3/10) + (3/11)(8/10)

= 24/110 + 24/110

= 24/55
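The tree calculation can be mirrored in code by multiplying along each branch and adding the two branches that give one counter of each colour. The variable names are assumptions of this sketch.

```python
from fractions import Fraction

green, black = 8, 3
total = green + black

# Branch 1: green first, then black from the remaining 10 counters
p_green_then_black = Fraction(green, total) * Fraction(black, total - 1)

# Branch 2: black first, then green from the remaining 10 counters
p_black_then_green = Fraction(black, total) * Fraction(green, total - 1)

# The two branches are mutually exclusive, so their probabilities add
p_one_of_each = p_green_then_black + p_black_then_green
print(p_green_then_black, p_black_then_green, p_one_of_each)
```

Fraction reduces automatically, so each branch prints as 12/55 rather than 24/110.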


Proposed Probability Tree Solution

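[Figure: probability tree with three stages: strong sunlight and at least one sensor works, microcontroller works, actuator works. The branch probabilities and the probability of each resulting system state are listed below]

Strong sunlight, and at least one sensor works | Microcontroller works | Actuator works | Probability of system state
Yes: 0.9726 | Yes: 0.94 | Yes: 0.88 | 0.804535 (Prob. of shade working)
Yes: 0.9726 | Yes: 0.94 | No: 0.12 | 0.109709
Yes: 0.9726 | No: 0.06 | Yes: 0.88 | 0.051353
Yes: 0.9726 | No: 0.06 | No: 0.12 | 0.007003
No: 0.0274 | Yes: 0.94 | Yes: 0.88 | 0.022665
No: 0.0274 | Yes: 0.94 | No: 0.12 | 0.003091
No: 0.0274 | No: 0.06 | Yes: 0.88 | 0.001447
No: 0.0274 | No: 0.06 | No: 0.12 | 0.000197

P(no sensor works) = 0.30(0.24)(0.38) = 0.0274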


Solution Analysis

Assumptions

1. Non-mutually exclusive events: failure of one component does not preclude the failure of another.

Example: Any of the sensors and the actuator can fail together.

2. Independent failures: the likelihood of a component failing is not affected by the occurrence of other failures.

Example: The probability of the microcontroller working is the same regardless of whether the sensor is functioning.


Solution Analysis

From the Probability Tree:

• Probability of system working successfully on a sunny day = 0.8045

• Probability of shade not lowering on a sunny day = 1 – 0.8045 = 0.1955

• P(system fails and sensor is faulty) = 0.0274

• P(system fails and only microcontroller is faulty) = 0.0513

• P(system fails and only actuator is faulty) = 0.1097


Solution Analysis

Conditional Probability:

• P(system fails | microcontroller is faulty) = 1

• P(system fails | sensor is working) = P(system fails and sensor works) / P(sensor works)

= 0.1681 / 0.9726 = 0.1728


Solution Analysis

Conditional Probability:

• P(only actuator is faulty | system fails) = P(only actuator fails and system fails) / P(system fails)

= 0.1097/(1-0.8045) = 0.5613
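The figures in the solution analysis can be reproduced with a short script that multiplies along each path of the tree. This is a sketch under the stated assumption of independent failures; it uses the rounded value 0.9726 from the slides for "at least one sensor works", and the variable names are assumptions.

```python
from itertools import product

# Stage probabilities as given on the probability tree
p_sensor = 0.9726    # at least one sensor works (rounded value from the slides)
p_micro = 0.94       # microcontroller works
p_actuator = 0.88    # actuator works
stages = [p_sensor, p_micro, p_actuator]

# Probability of every path: (sensor ok, microcontroller ok, actuator ok)
leaves = {}
for states in product([True, False], repeat=3):
    prob = 1.0
    for p, ok in zip(stages, states):
        prob *= p if ok else 1 - p
    leaves[states] = prob

p_works = leaves[(True, True, True)]           # the shade lowers
p_fails = 1 - p_works
p_only_actuator = leaves[(True, True, False)]  # only the actuator is faulty

p_fail_given_sensor_ok = (p_sensor - p_works) / p_sensor
p_only_actuator_given_fail = p_only_actuator / p_fails

print(round(p_works, 4), round(p_fails, 4))
print(round(p_fail_given_sensor_ok, 4), round(p_only_actuator_given_fail, 4))
```

With a reliability of about 0.80, the system falls well short of the 95 percent tolerance mentioned in the problem statement, and the actuator accounts for more than half of the failures.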


Learning Outcomes

• Conditional Probability

• Bayes’ Theorem

• Probability Tree





P04 – MANY COMBINATIONS








Many Combinations

You are a perfume connoisseur who is in charge of creating a new family of perfumes. The scent of a perfume is formed by different components known as notes. There are 3 types of notes, namely top note, middle note and base note. The different notes unfold over time when a perfume is applied and combine to describe the scent of the perfume. You have shortlisted a list of 16 aromatic compounds of which 3 will be used for top notes, 8 for middle notes and 5 for base notes. The new perfume is designed to have 6 different aromatic compounds and must contain at least one of each type of notes. If it takes the company 3 days to produce and test 100 different perfumes, how much time is required for testing all possible combinations of aromas? You may assume that each different perfume would contain a unique combination of compounds.

Illustrative Figures on Perfume Notes

http://donny.blogdetik.com/files/2008/09/olfactive_pyramid.jpg

School of Engineering

E214 Statistical Methods for Engineering

P04

Many Combinations




Permutations

• A permutation is an ordered arrangement of distinct objects.

• One permutation differs from another if the order of arrangement differs or if the content differs.

• How many ways are there to arrange three boys, A, B and C?

• The possible arrangements are ABC, ACB, BAC, BCA, CAB, CBA. There are six ways.

• Each arrangement is called a permutation.




Permutation of n different objects

• For the first boy, we can choose from A, B or C (3 ways).

• Once the first boy is chosen, the second boy can be chosen from the 2 remaining boys (2 ways).

• The third boy has to be the remaining boy (1 way).

• Number of ways = 3 x 2 x 1 = 3! = 6

• Number of ways of arranging n different objects is n!

• n! = n(n-1)(n-2)….(3)(2)(1)




Permutation of n distinct objects

Example:

• How many ways can the letters A, B, C and D be arranged?

Approach:

• The 1st letter can be chosen in 4 ways (either A or B or C or D)

• The 2nd letter can be chosen in 3 ways.

• The 3rd letter can be chosen in 2 ways.

• The 4th letter can be chosen in only 1 way.

• Therefore, number of ways of arranging 4 letters

= 4! = 24




Permutation of n objects (not all distinct)

Example:

• Suppose that, instead of the letters A, B, C, D, we have the letters A, A, A, D

Approach:

• The 24 arrangements reduce to: AAAD, AADA, ADAA, DAAA

• The number of ways of arranging 4 objects, of which 3 are alike = 4! / 3! = 4

• The number of ways of arranging n objects of which p are of one type, q of another type, r of a third type and so on is n! / (p! q! r! …)
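The formula can be compared with a direct listing of the distinct arrangements. The function name arrangements is an assumption of this sketch.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def arrangements(items):
    """Number of distinct arrangements: n! / (p! q! r! ...)."""
    count = factorial(len(items))
    for repeats in Counter(items).values():
        count //= factorial(repeats)
    return count

letters = "AAAD"

# Brute force: generate all 4! orderings and keep the distinct ones
distinct = sorted({"".join(p) for p in permutations(letters)})

print(arrangements(letters), distinct)
```

For letters that are all different, such as ABCD, every repeat count is 1 and the formula reduces to n!.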




Permutation of r objects from n objects

• Consider the number of ways of placing 3 of the letters A, B, C, D, E, F, G in 3 empty spaces.

• The 1st space can be filled in 7 ways. The 2nd space can be filled in 6 ways. The 3rd space can be filled in 5 ways.

• Therefore, there are (7)(6)(5) = 210 ways of arranging 3 letters taken from 7 letters.

• Number of permutations of 3 objects taken from 7

= 7P3 = (7)(6)(5) = (7)(6)(5)(4)(3)(2)(1) / [(4)(3)(2)(1)] = 7!/4! = 7!/(7-3)!

• Number of permutations of r objects taken from n different objects is nPr = n!/(n-r)!




Combinations

• A combination is a selection of distinct objects where one combination differs from another only if the content of the selection differs. Order does not matter.

• The number of combinations of n different objects taken r at a time, denoted by nCr, is n! / [r! (n-r)!]

Proof:

We are interested in determining the number of combinations when there are n distinct objects to be selected r at a time. Since the number of permutations is the number of ways to select r objects from the n and then permute the r objects, we note that nPr = r! nCr

Hence nCr = nPr / r! = n! / [r! (n-r)!]
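The relation nPr = r! nCr can be verified numerically with the functions math.perm and math.comb from Python's standard library (available from Python 3.8). The values n = 7 and r = 3 are those of the letter example.

```python
from math import comb, factorial, perm

n, r = 7, 3

n_p_r = perm(n, r)   # ordered arrangements of r objects from n
n_c_r = comb(n, r)   # unordered selections of r objects from n

print(n_p_r, n_c_r)

# nPr = n! / (n-r)!  and  nPr = r! nCr
print(n_p_r == factorial(n) // factorial(n - r))
print(n_p_r == factorial(r) * n_c_r)
```

Each selection of 3 letters can be ordered in 3! = 6 ways, which is why there are 6 times as many permutations as combinations.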




Combinations: Example 1

• Four letters are chosen at random from the word RANDOMLY. Find the probability that all four letters chosen are consonants.

• Let S be the possibility space, then n(S) = 8C4 = 70

• Let E be the event ‘4 consonants are chosen’. As there are 6 consonants, n(E) = 6C4 = 15

• P(E) = n(E)/n(S) = 15/70 = 3/14





• Suppose a box contains 8 chip processors, 3 of which are defective. If 3 are sold at random, find the probabilities that:

– Exactly 2 are defective

– All 3 are defective

– At least 1 is defective

• Treating each chip as an individual entity, the total number of ways in which 3 chips can be sold is 8C3 = 56

Exactly 2 are defective

– Combinations of 2 defective chips and 1 good chip: 3C2 x 5C1

Hence, probability is (3C2 x 5C1) / 8C3 = (3x5) / 56 = 0.27





All 3 are defective

– Combinations of 3 defective chips: 3C3

Hence, probability is 3C3 / 8C3 = 1 / 56 = 0.018

At least 1 is defective

– This is the complement of no defective chips: 1 – P(no defective chip)

– Combinations of 3 good chips: 5C3

Hence, probability is 1 – 5C3 / 8C3 = 1 – 10/56 = 0.82
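The three answers follow one pattern, which can be sketched as a small function. The name p_defective and its default arguments are assumptions of this sketch; the counts (3 defective, 5 good, 3 sold) come from the example.

```python
from fractions import Fraction
from math import comb

def p_defective(k, defective=3, good=5, sold=3):
    """Probability that exactly k of the chips sold are defective."""
    ways = comb(defective, k) * comb(good, sold - k)
    return Fraction(ways, comb(defective + good, sold))

p_exactly_2 = p_defective(2)
p_all_3 = p_defective(3)
p_at_least_1 = 1 - p_defective(0)   # complement of "no defective chip"

for p in (p_exactly_2, p_all_3, p_at_least_1):
    print(p, round(float(p), 3))
```

Using the complement for "at least 1" avoids adding the cases of exactly 1, 2 and 3 defective chips separately.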




Proposed Solution

• The new perfume should have 6 different aromas with at least 1 aroma in each note (top, middle and base)

• Possible combinations are:

– 3 top notes + 2 middle + 1 base

…and so on. In total there are 9 different combinations of top, middle and base notes.




Proposed Solution

Total number of ways to create the perfume:

• 3 top notes + 2 middle + 1 base = 3C3 x 8C2 x 5C1 = 140

• 1 top note + 4 middle + 1 base = 3C1 x 8C4 x 5C1 = 1050

• Total = 5830

• Time required = (5830/100) x 3 = 174.9, i.e. about 175 days
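The total of 5830 can be checked by looping over every admissible split of the 6 compounds among the three note types. The variable names are assumptions of this sketch; the counts (3 top, 8 middle, 5 base) come from the problem statement.

```python
from math import ceil, comb

top, middle, base = 3, 8, 5   # shortlisted compounds per note type
size = 6                      # compounds in each perfume

# Every split (t, m, b) with at least one compound of each type
ways = {}
for t in range(1, top + 1):
    for m in range(1, middle + 1):
        b = size - t - m
        if 1 <= b <= base:
            ways[(t, m, b)] = comb(top, t) * comb(middle, m) * comb(base, b)

total = sum(ways.values())
days = ceil(total * 3 / 100)  # 3 days per 100 perfumes, rounded up to whole days

for split, count in ways.items():
    print(split, count)
print(total, days)
```

The loop finds the 9 splits mentioned in the proposed solution, and their counts add up to the quoted total.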




Learning Outcomes

• Permutations

• Combinations





P05 – Chance Winnings








Chance Winnings

Entrusted with raising funds, and inspired by his recent holiday to Las Vegas, James determines that the fruit machine is a sure bet. It has after all earned the informal name of the one-armed bandit owing to its appearance and its ability to leave the gamer penniless. Wanting to raise as much money as possible for the charity, James knows he must take into careful consideration the payout for each winning combination. If he pays out too much, he may end up making a loss over the three-day fund raiser. If the payout is too little, it might not generate any interest in playing the fruit machine. Having four windows, each showing at any one time either a lemon, an orange, an apple or cherries, the fruit machine has been configured to pay out when at least three windows show the same fruit. Knowing the probability of a window showing a particular fruit is 0.4 for lemons, 0.3 for oranges, 0.2 for apples and 0.1 for cherries, how should James set the cost of each play and the payout to support the fund-raising?



P05 – Chance Winnings




Discrete Random Variable

• A random variable (r.v.) is a variable whose values are determined by chance.

• Random variables are denoted by capital letters (X, Y, etc.) to distinguish them from their possible values given in lower case x, y.

• Discrete random variables can take on only a finite number of values or an infinite number of values that can be counted.

• Example: A die is thrown 6 times. Let X = number of 5’s obtained.

– Then X is a discrete r.v. and x = 0, 1, 2, 3, 4, 5, 6




Basic Properties of a pdf

• The probability distribution of a discrete r.v. X is a list of the possible values of X together with their probabilities

f(x) = P[X=x]

• The probability of each event in the sample space must lie between 0 and 1 inclusive:

0 ≤ P[X=x] ≤ 1 for all x

• The sum of the probabilities of all events in the sample space must equal 1:

∑all x P[X=x] = 1




Probability Density Function (pdf)

• A discrete PDF consists of all possible values that a discrete r.v. can take on, together with the associated probabilities.

• Example: Let X represent the outcomes when a fair die is tossed once. The pdf of X is:

x 1 2 3 4 5 6

P[X=x] 1/6 1/6 1/6 1/6 1/6 1/6

where P[X=x] means the probability that the r.v. X takes the value x.

Formula form: P[X=x] =1/6, for x=1,2,3,4,5,6




Expected Value E(X)

Let X be a discrete random variable. Then the expected value of X, also known as the mean of the r.v., is denoted by E(X):

E(X) = ∑all x xP(X=x)




Example

• Find E(X) for the pdf of a single throw of a fair die.

x 1 2 3 4 5 6

P[X=x] 1/6 1/6 1/6 1/6 1/6 1/6

Solution:

• E(X) = ∑all x xP(X=x)

= (1/6)(1) + (1/6)(2) + (1/6)(3) + (1/6)(4) + (1/6)(5) + (1/6)(6)

= 21/6 = 3.5




Properties of E(X)

• E(a) = a

• E(aX) = aE(X)

• E(aX + b) = aE(X) + b

• E[f(X) ± g(X)] = E[f(X)] ± E[g(X)]

where a and b are constant values




Example

The r.v. X has pdf P(X=x) for x = 1, 2, 3:

x 1 2 3

P(X = x) 0.1 0.6 0.3

Calculate E(3), E(X), E(5X), E(5X + 3), E(X²), E(4X² – 3)

E(3) = ∑all x 3P(X=x) = 3(0.1) + 3(0.6) + 3(0.3) = 3

E(X) = ∑all x xP(X=x) = 1(0.1) + 2(0.6) + 3(0.3) = 2.2




Example

E(5X) = ∑all x 5xP(X=x) = 5(0.1) + 10(0.6) + 15(0.3) = 11 = 5E(X)

E(5X + 3) = ∑all x (5x + 3)P(X=x) = 8(0.1) + 13(0.6) + 18(0.3) = 14 = 5E(X) + 3

E(X²) = ∑all x x²P(X=x) = 1(0.1) + 4(0.6) + 9(0.3) = 5.2

E(4X² – 3) = ∑all x (4x² – 3)P(X=x) = 1(0.1) + 13(0.6) + 33(0.3) = 17.8 = 4E(X²) – 3
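These expectations can be computed with one helper that sums g(x)P(X=x) over all x. The function name expect is an assumption of this sketch; the pdf is the one given in the example.

```python
pdf = {1: 0.1, 2: 0.6, 3: 0.3}   # P(X = x) for x = 1, 2, 3

def expect(g, pdf):
    """E[g(X)]: sum of g(x) P(X = x) over all x."""
    return sum(g(x) * p for x, p in pdf.items())

e_const = expect(lambda x: 3, pdf)
e_x = expect(lambda x: x, pdf)
e_5x = expect(lambda x: 5 * x, pdf)
e_5x_plus_3 = expect(lambda x: 5 * x + 3, pdf)
e_x2 = expect(lambda x: x ** 2, pdf)
e_4x2_minus_3 = expect(lambda x: 4 * x ** 2 - 3, pdf)

for value in (e_const, e_x, e_5x, e_5x_plus_3, e_x2, e_4x2_minus_3):
    print(round(value, 4))
```

The results agree with the properties E(aX + b) = aE(X) + b and E(4X² – 3) = 4E(X²) – 3, up to floating-point rounding.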




Variance, Var(X)

The variance of a discrete r.v. X measures the spread or deviation of the r.v. about its mean value µ = E(X). It is denoted by Var(X) or σ²:

Var(X) = E[(X – µ)²]

= E(X² – 2µX + µ²) = E(X²) – 2µE(X) + E(µ²)

= E(X²) – 2µ² + µ²

= E(X²) – µ²

= E(X²) – [E(X)]²




Example

The r.v. X has pdf shown below:

Find Var(X).

E(X) = 1(0.1) + 2(0.3) + 3(0.2) + 4(0.3) + 5(0.1) = 3

E(X^2) = 1(0.1) + 4(0.3) + 9(0.2) + 16(0.3) + 25(0.1) = 10.4

Var(X) = E(X^2) – [E(X)]^2 = 10.4 – 3^2 = 1.4

x 1 2 3 4 5

P(X = x) 0.1 0.3 0.2 0.3 0.1
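The same example can be reproduced with a short Python sketch that applies Var(X) = E(X^2) – [E(X)]^2 to the table. The helper name `mean_and_variance` is hypothetical and chosen only for this illustration.

```python
import math


def mean_and_variance(pmf):
    """Return (E(X), Var(X)) using Var(X) = E(X^2) - [E(X)]^2."""
    mean = sum(x * p for x, p in pmf.items())
    mean_of_squares = sum(x**2 * p for x, p in pmf.items())
    return mean, mean_of_squares - mean**2


# Probability table from the example: x -> P(X = x)
pmf = {1: 0.1, 2: 0.3, 3: 0.2, 4: 0.3, 5: 0.1}

mean, var = mean_and_variance(pmf)
print(round(mean, 4), round(var, 4))
```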



(12)

Properties of Var(X)

Var(c) = 0, where c is any constant

Var(cX) = c^2 Var(X)

Var(cX + d) = c^2 Var(X), where d is a constant

Proof:

Var(cX) = E(c^2 X^2) – [E(cX)]^2 = c^2 E(X^2) – [cE(X)]^2

= c^2 E(X^2) – c^2 [E(X)]^2 = c^2 (E(X^2) – [E(X)]^2)

= c^2 Var(X)



(13)

Discussion for Today’s Problem

Winning Combination    3 Lemons    3 Oranges    3 Apples    3 Cherries

Payout                 $1          $2           $3          $5

Winning Combination    4 Lemons    4 Oranges    4 Apples    4 Cherries

Payout                 $3          $4           $6          $12



(14)


P(lemons) = 0.4, P(oranges) = 0.3,

P(apples) = 0.2, P(cherries) = 0.1

Calculations for:

3 fruits of the same kind:

e.g. P(3 lemons) = 4C3 (0.4)^3 (1 – 0.4)^1 = 0.1536

4 fruits of the same kind:

e.g. P(4 oranges) = (0.3)^4 = 0.0081

P(James wins) = 1 – P(No winning combinations appear)



(15)


Let the cost of one play be $1 and let $x be James’ profit per play.

Outcome        x       P(X=x)

No win         $1      0.7062

3 lemons       $0      0.1536

3 oranges      -$1     0.0756

3 apples       -$2     0.0256

3 cherries     -$4     0.0036

4 lemons       -$2     0.0256

4 oranges      -$3     0.0081

4 apples       -$5     0.0016

4 cherries     -$11    0.0001

Expected profit per play, E(X) = $0.480

Variance of profit, Var(X) = 0.938
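The figures above can be reproduced with a minimal Python sketch. It assumes a machine with four independent reels (implied by the 4C3 term in the calculation) and the fruit probabilities and payouts listed in the discussion; all variable names are illustrative.

```python
from math import comb, isclose

# Probability of each fruit on one reel, and payouts in dollars.
p_fruit = {"lemon": 0.4, "orange": 0.3, "apple": 0.2, "cherry": 0.1}
payout3 = {"lemon": 1, "orange": 2, "apple": 3, "cherry": 5}
payout4 = {"lemon": 3, "orange": 4, "apple": 6, "cherry": 12}
cost = 1  # price of one play, in dollars

profit_pmf = {}  # James' profit per play -> probability


def add(profit, prob):
    """Accumulate probability for a profit value (some values repeat)."""
    profit_pmf[profit] = profit_pmf.get(profit, 0.0) + prob


for fruit, p in p_fruit.items():
    add(cost - payout3[fruit], comb(4, 3) * p**3 * (1 - p))  # exactly three
    add(cost - payout4[fruit], p**4)                         # all four

# Whatever probability is left corresponds to no winning combination.
add(cost, 1 - sum(profit_pmf.values()))

mean = sum(x * q for x, q in profit_pmf.items())
var = sum(x**2 * q for x, q in profit_pmf.items()) - mean**2
print(round(mean, 3), round(var, 3))
print(round(2 * mean, 3), round(4 * var, 3))  # doubled price and payouts
```

The last line uses E(2X) = 2E(X) and Var(2X) = 4Var(X) rather than rebuilding the table with doubled amounts.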



(16)


By doubling the payout and doubling the price to

play, the expected value of James’ profit will double

and the variance will increase by four times.

• E(2X) = 2E(X) = $0.961

• Var(2X) = 4Var(X) = 3.754

• James will want to increase his expected profit and reduce

the variance so that his earnings will be more certain.

• This can be achieved by reducing the number of winning

combinations (e.g. no win for 3 lemons), increasing the cost

of play and/or reducing the payout for the 3 lemons winning

combination



(17)

Learning Outcomes

• Discrete Random Variable

• Probability Density Function (pdf)

• Expectation

• Variance





P06 – UNDERWEIGHT OR NOT








Underweight or Not

You are a purchaser in a food company. Recently, your supplier for frozen chicken fillet has been bought over by a competitor and the company initiated a major change in the packaging and production method. Even though the agreement for the supply of the fillet remains unchanged, you are concerned that the amount of fillet in the new packaging may be different. One day, you carried out a sampling check on 40 packets of chicken fillet and collected the following data:

P6 Data.xls

Past records show that on the average, out of 100 packets of chicken fillet, 16 packets were underweight. Your company wants to know if the claim by the supplier that the weight of the packet remains unchanged is substantiated by the data. How do you decide from the data collected? If similar checks were to be carried out in the future, what acceptance criteria should be used?



P06 – Underweight or Not



(2)

Probability Problems with 2 Outcomes

• Many types of probability problems have only two outcomes or can be reduced to two outcomes.

• For example:

– When a coin is tossed, it can land heads or tails.

– When a baby is born, it will be male or female.

– In an examination, you either pass or fail.

• Situations that can be reduced to 2 outcomes:

– A medical treatment can be classified as effective or ineffective.

– A person can be classified as having normal or abnormal blood pressure, depending on the measure of the blood pressure.

– A multiple-choice question response, although there are 4 or 5 answer choices, can be classified as correct or incorrect.

• Situations like these are called binomial experiments.



(3)

Binomial Experiment

• A binomial experiment is an experiment that satisfies the following properties:

– Experiment consists of n repeated trials.

– Each trial has two possible outcomes: success or failure.

– Probability of success, denoted by p, is the same in each trial.

– Repeated trials are independent.

• Outcomes of a binomial experiment and the corresponding probabilities of these outcomes are called a binomial distribution.



(4)

Binomial Distribution

• Let X be the number of successes in n trials of a binomial experiment

• X is called a binomial random variable with pdf given by:

P(X = r) = nCr p^r (1 – p)^(n – r), where r = 0, 1, 2, …, n

• p is the probability of success.

• It can also be expressed as X ~ B(n,p).



Example: Positively and Negatively Skewed


(5)

[Figure: a right-skewed distribution and a left-skewed distribution]



(6)

An Example

A coin is tossed three times. Find the probability of getting exactly two heads.

– This problem can be solved by looking at the sample space:

HHH, HHT, HTH, THH, TTH, THT, HTT, TTT

– There are 3 ways to get 2 heads, therefore,

• P(exactly 2 heads) = 3/8 = 0.375



(7)

An Example – Coin Toss, P(2 heads)

• Consists of three trials (tosses)

• Each trial has only two possible outcomes: heads or tails

• Probability of success (head) = 0.5 for each trial

• Outcomes are independent of each other (the outcome of one toss does not affect the outcome of the other tosses)

Solution:

• Applying Binomial Distribution, let X be the random variable representing the number of heads

• X ~ B(3, 0.5)

• P(X = 2) = 3C2 (0.5)^2 (0.5)^1 = 0.375



(8)

An Example

There are five multiple choice questions in a test. Each question has five possible choices. If a student randomly guesses the answers to all five questions, find the probability that he gets exactly three correct.

Solution:

Let X be the r.v. representing the number of correct answers.

X ~ B(5, 0.2)

P(X = 3) = 5C3 (0.2)^3 (0.8)^2 = 0.0512
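Both worked examples follow the same formula, so they can be checked with one small function. This is a minimal sketch; the name `binomial_pmf` is an illustrative choice.

```python
from math import comb, isclose


def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p): nCr * p^r * (1 - p)^(n - r)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


coin = binomial_pmf(2, 3, 0.5)  # exactly 2 heads in 3 tosses
quiz = binomial_pmf(3, 5, 0.2)  # exactly 3 correct guesses out of 5
print(round(coin, 4), round(quiz, 4))
```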



(9)

Mean and Variance for Binomial Distribution

Let X ~ B(n,p), then

E(X) = µ = np

Var(X) = σ^2 = np(1 – p)



(10)

An Example

A die is rolled 480 times. Find the mean, variance and standard deviation of the number of 2’s obtained.

Solution:

Let X be the r.v. representing the number of 2’s obtained.

µ = np = 480 × 1/6 = 80

σ^2 = np(1 – p) = 480 × 1/6 × 5/6 = 66.67

σ = sqrt[np(1 – p)] = sqrt(66.67) = 8.16



(11)

Today’s Problem

Let X be the random variable representing the number of underweight fillets. Assuming the probability of a fillet being underweight is the same for all packets and the result of each weighing is independent,

X ~ binomial (40, 16/100)

Average number of underweight fillets in a sample of 40 is

E(X) = 0.16*40 = 6.4

The variance is Var(X) = 40*0.16*0.84 = 5.38 and the Standard Deviation = 2.32. This means that most of the checks should yield between 4 and 9 underweight packets (roughly one standard deviation either side of the mean).

Using Excel, work out the probability of each x (x is from 0 to 40) using:

BINOMDIST(x, 40, 16/100, 0)



(12)

Today’s Problem

From the graph, it is seen that the probability of getting exactly 8

packets of underweight fillets from the check is 0.125. However, we

cannot make our decision on this probability.



(13)

Today’s Problem

It is more useful to determine the cumulative probability in setting the acceptance criterion. For example, if the acceptance criterion is 9 or fewer underweight packets, then the probability of getting more than 9 underweight packets when the proportion is unchanged is 1 – 0.90 = 0.10, which is unlikely.

x P(X<=x)

0 0.00

1 0.01

2 0.03

3 0.10

4 0.21

5 0.37

6 0.54

7 0.70

8 0.82

9 0.90

10 0.95

11 0.98

12 0.99

13 1.00
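The cumulative table can also be generated outside Excel. The sketch below assumes X ~ B(40, 0.16) as stated in the slides; the function names and the 90% target used to pick a criterion are illustrative assumptions, not a prescribed rule.

```python
from math import comb


def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def binomial_cdf(x, n, p):
    """P(X <= x), obtained by summing the pmf from 0 to x."""
    return sum(binomial_pmf(r, n, p) for r in range(x + 1))


n, p = 40, 0.16
mean = n * p
variance = n * p * (1 - p)

# Reproduce the cumulative table for x = 0..13.
for x in range(14):
    print(x, round(binomial_cdf(x, n, p), 2))

# Smallest count c such that P(X <= c) is at least the chosen target.
target = 0.90
criterion = next(c for c in range(n + 1) if binomial_cdf(c, n, p) >= target)
print("accept if underweight packets <=", criterion)
```

A stricter or looser target simply shifts the criterion along the cumulative table, which is the trade-off the acceptance decision has to weigh.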



(14)

Learning Outcomes


• Binomial Distribution

– Properties

– Probabilities

– Mean

– Variance





P07 – ENOUGH AMBULANCES?








Enough Ambulances?

A Straits Times article on Jan 28, 2009 reported that the Singapore Civil Defence Force (SCDF) is planning to open up its emergency ambulance service to the private sector. This is in response to the growing number of emergency calls, which is expected to increase further with a growing and aging population. SCDF wants to add ten more private ambulances to its current fleet of forty emergency ambulances. According to the report, a total of 111,127 emergency calls were made to SCDF last year, 9 per cent more than the calls received in 2007. The number of prank calls to the emergency lines also went up to at least 11 calls per day. Being a statistics student, you are naturally curious about the numbers. Doing a quick mental calculation, you worked out that the average number of calls made daily is 316 including prank calls. However, knowing that emergencies occur randomly, you wonder if adding ten more ambulances will be sufficient for SCDF’s needs. How do you think the statistical nature of the problem is considered in the planning? Assuming that your team is a consultant to SCDF, conduct a study based on the numbers given, make reasonable assumptions and present your findings.




P07 – Enough Ambulances?



(2)

Recall: Probability Distribution

• A probability distribution lists all the outcomes of an experiment and the probabilities associated with each outcome.

• It describes the likelihood of some future event.

• Two important characteristics of a probability distribution are:

– The probability of a particular outcome is between 0 and 1, inclusive.

– The sum of the probabilities of all mutually exclusive events is 1.0.



(3)

Recall: Random Variable

• In any experiment of chance, the outcomes occur randomly. A quantity whose value is determined by the outcome is called a Random Variable.

• Random Variables can be Discrete or

Continuous.

– Discrete random variables can assume only certain

clearly separated values (countable).

– Continuous random variables can assume one of an

infinitely large number of values (measurable)



(4)

Poisson Distribution

• Poisson probability distribution describes the number of times some event occurs during a specified interval.

• Interval may be time, distance, area, or volume.

• Poisson distribution is based on two assumptions:

– Probability of a “success” is proportional to the length of the interval

– Intervals are independent

• The longer the interval, the larger the probability; and the number of occurrences in one interval does not affect the other intervals

• It is a discrete probability distribution because it is formed by counting.



(5)

Poisson Random Variable

Examples of Poisson Random Variables:

• Number of people who arrived at a hospital emergency

room in 1-hour interval

• No of customers queuing up at a POSB bank counter

• Number of flaws (cracks and deep scratches) in an area

of ceramic flooring in a newly built HDB flat

Interval

In a Poisson process, events

occur at random in an interval



(6)

Poisson Distribution

The Poisson Distribution is given by the formula:

$P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$, for x = 0, 1, 2, 3, …

Where:

λ is the mean number of occurrences (successes) in a particular interval

x is the number of occurrences (successes)

e is the constant 2.71828 (base of the natural logarithm)

P(x) is the probability for a specified value of x

When X is a Poisson variable, we write

X ~ Poisson(λ), or,

X ~ Po(λ)



(7)

Characteristics of Poisson Distribution

Expected Value:

E(X) = λ

Variance:

Var(X) = np(1 – p) = λ(1 – p), where λ = np

As p tends to zero, Var(X) = λ

Poisson Distribution has the same Expected Value and Variance.



(8)

Approximation of Poisson Distribution

• Poisson Distribution can be used to approximate binomial distribution B(n,p) when n is large and p is small

• 2 general rules-of-thumb:

– n≥20 and p≤0.1 or

– n≥100 and np≤10



(9)

Example

Given λ = 5 cars arriving in a 5-minute period,

• Probability of 8 cars arriving in a 5-minute period,

P(X=8) = (5^8 × e^(-5))/8! = 0.065

• Probability of more than 6 cars arriving in a 5-minute period,

P(X>6) = 1 – P(X<=6) = 0.24

[Using Excel, 1 – POISSON(6,5,1)]

• Mean number of cars arriving in 1 hour = 12 x 5 = 60
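For readers without Excel, the car-arrival example can be checked with a minimal Python sketch. The helper names `poisson_pmf` and `poisson_cdf` are illustrative; the second mirrors what the cumulative form of Excel's POISSON function returns.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Po(lam): lam^x * e^(-lam) / x!"""
    return lam**x * exp(-lam) / factorial(x)


def poisson_cdf(x, lam):
    """P(X <= x), obtained by summing the pmf from 0 to x."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))


lam = 5  # mean number of cars per 5-minute period
p_eight = poisson_pmf(8, lam)
p_more_than_six = 1 - poisson_cdf(6, lam)
cars_per_hour = 12 * lam  # twelve 5-minute periods in an hour

print(round(p_eight, 3), round(p_more_than_six, 2), cars_per_hour)
```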



(10)

Example

[Figure: Poisson probability distribution (λ = 5), plotted for x = 1 to 15; vertical axis from 0 to 0.2]



(11)

Today’s Problem

• 111,127 calls were received in 2008. Projecting a 10% increase, the number of calls in 2009 would be 122,240.

• Average no. of calls in a day is 335 + 12 (prank calls) = 347

• Average no. of calls per hour is 14.5

• A few assumptions are required:

– The average duration an ambulance is engaged during a call is 1 hour

– The distribution of calls throughout the day is not uniform. Assume that there is a peak hour each day and that the number of calls during the peak hour is 2 times the average no. of calls, i.e. 29

– All 40 ambulances are available at all times (together with the associated manpower and equipment)



(12)

Today’s Problem

• Is the number of ambulances sufficient? To answer this question, we have to calculate the probability that more than 40 calls are made in 1 hour:

P(X > 40) = 1 – P(X ≤ 40)

• From Excel, Poisson(40,29,1) = 0.979

• Hence there is approximately a 2% chance of running out of ambulances

• This may seem low, but if we assume that the peak hour occurs every day, then in one year there are more than 7 incidents where there is a shortage of ambulances. It could mean 7 lives lost!



(13)

Today’s Problem

• If 10 ambulances are added, from Excel, Poisson(50,29,1) = 0.9999. There is almost zero chance of shortage.

• In this case, will there be too many ambulances? What percent risk is acceptable?

• Do bear in mind this is a statistical exercise (see footnote). Other factors that should be considered include the availability of manpower and equipment, the reliability of vehicle, the response time requirement, the likelihood of a disease outbreak, cost involved, etc. Can you think of any others?
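The shortage probabilities can be reproduced with a short Python sketch under the same assumptions as the slides: peak-hour calls follow a Poisson distribution with mean 29, and each call occupies one ambulance for the whole hour. The function and variable names are illustrative.

```python
from math import exp, factorial


def poisson_cdf(x, lam):
    """P(X <= x) for X ~ Po(lam)."""
    return sum(lam**k * exp(-lam) / factorial(k) for k in range(x + 1))


peak_rate = 29  # assumed mean number of calls in the peak hour

for fleet in (40, 50):
    p_shortage = 1 - poisson_cdf(fleet, peak_rate)
    # Expected number of peak hours per year with more calls than ambulances,
    # assuming one peak hour every day.
    per_year = 365 * p_shortage
    print(fleet, round(p_shortage, 4), round(per_year, 2))

p_short_40 = 1 - poisson_cdf(40, peak_rate)
p_short_50 = 1 - poisson_cdf(50, peak_rate)
```

Changing `peak_rate` or the fleet sizes shows how sensitive the conclusion is to the peak-hour assumption, which is itself only a planning guess.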



(14)

Learning Outcomes

• Poisson Probability Distribution

• Poisson Random Variables

• Characteristics of a Poisson Distribution





P08 – OF PISTONS AND CYLINDERS








Of Pistons and Cylinders

You are an engineer working for an engine manufacturer. Your company has received a few complaints from customers about a recently launched engine. The engine sometimes does not perform to specifications and fails to deliver the stated torque. An investigation reveals that the cause is sub-performance of the main piston. In order to perform optimally, strict specifications require that the gap between the piston and cylinder be between 0.12 and 0.40mm. After eliminating the likelihood of problems in the assembly process, you turn your attention to the dimensions of the piston and cylinder. A request is made to the supplier of the components for the exact diameter measurements of all the pistons and cylinders delivered so far.

Data.xls

What can you conclude from the data? Suppose the supplier offers a new type of piston that is touted to deliver better performance. Your colleague John asked you to conduct a sampling check on the diameter of the new pistons. You need to measure the diameters for 10 pistons and determine the mean diameter. John said that if the mean falls within one standard deviation (of the mean of the old pistons), the new pistons should be accepted. Is John’s approach correct?




P08 – Of Pistons and Cylinders



(2)

Why Study Normal Distribution?

• Certain probability densities have so much importance

in statistics that areas under the curve have been

tabulated for future reference.

• One such distribution is the Normal, or Bell-shaped,

Distribution.

• This distribution is useful for describing variability in

industrial measurements such as lengths or weights.

• Natural variation in living organisms and their

characteristics also tend to follow a Normal

Distribution.



(3)

Properties of Normal Distribution Curve

• The normal distribution curve is bell-shaped.

• The mean, median and mode are equal and located at the centre of the

distribution.

• The curve is symmetrical about the mean.

• The standard deviation (σ) specifies the amount of dispersion around the mean.

• Two parameters, µ and σ, completely define a normal distribution curve.

• The further away from the mean the curve moves, the closer it gets to the x-axis, but it never touches it.

• The curve is represented by the formula:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$

where μ = mean, σ = std deviation, e = 2.718282, -∞ < x < ∞



(4)

Histogram for blood pressure measurements

for sample of 118 men

[Figure: histogram; horizontal axis: Systolic BP (mmHg), 80 to 160; vertical axis: Percentage of Men]



(5)

Histogram for blood pressure measurements

for sample of 5000 men

[Figure: histogram; horizontal axis: Blood Pressure (mmHg), 80 to 140; vertical axis: Percentage of Men]



(6)

Examples of Normal Distribution Curves

Typical normal distribution

with mean=5 and variance=1

Two normal distributions with different

mean values and same variance

Two normal distributions with different variances

and the same mean



(7)

The Normal Distribution

• Each normal density curve is completely defined by two parameters

– mean (average), represented by μ, and,

– standard deviation, represented by σ.



(8)

Interpreting the Normal Curve

[Figure: normal curve with the region between a and b shaded]

• Probability = Area under the curve = shaded region

• P (a<X<b) = area under the curve between a and b

Area under the curve is obtained using NORMDIST in Excel or, in the case of a Standard Normal Distribution, NORMSDIST.



(9)

Standard Normal Distribution

• Standard Normal Distribution is a distribution with mean 0 and variance 1

• It is represented by the standard normal variable Z where

Z = (X - µ) / σ



Linear Combinations of Normal R.V.

• Linear combinations of normal random variables are also

normally distributed

Linear Functions

• If X ~ N(µ, σ^2) and a and b are constants, then

Y = aX + b ~ N(aµ + b, a^2 σ^2)

Sum of Two Independent Normal R.V.

• If X1 ~ N(µ1, σ1^2) and X2 ~ N(µ2, σ2^2) are independent r.v., then

Y = X1 + X2 ~ N(µ1 + µ2, σ1^2 + σ2^2)

(10)



Sampling

Sampling is a process of selecting a subset of data from the

population

Reasons for Sampling:

• Time Constraints

• Cost Constraints

• Impossibility of a Census

• Population is infinite

• Measuring process is destructive



Take k samples each of size n and calculate the sample mean for each

sample. Using these sample means, a distribution known as sampling

distribution of the mean can be obtained.

Sample    Observations              Sample Mean

1         X1,1  X1,2  ……  X1,n      x̄1

2         X2,1  X2,2  ……  X2,n      x̄2

k         Xk,1  Xk,2  ……  Xk,n      x̄k

Essentially, sampling distribution is the distribution of values for a sample

statistic obtained from repeated samples, all of the same size and all

drawn from the same population.

Sampling Distribution of Sample Mean




Let X1, X2, …, Xn denote a random sample selected from a population having mean µ and variance σ^2.

The Central Limit Theorem states that as the sample size n increases (rule of thumb, n ≥ 30), the sampling distribution of the sample mean x̄ will:

1. Have a mean µ_x̄ = µ

2. Have a standard deviation σ_x̄ = σ / √n

3. Be approximately normally distributed

• The sampling distribution is normal if the population is normally distributed. For other types of population, it will approximate a normal distribution when n is large (rule of thumb, n ≥ 30)

• The standard deviation of the sample mean is known as the standard error of the sample mean. It indicates how accurately the sample mean estimates the ‘true’ mean.

Central Limit Theorem




Central Limit Theorem (CLT)

[Figure: sampling distributions of the sample mean for Normal, Uniform and Exponential population distributions, shown for samples of n=2, n=5 and n=30]

Significance of CLT

It permits us to use sample statistics to make inferences about the population parameters without knowing anything about the specific shape of the population distribution.



(15)

Today’s Problem

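• Let the gap between piston and cylinder be

Y = X2 – X1

where X1 is the piston diameter and X2 is the cylinder diameter.

• It follows that Y is normally distributed, as it is a linear combination of the normal random variables X1 and X2.

• Mean µY = µX2 – µX1 = 25.26 – 25.00 = 0.26

• Variance σY^2 = σX2^2 + σX1^2 = 0.06^2 + 0.08^2 = 0.0094 (the standard deviations shown appear to be rounded; taken exactly, 0.06 and 0.08 would give 0.0100)

• Hence Y ~ N(0.26, 0.0094)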



(16)

Today’s Problem

• A piston will not fit in a cylinder when the gap is less than zero. The probability of this is

P(Y < 0) = P(z < (0 – µY)/σY)

= P(z < (0 – 0.26)/√0.0094) = P(z < –2.68)

= 0.0037

• A piston will perform optimally when the gap is between 0.12 mm and 0.40 mm. The probability of this is

P(0.12 < Y < 0.40) = P[(0.12 – µY)/σY < z < (0.40 – µY)/σY]

= P[(0.12 – 0.26)/√0.0094 < z < (0.40 – 0.26)/√0.0094]

= P(z < 1.44) – P(z < –1.44) = 0.851
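These two probabilities can be checked with Python's standard library, which provides a normal distribution class. The sketch below takes the mean and variance of the gap exactly as stated in the slides (0.26 and 0.0094); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_gap = 25.26 - 25.00   # mean gap in mm, as given in the slides
var_gap = 0.0094         # variance of the gap, as given in the slides

gap = NormalDist(mu=mu_gap, sigma=sqrt(var_gap))

p_no_fit = gap.cdf(0.0)                     # P(Y < 0): piston does not fit
p_optimal = gap.cdf(0.40) - gap.cdf(0.12)   # P(0.12 < Y < 0.40)

print(round(p_no_fit, 4), round(p_optimal, 3))
```

Roughly 15% of piston and cylinder pairs therefore fall outside the optimal gap under these figures, which is consistent with the customer complaints described in the problem.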



(17)

Today’s Problem

• Assuming that diameters of the new pistons have the same distribution as the old ones, the distribution of the sample mean can be written as:

X̄ ~ N(µ, σ^2/10)

• If the underlying distribution of the new piston diameters is unknown or cannot be assumed to be normal, then we can increase the sample size to 30 or more pistons so that, by the Central Limit Theorem, the sample mean is approximately normally distributed.



(18)

Today’s Problem

• The interval [24.94, 25.06] covers piston diameters within one standard deviation (σ = 0.06) of the mean (µ = 25.00).

• Probability that the sample mean x̄ lies within one s.d. of the mean is

P(µ – σ < x̄ < µ + σ)

= P[(µ – σ – µ)/√(σ^2/10) < z < (µ + σ – µ)/√(σ^2/10)]

= P(z < 0.06/√(0.06^2/10)) – P(z < –0.06/√(0.06^2/10))

= P(z < 3.16) – P(z < –3.16)

= 0.9984

[Figure: the population distribution and the narrower distribution of the sample mean, with the 1 S.D. interval marked on the x axis]



(19)

Today’s Problem

• Although there is a 68% chance that the diameters of the old pistons lie within one standard deviation of the mean, the average diameter of the ten new pistons has a 99.8% chance of falling within the interval if both have the same distribution.

• The acceptance criteria should be based on comparison with population mean and not the distribution of the population.

• Hypothesis testing of the mean should be conducted to check whether the diameter of the new pistons is the same.
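The contrast between a single piston and the mean of ten pistons can be illustrated numerically. This is a minimal sketch assuming the old piston diameters are normal with µ = 25.00 and σ = 0.06, as used in the slides; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 25.00, 0.06, 10

single = NormalDist(mu, sigma)                  # one piston diameter
sample_mean = NormalDist(mu, sigma / sqrt(n))   # mean of n pistons

lo, hi = mu - sigma, mu + sigma                 # John's acceptance interval
p_single = single.cdf(hi) - single.cdf(lo)
p_mean = sample_mean.cdf(hi) - sample_mean.cdf(lo)

print(round(p_single, 4), round(p_mean, 4))
```

Because the sample mean varies far less than an individual diameter, an interval that is one population standard deviation wide corresponds to about 3.16 standard errors, so it would accept almost any sample drawn from the old distribution and is a weak test of whether the new pistons differ.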



(20)

Learning Outcomes

• Properties of a Normal Distribution Curve

• Standard Normal Distribution

– Standard Normal Variable z

– Applications of the Standard Normal

Distribution Table

• Central Limit Theorem

• Sampling Distribution of Sample Mean





P09 – CASE OR NO CASE








Case or No Case

A consumer product company, A & B, has been producing its popular foam facial wash for the past 10 years. A & B has always been proud of its ability to provide consumers with exciting foam height per pump of its liquid-to-foam facial foam. In a recent advertising campaign, A & B’s endorsing artiste mentioned on national TV that the average foam height is 75 millimeters. Foam height is approximately normally distributed and has a standard deviation of 5 millimeters. Vivian, an avid blogger, has been intrigued by A & B’s claim. She decided to conduct her own experiment to challenge the claim that the average foam height is 75 millimeters. Vivian obtained 50 foam height data points, attached below:

P09_Foam Height Data_AllTeams.xlsx

Vivian looked at the data and concluded that she can reject A & B’s claim, so she wrote about her findings in her blog. Vivian wondered whether her analysis would be affected if the standard deviation were unknown, and whether the sample size needed for a predefined error can be estimated. Is Vivian’s conclusion valid? How will you help to address Vivian’s doubts?

Illustrative Figure on foam height



P09

Case or No Case



(2)

Statistical Hypotheses

• Many problems in daily life require that we decide whether to accept or reject a statement about some parameter

• The statement is called a hypothesis, and the decision-making procedure about the hypothesis is called hypothesis testing.

• A hypothesis is thus a claim or statement about a property of a population.



(3)

Terms used

• Significance Level is the probability of making a decision to reject the null hypothesis when the null hypothesis is actually true.

• Critical Region is the set of values for which we reject the null hypothesis.

• Critical Values determine the boundary between a decision whether or not to reject the null hypothesis.



(4)

Forming the Hypothesis

• Manufacturer’s claim

– The average foam height per pump is 75 mm.

– This claim is commonly referred to as the null hypothesis, H0.

– The null hypothesis is presumed true unless we have enough evidence to reject it.

• Blogger’s suspicion

– The average foam height per pump is NOT 75 mm.

– This is commonly referred to as the alternative hypothesis, H1.



(5)

Null and Alternative Hypothesis

• Null Hypothesis

– The null hypothesis is a statement of the value of a population parameter.

– It states that the population mean equals the claimed value:

H0 : μ = 75

• Alternative Hypothesis

– The alternative hypothesis (denoted by H1) is the statement that must be true if the null hypothesis is false.

H1 : μ ≠ 75

– This is a two – tailed test.



(6)

One – Tailed and Two – Tailed Test

• Example:

Hypothesis Testing Problem Null and Alternative Hypothesis

Mean burn rate is not 50 cm/s H0 : μ = 50

H1 : μ ≠ 50

Mean burn rate is less than 50 cm/s H0 : μ = 50

H1 : μ < 50

Mean burn rate is more than 50 cm/s H0 : μ = 50

H1 : μ > 50



(7)

One – Tailed Test

• One – Tailed test

– In such a test, the critical region is in the region of the inequality of the alternative hypothesis (i.e. < will be left, > will be right).

Null and Alternative

Hypothesis

Acceptance Region

H0 : μ = 50

H1 : μ < 50

H0 : μ = 175

H1 : μ > 175



(8)

Two – Tailed Test

• Two – Tailed Test

– In such a test, the critical region is split into two parts, with (usually) equal probability placed in each tail of the distribution of the test statistic.

Null and Alternative

Hypothesis

Acceptance Region

H0 : μ = 50

H1 : μ ≠ 50



(9)

Test Statistic

• Hypothesis Tests on Mean

– When conducting hypothesis testing on the mean of a normally distributed population, the variance can be either known or unknown, resulting in a different sampling distribution.

Variance Known?                 Sampling Distribution                                Test               Test Statistic

Known (or large sample size)    Normal Distribution                                  Z – Test           $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

Unknown                         1 Sample t – Distribution (n-1 degrees of freedom)   1 Sample t – Test  $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$



(10)

Critical Region and Value

• The critical region is the set of all values of the test statistic that

would cause rejection of the null hypothesis.

• The critical value is the value separating the critical region from

the values of the test statistic that would not lead to rejection of

the null hypothesis.

Significance Level             One – Tailed Test    Two – Tailed Test

5% (95% Confidence Level)      -1.645               -1.96, 1.96



(11)

Critical Region and Value

Significance Level             One – Tailed Test

5% (95% Confidence Level)      -1.645

= NORMSINV(0.05) = -1.645



(12)

Critical Region and Value

Significance Level             Two – Tailed Test

5% (95% Confidence Level)      -1.96, 1.96 (α/2 = 0.025 in each tail)

= NORMSINV(0.025) = -1.96

= NORMSINV(0.975) = 1.96



(13)

Test Statistic: Example 1

• Suppose we are interested in the burn rate of a solid propellant used to power aircrew escape systems. It has been claimed that the mean burn rate is 50 cm/s, and we want to test whether the mean burn rate is not 50 cm/s.

– Given σ = 2.5 cm/s; n = 50 (large sample size, thus Normal sampling distribution); x̄ = 50.25 cm/s; 95% Confidence Level

Sol:

H0: µ = 50 cm/s

H1: µ ≠ 50 cm/s

Normal Sampling Distribution:

$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{50.25 - 50}{2.5 / \sqrt{50}} = 0.707$

At the 95% confidence level, the critical values are -1.96 and 1.96.

Since z = 0.707 is in the acceptance region, we will not reject H0.



(14)

Test Statistic: Example 2

• Suppose we are interested in the burn rate of a solid propellant used to power aircrew escape systems. It has been claimed that the mean burn rate is 50 cm/s, and we want to test whether the mean burn rate is not 50 cm/s.

– Given σ = 2.5 cm/s; n = 50 (large sample size, thus Normal sampling distribution); x̄ = 46.55 cm/s; 95% Confidence Level

Sol:

H0: µ = 50 cm/s

H1: µ ≠ 50 cm/s

$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{46.55 - 50}{2.5 / \sqrt{50}} = -9.76$

At the 95% confidence level, the critical values are -1.96 and 1.96.

Since z = -9.76 is NOT in the acceptance region, we will reject H0.
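The two burn-rate examples differ only in the sample mean, so a small function makes the comparison explicit. This is a minimal sketch of a two-tailed Z-test with known σ; the function name `z_test_two_tailed` and its return layout are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def z_test_two_tailed(x_bar, mu0, sigma, n, alpha=0.05):
    """Return (z, critical value, reject H0?) for H0: mu = mu0 vs H1: mu != mu0."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 when alpha = 0.05
    return z, z_crit, abs(z) > z_crit


z1, crit, reject1 = z_test_two_tailed(50.25, 50, 2.5, 50)  # Example 1
z2, _, reject2 = z_test_two_tailed(46.55, 50, 2.5, 50)     # Example 2

print(round(z1, 3), round(crit, 2), reject1)
print(round(z2, 2), round(crit, 2), reject2)
```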



(15)

Proposed Solution (Known Variance)

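• Hypothesis:

– H0: µ = 75 mm

– H1: µ ≠ 75 mm

$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{80.22 - 75}{5 / \sqrt{50}} = 7.382$

At the 95% confidence level, the critical values are -1.96 and 1.96.

Since z = 7.382 is NOT in the acceptance region, we will reject H0.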



(16)

Proposed Solution (Unknown Variance)

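• Hypothesis:

– H0: µ = 75 mm

– H1: µ ≠ 75 mm

1 Sample t-Test Sampling Distribution:

$t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{80.22 - 75}{5.877 / \sqrt{50}} = 6.28$

At the 95% confidence level with n-1 = 49 degrees of freedom, the critical values are -2.01 and 2.01. These can also be obtained with the Excel function TINV(0.05,49), which takes the two-tailed probability.

Since t = 6.28 is NOT in the acceptance region, we will reject H0.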



(17)

Proposed Solution (Estimate Sample Size)

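• Sample size for a predefined error E:

– Given: α = 0.1, E = 1.5 mm, σ = 5 mm,

$n = \left( \frac{z_{\alpha/2} \, \sigma}{E} \right)^{2} = \left( \frac{(1.645)(5)}{1.5} \right)^{2} = 30.1$, rounded up to n = 31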
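The sample-size calculation can be sketched in a few lines of Python. The function name `required_sample_size` is hypothetical; it assumes a known population standard deviation and a two-sided error bound, and always rounds up because a fractional observation cannot be collected.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(sigma, error, alpha):
    """Smallest n such that z_(alpha/2) * sigma / sqrt(n) <= error."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # 1.645 when alpha = 0.1
    return ceil((z * sigma / error) ** 2)


n_required = required_sample_size(sigma=5, error=1.5, alpha=0.1)
print(n_required)
```

Halving the allowed error roughly quadruples the required sample size, since n grows with the square of σ/E.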



(18)

Proposed Solution (HT Methodology)

1. From Problem, identify parameter of interest

2. State Null Hypothesis, H0

3. Specify appropriate alternative hypothesis, H1

4. Choose a significance level, α

5. Determine an appropriate test statistic

6. State the rejection region for the statistic

7. Compute any necessary sample quantities, substitute these into the equation for the test statistic, and compute that value

8. Decide whether or not H0 should be rejected and report in problem context



(19)

Learning Outcomes

• Hypothesis Testing

– Null and Alternative Hypothesis (One-tailed and Two-tailed)

– Significance Level

– Test Statistic

– Methodology

• Z – Test

• 1 Sample t – Test

• Estimate Sample Size





P10 – CHARGED OR RECHARGE








Charged or Recharge

The advertised claim for ABC batteries for mobile phones is set at 48 operating hours, with proper charging procedures. A study of 5000 batteries is carried out and 7 stop operating prior to 48 hours. Do these experimental results support the claim that less than 0.2 percent of the company’s batteries will fail during the advertised time period, with proper charging procedures?

Perform a hypothesis-testing procedure with α = 0.05, discussing the errors that could arise when a wrong decision is made from the result. Can you also estimate the confidence interval for the percentage of batteries that will fail, based on the experimental results? What is the relationship between the confidence interval and the hypothesis testing outcome?



P10

Charged or Recharge



(2)

Proportions

• Proportions provide useful information in summary

• Hypothesis testing can be applied not only to absolute data (such as sample mean), but also for population proportions.



(3)

Significance of Proportions



(4)

Assumptions

Certain assumptions must be made when testing a claim about a population proportion, probability or percentage:

1. The conditions for a binomial experiment are satisfied.

That is, there are a fixed number of independent trials having constant probabilities, and each trial has only two possible outcomes.

2. The conditions np ≥ 5 and nq ≥ 5 are both satisfied, so that the binomial distribution of sample proportions can be approximated by a normal distribution with µ = np and σ = √(npq)

Terms used

Notations used for hypothesis testing of one proportion:

n = number of trials

p̂ = x/n (sample proportion), where x is the number of ‘success’ considered

p = population proportion

q = 1 – p

Test Statistics

$Z = \frac{\hat{p} - p}{\sqrt{pq / n}}$

(6)

Recall: Steps in Hypothesis Testing

1. From Problem, identify parameter of interest

2. State Null Hypothesis, H0

3. Specify appropriate alternative hypothesis, H1

4. Choose a significance level, α

5. Determine an appropriate test statistic

6. State the rejection region for the statistic

7. Compute any necessary sample quantities, substitute these into the equation for the test statistic, and compute that value

8. Decide whether or not H0 should be rejected and report in problem context



(7)

Proposed Solution

Step 1: The parameter of interest is the proportion of batteries that fail during the advertised period, p

Step 2: Null Hypothesis, H0: p = 0.002

Step 3: Alternative hypothesis, H1: p < 0.002

(This formulation will allow manufacturers to make a strong claim about the proportion of batteries that fail if the null hypothesis, H0: p = 0.002, is rejected)

Step 4: Chosen significance level is α = 0.05



(8)

Proposed Solution

Step 5: The test statistic to be used is:

$$Z_0 = \frac{\hat{p} - p}{\sqrt{pq/n}}$$

Step 6: Reject H0: p = 0.002 if Z0 < -Z0.05 = -1.645

Step 7: Compute the test statistic:

$$Z_0 = \frac{(7/5000) - 0.002}{\sqrt{\dfrac{0.002 \times 0.998}{5000}}} = -0.950$$



(9)

Proposed Solution

Step 8: Conclusions:

Since Z0 = -0.95 is not less than the critical value -1.645, we cannot reject H0. At α = 0.05, the experimental results do not support the claim that less than 0.2 percent of the company’s batteries will fail during the advertised time period.
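Steps 5 to 8 can be checked numerically. The following is a minimal Python sketch using only the standard library; the function name `one_proportion_z` is an illustrative choice, not part of the course material, and the normal approximation is assumed to hold (np = 10 ≥ 5).

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(x, n, p0):
    """Return (z, left-tail p-value) for H0: p = p0 against H1: p < p0."""
    p_hat = x / n                      # sample proportion
    se = sqrt(p0 * (1 - p0) / n)       # standard error under H0
    z = (p_hat - p0) / se
    return z, NormalDist().cdf(z)      # area to the left of z

z0, p_value = one_proportion_z(x=7, n=5000, p0=0.002)
z_crit = NormalDist().inv_cdf(0.05)    # left-tail critical value, about -1.645
reject = z0 < z_crit

print(f"Z0 = {z0:.3f}, critical value = {z_crit:.3f}, p-value = {p_value:.3f}")
print("Reject H0" if reject else "Cannot reject H0")
```

The p-value route and the critical-value route always lead to the same decision; here both indicate that H0 cannot be rejected.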



(10)

Type I (α) and Type II (β) errors

Decision | H0 is Actually True | H0 is Actually False
Fail to reject H0 | No error (1 - α) | Type II error, β (failing to reject a false null hypothesis)
Reject H0 | Type I error, α (rejecting a true null hypothesis) | No error (1 - β)



Type I (α) and Type II (β) errors

• β usually cannot be calculated as it depends on the actual difference between the hypothesized value of the parameter and the true value (we don’t know the true value!).

• 1 - β is also known as the Power of a Test. It measures the sensitivity of the test to detect a real difference in parameters if one actually exists.

• For a fixed sample size, a larger α results in a smaller β, and a smaller α results in a larger β.

• To increase the Power, either increase the value of α and/or increase the sample size (which would reduce β as well). A larger sample would narrow the confidence interval of the sample parameter and increase the ‘precision’ of the experiment.
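Although the true value is unknown, β can be explored for assumed true values. The sketch below does this for the battery test (H0: p = 0.002, H1: p < 0.002, n = 5000) using the normal approximation. The function name and the "true" failure rates are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_left_tailed(p0, p_true, n, alpha=0.05):
    """Approximate power of the left-tailed one-proportion z-test."""
    z_crit = NormalDist().inv_cdf(alpha)
    # Largest sample proportion that still leads to rejecting H0
    cutoff = p0 + z_crit * sqrt(p0 * (1 - p0) / n)
    # Sampling distribution of p-hat if the true proportion is p_true
    se_true = sqrt(p_true * (1 - p_true) / n)
    return NormalDist().cdf((cutoff - p_true) / se_true)

for p_true in (0.0015, 0.0012, 0.0010):      # hypothetical true failure rates
    power = power_left_tailed(0.002, p_true, 5000)
    print(f"true p = {p_true}: power = {power:.2f}, beta = {1 - power:.2f}")

# Effect of a larger sample and of a larger alpha for the same assumed true p
print(power_left_tailed(0.002, 0.0015, 20000))
print(power_left_tailed(0.002, 0.0015, 5000, alpha=0.10))
```

Under these assumptions, the power rises as the true proportion moves further from 0.002, as n increases, and as α increases, which matches the bullet points above.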

(11)



(12)

Type I and Type II errors: Example


Suppose that you are a lawyer who is trying to establish that a company has been unfair to workers above 50 years old with regard to salary increases. Suppose the mean salary increase per year is 8%.

H0: µ = 0.08 ; H1: µ < 0.08

Decision | H0 is Actually True | H0 is Actually False
Fail to reject H0 (decide that the company is NOT unfair) | No error (1 - α) | Type II error, β (not suing the company when they are ACTUALLY unfair)
Reject H0 (decide that the company is unfair) | Type I error, α (suing the company when they are NOT unfair) | No error (1 - β)



(13)

Type I and Type II errors: Problem


The advertised claim for ABC batteries for mobile phones is set at 48 operating hours. The manufacturer claims that less than 0.2 percent of the company’s batteries will fail within this period.

H0: p = 0.002 ; H1: p < 0.002

Decision | H0 is Actually True | H0 is Actually False
Fail to reject H0 (decide that NOT less than 0.2% of batteries fail) | No error (1 - α) | Type II error, β (not accepting that less than 0.2% of batteries fail when it is true)
Reject H0 (decide that less than 0.2% of batteries fail) | Type I error, α (accepting that less than 0.2% of batteries fail when it is NOT true) | No error (1 - β)



(14)

Confidence Interval - Definition

• Confidence interval gives an estimated range of values

which is likely to include an unknown population parameter,

the estimated range being calculated from a given set of

sample data.

• The width of the confidence interval gives us some idea

about how uncertain we are about the unknown parameter. A

very wide interval may indicate that more data should be

collected before anything very definite can be said about the

parameter.



(15)

Confidence Interval and Hypothesis Testing Outcome

• Close relationship between confidence intervals

and hypothesis testing

• Examples:

1. For a 95% confidence interval, all values in the interval are considered plausible values for the parameter being estimated. If the value of the parameter specified by the null hypothesis is contained in the 95% interval, then the null hypothesis cannot be rejected at the 0.05 level.

2. For a 99% confidence interval, hypothesized values outside the interval are rejected at the 0.01 level.



(16)

2-tailed Confidence Interval for Problem Statement

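Based on the sampling of 5000 batteries, the 95% 2-tailed confidence interval for the proportion of batteries that will fail the specs is:

$$\hat{p} - Z_{0.025}\sqrt{\frac{\hat{p}\hat{q}}{n}} \le p \le \hat{p} + Z_{0.025}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

$$0.0014 - 1.96\sqrt{\frac{0.0014(0.9986)}{5000}} \le p \le 0.0014 + 1.96\sqrt{\frac{0.0014(0.9986)}{5000}}$$

$$0.00036 \le p \le 0.00244$$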



(17)

1-tailed Confidence Interval for Problem Statement

The 95% 1-tailed confidence interval is:

$$0 \le p \le \hat{p} + Z_{0.05}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

$$0 \le p \le 0.0014 + 1.65\sqrt{\frac{0.0014(0.9986)}{5000}}$$

$$0 \le p \le 0.00227$$

Since the null hypothesis proportion value is 0.002 and lies within the interval, we cannot reject H0 at the 0.05 level of significance.
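Both intervals can be reproduced with a few lines of Python. This is a sketch using the standard library only; the variable names are illustrative, and the z-values are computed rather than rounded to 1.96 and 1.65, so the last digit may differ slightly from hand calculation.

```python
from math import sqrt
from statistics import NormalDist

x, n = 7, 5000
p_hat = x / n                                # 0.0014
se = sqrt(p_hat * (1 - p_hat) / n)           # standard error based on p-hat

z_two = NormalDist().inv_cdf(0.975)          # about 1.96 (2-tailed, 95%)
z_one = NormalDist().inv_cdf(0.95)           # about 1.645 (1-tailed, 95%)

two_sided = (p_hat - z_two * se, p_hat + z_two * se)
one_sided_upper = p_hat + z_one * se         # interval is 0 <= p <= upper

p0 = 0.002                                   # value specified by H0
print(f"2-tailed: {two_sided[0]:.5f} <= p <= {two_sided[1]:.5f}")
print(f"1-tailed: 0 <= p <= {one_sided_upper:.5f}")
print("H0 value inside 1-tailed interval:", 0 <= p0 <= one_sided_upper)
```

Because 0.002 lies inside the interval, the interval approach agrees with the hypothesis test: H0 cannot be rejected.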



(18)

Learning Outcomes

• What are Proportions and its significance

• Hypothesis Testing a Proportion

• Assumptions when testing a claim about

Proportion

• Type I and Type II error

• Confidence Interval and Hypothesis Testing





P11 – TRUE OR NOT TRUE








True or Not True

It has been reported that Singapore youths (aged 15 – 24) spend the longest hours daily in the region on instant messaging. Attached are the data collected from 10 youths (age 15 – 24) from South Korea, a country known for superb IT infrastructure and high internet usage rates

P11-Students response.xls

How can we prove whether the report is true or not true with reasonable statistical confidence, assuming the populations under comparison have equal variances?



P11

True or Not True



(2)

Testing Between 2 Samples

• There are many cases where researchers wish to compare 2 sample means. For example:

– Is there a difference between the average lifetimes of 2 different brands of tires?

– Did the students from college A score better in a common exam compared with those from college B?

– How does the mean selling price of 4-room flats in one town compare with that in another?

– Have the soldiers’ fitness levels improved after training?

• To answer the above questions, we would collect data for 2 samples and compare them by testing to see if there is a statistically significant difference between the means.



(3)

Case 1: 2 Sample z-test

• If we sample from 2 normal populations that are independent of each other (meaning no relationship between the subjects in each sample), and the standard deviation of each variable is known, then we use the z-test for comparing the 2 means:

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

• If the population standard deviations are unknown, both sample sizes must be 30 or more, and σ is replaced with the sample standard deviation, s.



(4)

z-Test : Example 1

• The same physical fitness test was given to a group of 100 scouts and 144 guides. The maximum score was 30. The guides obtained a mean score of 26.81 and the scouts obtained a mean score of 27.53. Assuming that the fitness scores are normally distributed with a common population standard deviation of 3.48, test at the 95% confidence level whether the guides did not do as well as the scouts in the fitness test.

Solution:

– Let X1 be the guides’ score and let the population mean be µ1. X1 ~ N(µ1, σ²)

– Let X2 be the scouts’ score and let the population mean be µ2. X2 ~ N(µ2, σ²)

– Given σ = 3.48; n1 = 144, n2 = 100, X̄1 = 26.81, X̄2 = 27.53; 95% Confidence Level



(5)

z-Test : Example 1

Solution:

Hypothesis:

H0: µ1 - µ2 = 0

H1: µ1 - µ2 < 0 (1-tailed test since we are interested in finding out if the guides did not perform as well as the scouts)

At α = 0.05, the critical z-value is -1.645

Using:

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{26.81 - 27.53}{\sqrt{\dfrac{3.48^2}{144} + \dfrac{3.48^2}{100}}} = -1.589$$

Since the calculated z is > -1.645, we do not reject H0. Thus there is not enough evidence, at the 5% level, that the guides did not perform as well as the scouts in the fitness tests.
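The scouts and guides calculation can be verified with a short Python sketch (standard library only). The function name `two_sample_z` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2):
    """z statistic for the difference of 2 means with known population sigmas."""
    return (mean1 - mean2) / sqrt(sigma1**2 / n1 + sigma2**2 / n2)

# Sample 1 = guides, sample 2 = scouts (values from Example 1)
z = two_sample_z(26.81, 27.53, 3.48, 3.48, 144, 100)
z_crit = NormalDist().inv_cdf(0.05)      # left-tailed critical value at alpha = 0.05
p_value = NormalDist().cdf(z)            # left-tail area

print(f"z = {z:.3f}, critical value = {z_crit:.3f}, p-value = {p_value:.3f}")
print("Reject H0" if z < z_crit else "Do not reject H0")
```

The p-value comes out slightly above 0.05, which is consistent with the decision not to reject H0.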



(6)

Case 2: 2 Sample t-test

• If we sample from 2 independent normal populations with unknown variances and the sample sizes are small, then we use the 2-sample t-test for comparing the 2 means:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where the degrees of freedom are equal to n1 + n2 – 2

• The above t-test assumes that the variances of the populations are equal.



(7)

Is it reasonable to assume same variances between

populations for t-Test?

• In the t-test, the population variances are unknown so often we do not know if the variances can be assumed to be equal.

• If the population variances are very different, the 2-sample t-test may not be accurate as the results may be influenced by the difference in the variances.

• However, the 2-sample t-test is not overly sensitive to small differences between population variances, so most of the time this test can be used.



(8)

2 Sample t-Test: One-sided Vs Two-sided

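Left-Tailed Test | Right-Tailed Test | Two-Tailed Test
H0: µ1 - µ2 ≥ 0 | H0: µ1 - µ2 ≤ 0 | H0: µ1 - µ2 = 0
H1: µ1 - µ2 < 0 | H1: µ1 - µ2 > 0 | H1: µ1 - µ2 ≠ 0
Reject H0 in the left tail (area .10) | Reject H0 in the right tail (area .10) | Reject H0 in either tail (area .05 each)

[Figure: three t-distribution curves centred at 0 with the rejection regions shaded as described in the last row]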



(9)

P-Value Method for Hypothesis Testing

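• Hypothesis testing commonly uses a level of significance α of 0.05 or 0.1, which is the probability of a Type I error.

• The P-value is the calculated probability of getting a sample statistic at least as extreme as the one observed, assuming the null hypothesis is true. It is the actual area under the distribution curve beyond the test statistic.

Example

Use the NORMSDIST(-1) function in Excel to find the one-sided P-value of the standard normal distribution (z = 1.0):

P(Z ≥ 1.0) = P(Z ≤ -1.0) = 0.159

[Figure: standard normal curve with the tail area beyond Z = 1.0 shaded; P-value = 0.159]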
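For readers without Excel, the same tail area can be obtained in Python. This is a small sketch with the standard library's `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()                  # mean 0, standard deviation 1

p_left = std_normal.cdf(-1.0)              # same quantity as Excel NORMSDIST(-1)
p_right = 1 - std_normal.cdf(1.0)          # upper-tail area beyond z = 1.0
p_two_sided = 2 * p_left                   # both tails together

print(f"one-sided p-value = {p_left:.3f}")
print(f"two-sided p-value = {p_two_sided:.3f}")
```

By symmetry of the normal curve the left-tail and right-tail areas are equal, which is why NORMSDIST(-1) gives the one-sided P-value for z = 1.0.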



(10)

Interpretation of P-Value

There are 2 ways to interpret the p-value. Assuming the null hypothesis is true, the p-value is the probability of:

• Getting a test statistic like the one calculated or an even more extreme value

• Rejecting the null hypothesis when it is true, if the calculated test statistic were used as the rejection cut-off (i.e. the smallest α at which H0 would be rejected)

It answers this question – ‘To what extent does the data support the null hypothesis?’

The smaller the p-value, the less the data supports the null hypothesis.



(11)

Decision Making Based on P-Value

α Criteria

• If p-value is smaller than or equal to level of significance α, reject null hypothesis.

• If p-value is greater than level of significance α, do not reject null hypothesis



(12)

Conventional Interpretation of P-values

• P > 0.10: Result is not significant

• 0.05 < P < 0.10: Result is marginally significant

• 0.01 < P < 0.05: Result is significant

• P < 0.01: Result is highly significant

This is a rule-of-thumb interpretation without the need to set an α value.



(13)

Proposed Solution

Let Sample 1 be Singapore youths’ instant messaging time, and Sample 2 be Korean youths’ instant messaging time

Hypothesis:

H0: µ1 - µ2 = 0

H1: µ1 - µ2 > 0 (1-tailed test since the report claims that Singapore youths spend longer on instant messaging)



(14)

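Proposed Solution

Given n1 = 20, n2 = 10, s1 = 100.85, s2 = 24.024, X̄1 = 268, X̄2 = 166.6

Using:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

t-Statistic = 3.11 and p-value = 0.00214 [Excel function TDIST(3.11,28,1)]

Since the p-value is much smaller than 0.05, reject the null hypothesis, i.e. the data support the report that Singapore youths spend more time on instant messaging than South Korean youths.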
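The pooled t-statistic can be reproduced from the summary figures above. The sketch below uses only the Python standard library, so instead of computing the p-value it compares t with the one-tailed 5% critical value for 28 degrees of freedom (1.701, read from a standard t-table). The function name `pooled_t` is illustrative.

```python
from math import sqrt

def pooled_t(mean1, mean2, s1, s2, n1, n2):
    """2-sample t statistic assuming equal population variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    t = (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

# Sample 1 = Singapore youths, sample 2 = South Korean youths
t, df = pooled_t(268, 166.6, 100.85, 24.024, 20, 10)

T_CRIT = 1.701   # t-table value for alpha = 0.05 (1-tailed), df = 28
print(f"t = {t:.2f}, df = {df}, critical value = {T_CRIT}")
print("Reject H0" if t > T_CRIT else "Do not reject H0")
```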



(15)

Learning Outcomes

• To test the hypotheses about the difference

between two population means

• Test Statistic for the difference between two

means (independent normal populations)

– z-Test (known variances)

– 2 sample t-Test (unknown equal variances)

• p-Value calculation and significance

• 2 Sample t-Test





P12 – WHO TYPES FASTER







Who Types Faster

Some people believe that women in general can type faster than men since there are more female administrative staff compared to male ones. Others think that men have better hand-eye coordination and thus can type faster. Carry out a hypothesis test to determine if there is basis to further investigate those beliefs. You may carry out an experiment in your class and collect the relevant data by making use of typing tests provided in the following website: http://www.powertyping.com/typing_test/typing_test.shtml

A second thought about typing speed is that typing an article containing non-English names will decrease typing speed. To find out whether this is true, conduct another study using hypothesis testing. Should you use the same test statistic for the two hypothesis tests? That is, is there any difference between the two studies in relation to assumptions about the population distribution, the relationship of the samples and the parameter under testing?




P12 – Who Types Faster



Recall: Testing of 2 Sample Means

• If we are comparing the means of 2 independent normal populations with unknown variances and the sample sizes are small, the 2-sample t-test statistic can be applied:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where the degrees of freedom are equal to n1 + n2 – 2

• This t-test requires that the variances of the populations be equal.



What if population variances are different?

• Often there is reason to suspect that the variances of 2 populations may be very different. For example, the output of a newly set-up process vs. a long-run stable process.

• We may examine the sample variances. As a rule of thumb, if one sample variance is 4 or more times the other, then we cannot assume the population variances to be equal.

• In this case, a more appropriate test, called the Smith-Satterthwaite Test, can be used. It is also known as the 2-sample t-test with unequal variances.



Smith-Satterthwaite Test (2-Sample T-test with

Unequal Variances)

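• When comparing the means of 2 independent samples from normal populations whose variances are unknown and unequal, use the following test statistic:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

which is a random variable that approximately follows a t-distribution with degrees of freedom equal to (round down to nearest integer):

$$\text{DOF} = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$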



T-Test with Unequal Variances: An Example

A researcher wants to determine whether the salaries of professional nurses employed in private hospitals are higher than those employed by government hospitals.

Data collected:

Statistic | Private | Government
Sample mean | X̄1 = $26800 | X̄2 = $25700
Sample standard deviation | s1 = $800 | s2 = $400
Sample size | n1 = 10 | n2 = 8

At the 99% confidence level, can it be concluded that the private hospitals pay more than the government ones?



T-Test with Unequal Variances: An Example

Let μ1 and μ2 be the average salaries of nurses in private and government hospitals respectively.

H0: μ1 = μ2 and H1: μ1 > μ2 (Right-tailed t-test)

Assuming variances are not equal,

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{26800 - 25700}{\sqrt{\dfrac{800^2}{10} + \dfrac{400^2}{8}}} = 3.80, \qquad \text{DOF} = 13$$

P-value is 0.0011, which is smaller than α = 0.01. Hence reject the null hypothesis, i.e. private hospitals pay nurses more than government ones.
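The t-statistic and degrees of freedom for the nurses example can be checked with the sketch below (Python standard library only). The function name `welch_t` is an illustrative choice, and the degrees of freedom use the usual Satterthwaite approximation, rounded down as stated earlier.

```python
from math import floor, sqrt

def welch_t(mean1, mean2, s1, s2, n1, n2):
    """2-sample t statistic and rounded-down DOF for unequal variances."""
    v1, v2 = s1**2 / n1, s2**2 / n2          # variance of each sample mean
    t = (mean1 - mean2) / sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, floor(dof)

# Sample 1 = private hospitals, sample 2 = government hospitals
t, dof = welch_t(26800, 25700, 800, 400, 10, 8)
print(f"t = {t:.2f}, DOF = {dof}")
```

The p-value can then be obtained from a t-table or from Excel, for example TDIST(3.80,13,1) for this right-tailed test.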



Dependent or Related Samples

• Sometimes, samples under study are related or they

contain the same subjects but under different conditions.

In this case, the samples are not independent of each

other and we cannot use 2-sample t-test.

• Examples of dependent samples:

– Performance of workers before and after a training program

– Effectiveness of a drug on patients

– Comparison of IQ scores of pairs of children matched with the

same age (to block out the differences in scores due to age)



Independent vs. Related Populations

Independent | Related
Independent data sources | Same data source: paired/matched, or repeated measures (before/after)
Use the difference between the 2 sample means, X̄1 - X̄2 | Use the difference between each pair of observations, Dn = X1n - X2n



Two Related Populations:

Paired Sample t-Test

• The paired sample t-test is used to test means of 2

related populations

– Paired or Matched samples

– Repeated Measures (Before/After)

• Eliminates variation among subjects in the same sample

• Assumptions

– If the sample is small, the difference scores should be normally distributed

– Both populations are normally distributed

– If not normal, can be approximated by a normal distribution (n1 ≥ 30 and n2 ≥ 30)



Paired Sample t-Test Statistic

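In the paired sample t-test, we test the mean of the differences between each pair of subjects.

Sample mean of the differences:

$$\bar{D} = \frac{\sum_{i=1}^{n} D_i}{n}$$

Sample standard deviation of the differences:

$$S_D = \sqrt{\frac{\sum_{i=1}^{n} (D_i - \bar{D})^2}{n - 1}}$$

The test statistic is:

$$t = \frac{\bar{D} - \mu_D}{S_D / \sqrt{n}} \quad \text{with df} = n - 1$$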



Paired-Sample t-Test: An Example

To ascertain the effectiveness of a training program, the following test score data is collected:

Name | Before Training | After Training
Sam | 85 | 94
Tamika | 94 | 87
Brian | 78 | 79
Mike | 87 | 88

At the 90% confidence level, determine the effectiveness of the training.



Paired-Sample t-Test:

Calculation of Test Statistic

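Name | Before | After | Difference
Sam | 85 | 94 | -9
Tamika | 94 | 87 | 7
Brian | 78 | 79 | -1
Mike | 87 | 88 | -1
Total | | | -4

$$\bar{D} = \frac{-4}{4} = -1$$

$$S_D = \sqrt{\frac{(-9 - (-1))^2 + (7 - (-1))^2 + (-1 - (-1))^2 + (-1 - (-1))^2}{4 - 1}} = \sqrt{\frac{128}{3}} = 6.53$$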



Paired Sample t-Test : Solution

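H0: µD ≥ 0 (D = B - A)

H1: µD < 0

α = 0.10

df = 4 - 1 = 3

Critical Value(s): t = -1.6377 (reject H0 if t falls in the left tail of area .10)

Test Statistic:

$$t = \frac{\bar{D} - \mu_D}{S_D / \sqrt{n}} = \frac{-1 - 0}{6.53 / \sqrt{4}} = -0.306$$

Decision: Do not reject H0 at α = 0.10, since -0.306 > -1.6377.

Conclusion: There is no evidence the training is effective.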
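The paired calculation can be reproduced directly from the before and after scores. This is a minimal Python sketch with the standard library; the critical value -1.6377 is the one quoted in the solution for df = 3 and α = 0.10.

```python
from math import sqrt
from statistics import mean, stdev

before = [85, 94, 78, 87]     # Sam, Tamika, Brian, Mike
after = [94, 87, 79, 88]

diffs = [b - a for b, a in zip(before, after)]   # D = B - A
n = len(diffs)
d_bar = mean(diffs)           # mean of the differences
s_d = stdev(diffs)            # sample standard deviation (divides by n - 1)

t = (d_bar - 0) / (s_d / sqrt(n))                # mu_D = 0 under H0
T_CRIT = -1.6377                                 # left-tailed, df = 3, alpha = 0.10

print(f"D-bar = {d_bar}, S_D = {s_d:.2f}, t = {t:.3f}")
print("Reject H0" if t < T_CRIT else "Do not reject H0")
```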



Problem Statement

• “Some people believe that women in general can type faster

than men since there are more female administrative staff

compared to male ones. Others think that men have better

hand-eye coordination and thus type faster. Carry out a

hypothesis test to determine if there is basis to further

investigate those beliefs.”

• How should you formulate H0 and H1?

• Do you use z test or t-test?

• What assumptions do you make in this test?



Solution: 2 Sample t-Test with unequal variances

Let µ1 be the mean typing speed (words/min) for females

Let µ2 be the mean typing speed (words/min) for males

Test Hypotheses:

H0: µ1 - µ2 = 0

H1: µ1 - µ2 ≠ 0 (2-tailed test as the results may show men type faster!)

Are the population variances known? Are they the same?

No, we do not know what the population variances are or whether they are the same or not. Let us apply the 2-sample t-test with unequal variances.



Calculations: 2 Sample t-Test with unequal variances

Assume the following has been calculated from data collected:

X̄1 = 31.42

X̄2 = 27.64

s1 = 6.35

s2 = 4.54

n1 = n2 = 10

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{31.42 - 27.64}{\sqrt{\dfrac{6.35^2}{10} + \dfrac{4.54^2}{10}}} = 1.53, \qquad \text{DOF} = 16$$

• From Excel function TDIST(1.53,16,2), P(T > t or T < -t) ≈ 0.15 (p-value)

• H0 cannot be rejected at level of significance α = 0.1



Paired t-Test

• How can we prove if typing an article containing non-English names affects the typing speed?

• In this case, since different people have different typing speeds, we should compare the speed of the same person typing two articles, one with and the other without non-English names.

• The paired t-test is more sensitive here than the 2-sample t-test because it works on the within-person differences (Xi1 – Xi2), which removes the variation in typing speed between different people.

Person | Article 1 typing speed | Article 2 typing speed
1 | X11 | X12
2 | X21 | X22
3 | X31 | X32
4 | X41 | X42



Summary of Learning Outcomes

• Perform t-test for 2 independent samples with unequal variances

• Understand the difference between independent and related (dependent) populations

• Perform paired t-test for 2 dependent samples





P13 – TEST IT FOR FAIRNESS








Test it for Fairness

You have just developed a program that simulates a six-sided die. You are thinking of selling this as a product commercially as an embedded software application or an online tool. You want to test whether the die is really fair before you go ahead to launch it. How would you go about conducting a statistical test to decide whether the die program you have developed is fair?



P13

Test it for Fairness



(2)

Test a distribution for Goodness-of-Fit using

Chi-Square

• Previously, we used statistical hypothesis to test

single population parameters.

• For today’s problem, we use statistical hypothesis to

determine if a population has a specified theoretical

distribution.

• This test is based on how good a fit we have between:

– the frequency of occurrence of observations in an observed

sample

– the expected frequencies obtained from the hypothesized

distribution



(3)

The Multinomial Experiment

• The experiment consists of n identical, independent trials.

• The outcome of each trial falls in one of k categories.

• The probabilities associated with the k outcomes, denoted by π1, π2, …, πk, remain the same from trial to trial. Since there are only k possible outcomes, we have π1 + π2 + … + πk = 1

• The experimenter records the values of o1, o2, …, ok, where oj (j = 1, 2, …, k) is equal to the observed number of trials in which the outcome is in category j.

Note that n = o1 + o2 + … + ok



(4)

The χ² Test Statistic

The χ² test statistic measures the amount of disagreement between the observed data and the expected data.

$$\chi^2 = \sum_j \frac{(o_j - e_j)^2}{e_j}$$

where the sum is over all categories, with oj being the observed frequency count and ej the expected frequency count in category j.



(5)

Test Statistic and its Applicability

Test Statistic: $\chi^2 = \sum_j (o_j - e_j)^2 / e_j$

with degrees of freedom equal to the number of categories minus 1 (right-tailed test), where

oj = observed frequency in category j

ej = expected frequency in category j

Assumptions for Chi-Square Goodness-of-Fit Test

1. The experiment satisfies the properties of a multinomial experiment.

2. No expected cell count, ej, is less than 5



(6)

Typical χ² Density Curve

The curve begins at zero and is skewed right. As the degrees of freedom increase, the distribution stretches out along the horizontal axis.



(7)

Step 1: State the Null and Alternative Hypotheses

H0: Newly developed die is fair

H1: Newly developed die is not fair



(8)

Step 2: Compute Expected (Ei) and Observed (Oi) Frequencies



(9)

Step 3: Decide on Rejection Criterion

Degrees of freedom

= Number of classes – number of restrictions

= 6 – 1 = 5

Test at the 5% significance level:

Reject H0 if χ²(calc) > χ²(5%, 5), i.e. if χ²(calc) > 11.07



(10)

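χ² Distribution with 5 Degrees of Freedom

[Figure: χ² density curve with 5 degrees of freedom; the right-tail rejection region of area α = 0.05 lies beyond χ²(0.05, 5) = 11.07]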



(11)

Step 4: Conclusion

Since χ²(calc) = 2.12 < 11.07, H0 is not rejected.

Conclusion:

At the 5% significance level there is no evidence that the newly developed die is unfair, so you can proceed to release it commercially.
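Steps 1 to 4 can be sketched in Python as follows. The observed counts here are hypothetical (120 simulated throws invented for illustration, not the data behind the 2.12 above); only the critical value 11.07 comes from the slides.

```python
# Hypothetical observed counts for faces 1 to 6 from 120 simulated throws
observed = [18, 23, 16, 21, 22, 20]

n = sum(observed)
expected = [n / 6] * 6                    # a fair die gives n/6 per face

# Assumption of the test: no expected count is below 5
assert min(expected) >= 5

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1                   # number of classes - 1
CHI2_CRIT = 11.07                         # chi-square table, alpha = 0.05, df = 5

print(f"chi-square = {chi2:.2f}, df = {dof}, critical value = {CHI2_CRIT}")
print("Reject H0 (die not fair)" if chi2 > CHI2_CRIT else "Do not reject H0")
```

In practice the observed counts would come from running the die program many times and tallying each face.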



(12)

Learning Outcomes

The Chi-Square Goodness-of-Fit Test

• Understand the basic properties of the multinomial experiment

• Know how to calculate the expected number of outcomes to fall in categories of a multinomial experiment

• Know the assumptions required for a chi-square goodness-of-fit test

• Know how to conduct a chi-square goodness of fit test





P14 – Testing with Signs








Testing with Signs

Your statistics facilitator thinks that RP students have less sleep than a typical person in Singapore. Can you prove whether this is true with reasonable statistical confidence, if the median duration of sleep is known to be 7 hours in Singapore? Note that the distribution of sleep duration cannot be assumed normal. Consider a small sample size from your class and perform a sign test to test the claim.



P14

Testing with Signs



(2)

Nonparametric Tests

• Statistical tests such as z, t and F tests are called

parametric tests

• Parametric tests require the assumption that

sampling populations are normally distributed

• In situations where population distribution is not

normal, nonparametric (or distribution-free) tests

can be used



(3)

Pros and Cons of Nonparametric Tests

Advantages

• Variables under test need not be normally distributed

• Can be used to test hypotheses that do not involve population parameters, such as randomness of sample, relationship between 2 samples

• Computations are generally easier compared with parametric tests

Disadvantages

• Less sensitive than parametric tests when normality assumption is met. Thus, larger differences are needed before null hypothesis can be rejected

• Use less information than parametric tests

• Less efficient in the sense that larger sample size is required to overcome loss of information



(4)

Sign Test

• The sign test is used to test the value of a population median, based on a sample

• An alternative to 1-sample t-test or paired t-test

• Can be used for small sample size

• Assigns a ‘+’ to sample values above the

hypothesized median value and a ‘-’ to sample

values below the median

• Does not account for the difference between

values in the data and the median



(5)

Procedure in Sign Test

1. From the problem, identify the claim

2. State the null (H0) and alternative (H1) hypotheses and the significance level α

3. Assign signs:

– For a single-sample test, compare each value with the hypothesized median. If the value is larger, replace it with a ‘+’ sign. If it is smaller, replace it with a ‘-’ sign. If equal, discard the value.

– For a paired-sample test, subtract each after value from the before value and indicate the difference with a ‘+’ or ‘-’ sign or 0. Discard the ‘0’ value(s).

4. Count the number of ‘+’ (r) and the total number of signs (n)

5. Compute the p-value based on the binomial distribution with n, r and p = 0.5: P(X ≤ r) if H1 contains ‘<’, P(X ≥ r) if H1 contains ‘>’, while the p-value for a 2-sided test is twice the smaller of these two probabilities

6. If the probability (p-value) is smaller than the significance level α, reject the null hypothesis. Conclude appropriately.



(6)

Example 1: One-sample Sign Test

A researcher read that the median age for

viewers of the Singapore Idol show is 20 years.

To test the claim, 80 viewers were surveyed, and

30 were under the age of 20 years old while

exactly 4 were 20 years old.

At α = 0.05, test the claim. Give one reason why

an advertiser might like to know the result of this

study.



(7)

Example 1: Solution

The claim under test is that the median age of viewers (ū) is 20 years.

Null Hypothesis: H0: ū = 20

Alternative Hypothesis: H1: ū ≠ 20

Letting n = 76 (4 values are discarded), x = 30, p = 0.5, the probability of getting 30 or fewer ‘-’ is:

$$P(X \le 30) = \sum_{r=0}^{30} \binom{76}{r} (0.5)^r (1 - 0.5)^{76 - r} = 0.0423$$

P-value is 0.0423 × 2 = 0.0846 as this is a 2-tailed test.

Since the p-value is greater than α = 0.05, there is not enough evidence to reject the null hypothesis, so we do not reject the claim that the median age of viewers is 20 years.



(8)

Example 2: Paired-sample Sign Test

The following are the average weekly losses of

worker-hours due to accidents at 13 industrial

sites before and after a certain safety program

was put into operation:

23 and 35, 41 and 30, 20 and 8, 28 and 35, 45 and 24, 83 and 77, 26 and 24, 17 and 11, 55 and 58, 29 and 25, 15 and 10, 28 and 22, and 37 and 35.

Use the 0.05 level of significance to test whether the safety program is effective.



(9)

Example 2: Solution

Let ūd be the median difference in loss hours before and after the program (before minus after).

Null Hypothesis: H0: ūd = 0 (safety program is not effective)

Alternative Hypothesis: H1: ūd > 0 (safety program is effective)

The 13 sample pairs yield: - + + - + + + + - + + + +

Letting n = 13, x = 10, p = 0.5, the probability of getting 10 or more ‘+’ is:

$$P(X \ge 10) = 1 - P(X \le 9) = 1 - \sum_{r=0}^{9} \binom{13}{r} (0.5)^r (1 - 0.5)^{13 - r} = 0.0461$$

P-value is 0.0461 (1-tailed test)

Since the p-value is smaller than α = 0.05, reject the null hypothesis and conclude that the safety program is effective.



(10)

Problem Statement: Solution Example

Suppose data (average sleep duration in hours) from 16 students are collected as follows:

6.5 7 6 5

5.5 5 6 6

6.5 7 8 6.5

5.5 5 7.5 6



(11)


1. The claim is that the median sleep duration of RP students is less than 7 hours every night.

2. Null Hypothesis:

H0: ū = 7

Alternative Hypothesis:

H1: ū < 7

3. Converting values into positive and negative signs, we have 2 ‘+’, 12 ‘-’ and 2 discarded values (ties)



(12)


4. Letting n = 14, x = 2, p = 0.5, the probability of getting 2 or fewer ‘+’ is

$$P(X \le 2) = \sum_{r=0}^{2} \binom{14}{r} (0.5)^r (1 - 0.5)^{14 - r} = 0.0065$$

The answer is calculated from the Excel function BINOMDIST(2,14,0.5,1)

5. Since the p-value is 0.0065 < α = 0.01, reject the null hypothesis

6. At the 0.01 level of significance, the data support the claim that RP students have less than 7 hours of sleep per night.
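The sign test procedure can be sketched in Python with the standard library. The function names `binom_cdf` and `sign_test_p_value` are illustrative, and the string labels "less", "greater" and "two-sided" are an assumed convention for the form of H1.

```python
from math import comb

def binom_cdf(r, n, p=0.5):
    """P(X <= r) for a binomial(n, p) variable."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r + 1))

def sign_test_p_value(values, median0, alternative):
    """Sign test p-value; r counts the '+' signs and ties are discarded."""
    plus = sum(v > median0 for v in values)
    minus = sum(v < median0 for v in values)
    n = plus + minus
    lower = binom_cdf(plus, n)              # P(X <= r), for H1 containing '<'
    upper = 1 - binom_cdf(plus - 1, n)      # P(X >= r), for H1 containing '>'
    if alternative == "less":
        return lower
    if alternative == "greater":
        return upper
    return min(1.0, 2 * min(lower, upper))  # two-sided

# Problem statement data: sleep duration (hours) of 16 students
sleep = [6.5, 7, 6, 5, 5.5, 5, 6, 6, 6.5, 7, 8, 6.5, 5.5, 5, 7.5, 6]
p_sleep = sign_test_p_value(sleep, 7, "less")

# Example 2: worker-hours lost before and after the safety program
before = [23, 41, 20, 28, 45, 83, 26, 17, 55, 29, 15, 28, 37]
after = [35, 30, 8, 35, 24, 77, 24, 11, 58, 25, 10, 22, 35]
diffs = [b - a for b, a in zip(before, after)]
p_safety = sign_test_p_value(diffs, 0, "greater")

print(f"sleep p-value = {p_sleep:.4f}, safety p-value = {p_safety:.4f}")
```

Both results agree with the values obtained from BINOMDIST in the worked solutions.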



(13)

Learning Outcomes

The Sign Test (Nonparametric Test)

• Understand when to apply nonparametric tests

• Know how to apply the Sign Test for small sample size where the normality assumption is not valid

• Know that the sign test is used to test population median for both one-sample and paired-sample tests





P15 – CHOCOLATE ADVERTISEMENTS








Chocolate Advertisements

You are working in an advertising company. Recently, your company won a contract to design a poster advertisement to attract people to try out a new series of chocolates newly launched by your customer. To kick off the project, you have been tasked to look into the current advertisement posters by other companies and evaluate their attractiveness. You have identified 3 posters below. Conduct an investigation to evaluate if there is any significant difference in the attractiveness of the 3 advertisement posters.

Poster 1


Poster 2

Poster 3



P15 – Chocolate Advertisements



ANOVA

What is ANOVA?

Analysis of Variance (ANOVA) provides the tools to compare the means of several

populations with a single test.

The role of ANOVA is to perform a numerical test of significance that will test the equality

of all the means.



Underlying Assumptions for ANOVA

The F distribution is also used for testing whether two or more

sample means are from the same or equal populations.

This technique is called Analysis of Variance or ANOVA.

ANOVA requires the following conditions:

– The sampled populations follow the normal distribution.

– The populations have equal standard deviations.

– The samples are randomly selected and are

independent.



The F-Statistic

ANOVA is a procedure that compares the variability between the samples to the variability within the samples by computing the ratio

$$F = \frac{\text{variance between the samples}}{\text{variance within the samples}}$$

The F-statistic is a numerical measure of how much the sample means differ.



Characteristics of the F Distribution

• Each member of the family is determined by two

parameters: the numerator degrees of freedom and

the denominator degrees of freedom.

• F cannot be negative, and is a continuous

distribution.

• The F distribution is positively skewed.

• Its value ranges from 0 to ∞. As F → ∞, the curve approaches the x-axis.



Procedure for the Analysis of Variance

• Null Hypothesis:

– Population means are the same.

• Alternative Hypothesis:

– At least one of the means is different.

• The test statistic follows the F distribution.

• Decision rule is to reject the null hypothesis if

Fcalculated > Fcritical




• For k populations sampled, the numerator degrees of

freedom is (k – 1).

• For a total of n observations, the denominator degrees

of freedom is (n – k).

• The test statistic is computed by:

$$F = \frac{MS(Tr)}{MSE} = \frac{SS(Tr)/(k - 1)}{SSE/(n - k)}$$

where MS(Tr) is the Mean Square for Treatments and MSE is the Mean Square Error




• SS(Tr) is the Treatment Sum of Squares.

$$SS(Tr) = \sum \frac{T_c^2}{n_c} - \frac{(\sum X)^2}{n}$$

where

Tc is the column total

nc is the number of observations in each column

ΣX is the sum of all the observations

n is the total number of observations




• SST is the Total Sum of Squares

$$SST = \sum X^2 - \frac{(\sum X)^2}{n}$$

• SSE is the Sum of Squares Error

$$SSE = SST - SS(Tr)$$



Example 1

Specializing in meals for the elderly, a restaurant recently introduced vegetarian porridge at three of its branches.

Data on the number of bowls of vegetarian porridge ordered were collected over a period of 5 days. Assuming a 5% level of significance, determine if there is a difference in the mean number of bowls ordered per day at the three branches.

Day | Branch 1 | Branch 2 | Branch 3
Day 1 | 13 | 10 | 18
Day 2 | 12 | 12 | 16
Day 3 | 14 | 13 | 17
Day 4 | 12 | 11 | 17
Day 5 | 14 | 13 | 17



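Example 1: Proposed Solution

• SST:

$$SST = \sum X^2 - \frac{(\sum X)^2}{n} = 2999 - \frac{209^2}{15} = 87$$

• SS(Tr):

$$SS(Tr) = \sum \frac{T_c^2}{n_c} - \frac{(\sum X)^2}{n} = \frac{65^2}{5} + \frac{59^2}{5} + \frac{85^2}{5} - \frac{(209)^2}{15} = 74.2$$

• SSE: SSE = SST – SS(Tr)

= 87 – 74.2

= 12.8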



Example 1: Proposed Solution (continued)

• Step 1:

H0: The mean number of bowls of porridge sold is the same at all 3 branches

H1: At least one branch has a different mean number of bowls of porridge sold

• Step 2:

– H0 is rejected if F > Fcritical

– Fcritical = 3.89 at the 5% level of significance, as there are k – 1 = 3 – 1 = 2 df in the numerator and n – k = 15 – 3 = 12 df in the denominator



Example 1: Proposed Solution (continued)

• Calculating the value of F:

$$F = \frac{\mathrm{SS(Tr)}/(k-1)}{\mathrm{SSE}/(n-k)} = \frac{74.2/(3-1)}{12.8/(15-3)} = 34.77$$

• The decision is to reject the null hypothesis because F = 34.77 exceeds Fcritical = 3.89.

• The mean number of bowls of vegetarian porridge sold at the three locations is not the same.
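The whole of Example 1 can be reproduced from the raw data with a short script. This is a sketch rather than a prescribed method. The variable names are assumptions, and the critical value 3.89 is taken from the example as given. Because the script does not round intermediate results, it gives SST ≈ 86.93, SS(Tr) ≈ 74.13 and F = 34.75, slightly different from the rounded figures on the slide. The conclusion is unchanged.

```python
# Bowls of vegetarian porridge ordered per day (Example 1 data)
branches = {
    "Branch 1": [13, 12, 14, 12, 14],
    "Branch 2": [10, 12, 13, 11, 13],
    "Branch 3": [18, 16, 17, 17, 17],
}

k = len(branches)                                   # number of populations
n = sum(len(v) for v in branches.values())          # total observations
total = sum(sum(v) for v in branches.values())      # sum of X = 209
sum_sq = sum(x * x for v in branches.values() for x in v)  # sum of X^2 = 2999

correction = total ** 2 / n
sst = sum_sq - correction
ss_tr = sum(sum(v) ** 2 / len(v) for v in branches.values()) - correction
sse = sst - ss_tr

f_calc = (ss_tr / (k - 1)) / (sse / (n - k))
F_CRITICAL = 3.89  # 5% level, 2 and 12 df, as stated in the example

print(f"SST = {sst:.2f}, SS(Tr) = {ss_tr:.2f}, SSE = {sse:.2f}")
print(f"F = {f_calc:.2f}, reject H0: {f_calc > F_CRITICAL}")
```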



Inferences About Treatment Means

When the null hypothesis that the means are

equal is rejected, it may be necessary to know

which treatment means differ.

One of the simplest procedures to determine

this is through the use of confidence intervals.



Confidence Interval for the

Difference Between Two Means

$$(\bar{X}_1 - \bar{X}_2) \pm t\sqrt{\mathrm{MSE}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

where

• t is obtained from the t table with degrees of freedom (n - k),

• MSE (Mean Square Error) = [SSE/(n - k)]
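A minimal sketch of this interval in Python follows. The function names and the input values are hypothetical. The t value 2.080 is assumed to have been read from a t table for a 95% interval with n - k = 24 - 3 = 21 degrees of freedom.

```python
import math

def difference_ci(mean_1, mean_2, n_1, n_2, mse, t_value):
    """Confidence interval for the difference between two treatment means."""
    margin = t_value * math.sqrt(mse * (1 / n_1 + 1 / n_2))
    diff = mean_1 - mean_2
    return diff - margin, diff + margin

def means_differ(interval):
    """The pair of means is judged different when zero lies outside the interval."""
    low, high = interval
    return not (low <= 0 <= high)

# Hypothetical inputs: margin = 2.080 * sqrt(4.0 * (1/8 + 1/8)) = 2.08
ci = difference_ci(mean_1=20.0, mean_2=15.0, n_1=8, n_2=8, mse=4.0, t_value=2.080)
print(ci, means_differ(ci))
```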



Example 2

Continuing from Example 1, develop a 95% confidence

interval for the difference in the mean number of bowls of

vegetarian porridge sold in Branch 2 and Branch 3.

Can management conclude that there is a difference

between the two branches?



Example 2: Proposed Solution

• Confidence Interval:

(17 – 11.8) ± 2.179 √(1.067(1/5 + 1/5))

= 5.2 ± 1.424

= (3.776, 6.624)

Here 17 and 11.8 are the sample means of Branch 3 and Branch 2, MSE = 12.8/12 = 1.067, and t = 2.179 for 12 degrees of freedom.

Since zero is not in the interval, conclude that this pair of means differs.

Hence, the mean number of bowls of vegetarian porridge sold in Branch 2 is different from that in Branch 3.



Today’s Problem

• H0: The mean scores for the 3 posters are the

same

• H1: The mean scores for the 3 posters are not

the same

• Critical value = 3.124

• p-value = 0.0000056

• Since F = 14.37 > 3.124 and p-value < 0.05, we reject the null hypothesis and conclude that there is a significant difference in the mean scores for the 3 posters.



Today’s Problem

Using Excel, the ANOVA output is as follows:

Anova: Single Factor

SUMMARY

Groups     Count   Sum   Average   Variance
Poster 1   25      117   4.68      0.476666667
Poster 2   25      145   5.8       0.583333333
Poster 3   25      137   5.48      0.676666667

ANOVA

Source of Variation   SS      df   MS            F             P-value       F crit
Between Groups        16.64   2    8.32          14.37236084   5.59715E-06   3.123907449
Within Groups         41.68   72   0.578888889
Total                 58.32   74

Annotations on the output:

• SS column: Between Groups = SS(Tr), Within Groups = SSE, Total = SST

• df column: Between Groups = k-1, Within Groups = n-k

• MS column: Between Groups = MS(Tr), Within Groups = MSE

• F column: F statistic
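The ANOVA rows of the Excel output can be rebuilt from the SUMMARY block alone. The sketch below assumes the definitional forms SS(Tr) = Σ n_c (mean_c − grand mean)² and SSE = Σ (n_c − 1) × variance_c, which are equivalent to the shortcut formulas used earlier. The variable names are illustrative.

```python
# (group, count, average, variance) copied from the Excel SUMMARY block
groups = [
    ("Poster 1", 25, 4.68, 0.476666667),
    ("Poster 2", 25, 5.80, 0.583333333),
    ("Poster 3", 25, 5.48, 0.676666667),
]

k = len(groups)
n = sum(count for _, count, _, _ in groups)
grand_mean = sum(count * mean for _, count, mean, _ in groups) / n

# Between Groups: spread of the group means around the grand mean
ss_tr = sum(count * (mean - grand_mean) ** 2 for _, count, mean, _ in groups)
# Within Groups: pooled spread inside each group
sse = sum((count - 1) * var for _, count, _, var in groups)

ms_tr = ss_tr / (k - 1)
mse = sse / (n - k)
f_calc = ms_tr / mse

print(f"SS(Tr) = {ss_tr:.2f}, SSE = {sse:.2f}, SST = {ss_tr + sse:.2f}")
print(f"MS(Tr) = {ms_tr:.2f}, MSE = {mse:.4f}, F = {f_calc:.2f}")
```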



Today’s Problem

• 95% confidence interval between scores for poster 1 and 2:

$$(\bar{X}_2 - \bar{X}_1) \pm t\sqrt{\mathrm{MSE}\left(\frac{1}{n_2} + \frac{1}{n_1}\right)}$$

= 0.69 to 1.55

• Since 0 is not in the interval, we conclude that this pair of means differs. We further conclude that the mean score for poster 2 is significantly higher than that for poster 1.
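The interval for posters 1 and 2 can be checked against the Excel output. In this sketch the means, sample sizes and MSE come from that output, while the t value of 1.993 for 72 degrees of freedom is an assumed table value that the slides do not state.

```python
import math

mean_1, mean_2 = 4.68, 5.80   # averages for Poster 1 and Poster 2
n_1 = n_2 = 25                # observations per poster
mse = 0.578888889             # Within Groups MS from the ANOVA table
t_value = 1.993               # assumed: 95% two-sided t value for 72 df

margin = t_value * math.sqrt(mse * (1 / n_2 + 1 / n_1))
diff = mean_2 - mean_1
low, high = diff - margin, diff + margin

print(f"{low:.2f} to {high:.2f}")  # 0.69 to 1.55
```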



Learning Outcomes

• What is Analysis of Variance (ANOVA)?

• Characteristics of the F Distribution

• Test for Equal Variance (single factor

ANOVA test)

• Underlying Assumptions for ANOVA

• Confidence Interval for the Difference

Between Two Means
