derrick-king
Chapter 8 Statistics
Statistics
Statistics deals with the collection and analysis of data to solve real-world problems.
Era of information
Huge datasets
Transaction data
Customer behavior
Human genome
Satellite photographs
Demographics
…………………………..
Data mining
Data reduction
Multidisciplinary statistics methodology
Elements of Statistics
Statistics
• Data Collection
– Survey Sampling
– Experimental Design
– Observational Study
• Data Analysis
– Descriptive Statistics and Statistical Graphics
– Statistical Inference (Estimation, Testing Hypothesis)
Descriptive Statistics
Year  Tutorial  HW    Quiz  Midterm  Final  Overall  Grade
3     A         9.76  9.7   94       88     91.66    A+
3     A         9.66  8.7   88       93     91.26    A+
1     B         9.62  9.2   76       94     88.62    A
2     A         9.50  9.0   69       93     85.70    A-
3     A         9.46  5.3   90       87     85.26    A-
1     B         9.80  8.2   70       92     85.00    A-
1     A         9.18  7.4   88       82     83.98    A-
1     A         9.54  8.4   85       79     82.94    B+
2     A         8.94  8.1   77       85     82.64    B+
3     A         9.74  7.5   75       85     82.24    B+
2     B         9.24  7.1   76       85     81.64    B+
1     A         9.40  7.9   64       90     81.50    B+
3     A         9.84  7.6   85       77     81.44    B+
2     B         8.98  7.5   66       88     80.28    B+
1     B         9.64  8.3   70       82     79.94    B+
3     A         9.72  7.7   55       91     79.42    B+
1     B         9.74  6.9   52       94     79.24    B+
1     B         9.82  8.0   73       79     79.22    B+
3     B         9.52  7.6   71       80     78.42    B
3     A         8.60  7.3   66       85     78.20    B
1     B         9.54  7.0   69       81     77.74    B
3     A         7.72  7.4   81       76     77.42    B
…
Data: a set of numbers representing characteristics of observations
Different types of data
• Categorical (can be classified)
– Nominal (unordered)
– Ordinal (ordered)
– Scale (ordered, and the numerical values are meaningful)
• Continuous
Another concern:
• Whether the data are dependent on or independent of each other. E.g. a time series.
Example 1
If the class teacher conducts a survey on the favorite fruit in class and asks the following question:
Please select your favorite fruit among the following choices. (Select one only.)
Apple Orange Banana Mango Others
In this case, we will collect nominal categorical data, as the data are categorized and the ordering of the answers is not meaningful.
Examples of different types of data
Example 2
In a questionnaire, the following question is stated:
Do you agree that all chickens should be killed in order to prevent the outbreak of H5N1 virus?
Agree Neutral Not agree
We will collect ordinal categorical data in this case, as the data are categorized and the ordering of the answers is meaningful.
Example 3
At the end of questionnaires, usually we will see questions about the personal details, such as income:
Please select the range of your monthly income.
Below $5000   $5001 - $7000   $7001 - $9000   $9001 - $11000   Above $11001
Here, we have scale categorical data, as the data are categorized and the values of the answers are meaningful.
Example 4
If the PE teacher conducts a survey on the physical status of students, the following question may be asked:
What is your height? Answer: cm
Here, we obtain continuous data. Furthermore, the data in this case are independent of each other.
Example 5
Here is the water consumption of a household between April 1999 and May 2001:
Month    Water consumption (m³)
Apr 99   4.5
Aug 99   3.5
Dec 99   5.5
Apr 00   5.5
Aug 00   8.5
Dec 00   7.0
Apr 01   5.0
This is a time series: a particular type of continuous data in which the observations are recorded in sequence over time.
Frequency Distribution
Summary Statistics for Discrete Variables
Grade  Count  CumCnt  Percent  CumPct
A+     2      2       2.25     2.25
A      1      3       1.12     3.37
A-     4      7       4.49     7.87
B+     11     18      12.36    20.22
B      8      26      8.99     29.21
B-     7      33      7.87     37.08
C+     16     49      17.98    55.06
C      14     63      15.73    70.79
C-     11     74      12.36    83.15
D      13     87      14.61    97.75
F      2      89      2.25     100.00
N = 89
Vehicles Frequency Percentage
Cars 45 59
Lorries 22 29
Motorcycles 6 8
Buses 3 4
Total 76 100
Table: Flow of vehicles
Table: Grade of Students
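The cumulative count and percentage columns of the grade table can be rebuilt from the counts alone. The following is a minimal sketch, not the software that produced the original table; the function name `frequency_table` and the two-decimal rounding are assumptions made for illustration.

```python
from itertools import accumulate

# Grade counts copied from the "Grade of Students" table (N = 89).
grade_counts = {
    "A+": 2, "A": 1, "A-": 4, "B+": 11, "B": 8, "B-": 7,
    "C+": 16, "C": 14, "C-": 11, "D": 13, "F": 2,
}


def frequency_table(counts):
    """Return rows of (category, count, cum_count, percent, cum_percent)."""
    total = sum(counts.values())
    cum_counts = accumulate(counts.values())  # running total of the counts
    rows = []
    for (category, count), cum in zip(counts.items(), cum_counts):
        rows.append((
            category,
            count,
            cum,
            round(100 * count / total, 2),  # share of this category
            round(100 * cum / total, 2),    # share up to and including it
        ))
    return rows


table = frequency_table(grade_counts)
for row in table:
    print("{:<3} {:>5} {:>6} {:>7.2f} {:>7.2f}".format(*row))
```

The same function applies to any categorical count table, such as the flow of vehicles, provided the categories are listed in the order in which they should be accumulated.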
Commonly used graphical displays
• Pictogram
• Bar chart
• Pie chart
• Histogram
• Broken line graph
Pictogram
Each figure represents 5 persons
Number of persons who enjoy each of the three entertainments
Magic
Movie
Concert
Grouped Bar chart
Population of different age groups
[Figure: grouped bar chart; x-axis: Age; y-axis: Population (in thousands); series: 1991 and 2001]
Age     1991  2001
20-24   430   400
25-29   580   440
30-34   600   490
35-39   590   630
40-44   400   650
45-49   240   530
Stacked Bar Chart
Revenue and Expenditure in 2000-2001
[Figure: stacked bar chart; bars: Revenue and Expenditure; y-axis: $ Million]
Revenue            $ Million
Direct Taxes       13816
Indirect Taxes     8443
Other Revenue      14084
Expenditure          $ Million
Social Services      13661
Community Services   3902
General Services     13262
Economic Services    1218
Security Services    4859
Pie chart
Grade A        2.70%
Grade B        6.80%
Grade C        14.30%
Grade D        22.10%
Grade E        21.60%
Grade F        16.90%
Unclassified   15.60%
Histogram
**** Area represents frequency ****
[Figure: example of a histogram drawn incorrectly, labelled INCORRECT]
Broken line graph
Sales of a record company
[Figure: broken line graph; x-axis: Month; y-axis: Number of CDs sold]
Month   Number of CDs sold
Jan     19500
Feb     23000
Mar     21000
Apr     20000
May     15000
June    9000
Revenue and Expenditure of HK Government from 1979-80 to 1984-85
[Figure: broken line graph with two series, Revenue and Expenditure; x-axis: Year (79-80 to 84-85); y-axis: $ Million]
Example 1
If the class teacher conducts a survey on the favorite fruit in class and asks the following question:
Please select your favorite fruit among the following choices. (Select one only.)
Apple Orange Banana Mango Others
What graph(s) is/are suitable?
Pie chart / Bar chart / Pictogram
Example 2
In a questionnaire, the following question is stated:
Do you agree that all chickens should be killed in order to prevent the outbreak of H5N1 virus?
Agree Neutral Not agree
Suitable graph: Bar chart
Example 3
At the end of questionnaires, usually we will see questions about the personal details, such as income:
Please select the range of your monthly income.
Below $5000   $5001 - $7000   $7001 - $9000   $9001 - $11000   Above $11001
Suitable graph: Bar chart
Example 4
If the PE teacher conducts a survey on the physical status of students, the following question may be asked:
What is your height? Answer: cm
Suitable graph: Histogram
Example 5
Here is the water consumption of a household between April 1999 and May 2001:
Month    Water consumption (m³)
Apr 99   4.5
Aug 99   3.5
Dec 99   5.5
Apr 00   5.5
Aug 00   8.5
Dec 00   7.0
Apr 01   5.0
Suitable graph: Broken line graph
A. Frequency Distribution
Data that have not been organized in any way are called raw data.
When summarizing large masses of raw data, it is often useful to distribute the data into classes or categories and to determine the number of individuals belonging to each class, called the class frequency. A tabular arrangement of data by classes together with the corresponding class frequencies is called a “frequency distribution” or “frequency table”.
Frequency Table
Example: Heights of 100 Male Students at XYZ University
Height (inches)   Number of Students
60-62             5
63-65             18
66-68             42
69-71             27
72-74             8
Total             100
B. Graphical Representation of Frequency Distribution
(1) Histogram
(2) Frequency Polygon and Curve
(3) Cumulative Frequency Polygon and Curve
C. Measures of Central Tendency
The Three Averages:
Arithmetic Mean
Median
Mode
1) Arithmetic Mean
(i) For ungrouped data
Let $x_1, x_2, x_3, \ldots, x_n$ be the n ungrouped data. Then, the arithmetic mean ($\bar{x}$) is given by:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$
Arithmetic Mean = (Sum of ALL data) / (No. of data)
(ii) For grouped data
$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \cdots + f_n x_n}{f_1 + f_2 + f_3 + \cdots + f_n}$$
E.g. Find the mean no. of potatoes per plant ($\bar{x}$) given the following frequencies of occurrence.
No. of potatoes   2   3   4    5    6    7    8   9
No. of plants     4   5   10   29   30   18   3   1
$$\bar{x} = \frac{4(2) + 5(3) + 10(4) + 29(5) + 30(6) + 18(7) + 3(8) + 1(9)}{4 + 5 + 10 + 29 + 30 + 18 + 3 + 1} = 5.47$$
If class intervals are given, we have to assign the class mid-point to each of the intervals.
E.g. Find the mean of the following distribution.
Class          Frequency (f)   Class mid-point (x)
35 – 40        7               37.5
Over 40 – 45   18              42.5
Over 45 – 50   23              47.5
Over 50 – 55   24              52.5
Over 55 – 60   16              57.5
Over 60 – 65   12              62.5
$$\bar{x} = \frac{7(37.5) + 18(42.5) + 23(47.5) + 24(52.5) + 16(57.5) + 12(62.5)}{7 + 18 + 23 + 24 + 16 + 12} = 50.5$$
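The two grouped-data examples can be checked numerically. This is a small sketch under the assumption that each class is represented by its mid-point; the helper name `grouped_mean` is chosen here for illustration only.

```python
def grouped_mean(values, frequencies):
    """Mean of grouped data: (sum of f * x) / (sum of f)."""
    total_f = sum(frequencies)
    total_fx = sum(f * x for x, f in zip(values, frequencies))
    return total_fx / total_f


# Potato example: the values are the numbers of potatoes themselves.
potatoes = [2, 3, 4, 5, 6, 7, 8, 9]
plants = [4, 5, 10, 29, 30, 18, 3, 1]
potato_mean = grouped_mean(potatoes, plants)

# Class-interval example: each class is replaced by its mid-point.
class_limits = [(35, 40), (40, 45), (45, 50), (50, 55), (55, 60), (60, 65)]
class_freqs = [7, 18, 23, 24, 16, 12]
mid_points = [(low + high) / 2 for low, high in class_limits]
interval_mean = grouped_mean(mid_points, class_freqs)

print(potato_mean, mid_points, interval_mean)
```

Using mid-points gives only an approximation of the true mean, because the individual values inside each class are unknown.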
2) Median
The median is the middle value of a group of data when they are arranged in order of magnitude.
E.g.
Xi:     10  12  7   30  15
Ranked: 7   10  12  15  30
Median = 12
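The ranking step can be sketched in a few lines. The function name `median_of` is hypothetical, and the rule for an even number of data (averaging the two middle values) is a common convention assumed here rather than something stated in the example above.

```python
def median_of(data):
    """Middle value of the data after arranging them in order of magnitude."""
    ranked = sorted(data)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of data: a single middle value exists.
        return ranked[mid]
    # Even number of data (assumed convention): average the two middle values.
    return (ranked[mid - 1] + ranked[mid]) / 2


x = [10, 12, 7, 30, 15]
print(sorted(x), median_of(x))
```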
3) Mode
The mode is the datum that occurs most frequently in a set of data.
E.g.
No. of potatoes   2   3   4    5    6    7    8   9
No. of plants     4   5   10   29   30   18   3   1
Mode = 6
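Reading the mode off a frequency table amounts to finding the value with the highest frequency. The sketch below uses a hypothetical helper `modes_of` and returns a list, since a distribution may have more than one mode.

```python
def modes_of(values, frequencies):
    """Return every value whose frequency equals the highest frequency."""
    highest = max(frequencies)
    return [x for x, f in zip(values, frequencies) if f == highest]


potatoes = [2, 3, 4, 5, 6, 7, 8, 9]
plants = [4, 5, 10, 29, 30, 18, 3, 1]
print(modes_of(potatoes, plants))
```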
E.g. The histogram below shows the results of an experiment in which 140 batteries are tested to determine their lifetimes.
[Figure: histogram "Lifetimes of 140 batteries"; x-axis: Number of hours (class marks 11.5, 13.5, 15.5, 17.5, 19.5); y-axis: Frequency]
(a) Find the median of the lifetimes of the batteries.
(b) Find, correct to 2 decimal places, the mean lifetime of the batteries.
(c) Find the probability that a battery chosen at random from these 140 batteries would have a lifetime of at least 18.5 hours.
(a) Find the median of the lifetimes of the batteries.
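Since there are 140 data in total, we have to find the vertical line that cuts the data at the middle position, i.e. between the 70th and the 71st data.
[Figure: histogram "Lifetimes of 140 batteries"; the frequencies for class marks 11.5, 13.5, 15.5, 17.5, 19.5 are 16, 20, 34, 42, 28]
The first three classes contain 16 + 20 + 34 = 70 batteries, so the 70th datum lies in the class with class mark 15.5 and the 71st in the class with class mark 17.5.
Median
= (15.5 + 17.5) / 2
= 16.5 hours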
(b) Find, correct to 2 decimal places, the mean lifetime of the batteries.
Class mark (x)   Frequency (f)   fx
11.5             16              184
13.5             20              270
15.5             34              527
17.5             42              735
19.5             28              546
Sum              140             2262
Mean
= (sum of fx) / (sum of f)
= 2262 / 140
= 16.16 hours (2 d.p.)
(c) Find the probability that a battery chosen at random from these 140 batteries would have a lifetime of at least 18.5 hours.
The class with class mark 19.5 has class interval 18.5 – 20.5 and frequency 28.
Pr (lifetime of at least 18.5 hrs)
= (No. of batteries with a lifetime of at least 18.5 hrs) / (Total no. of batteries)
= 28 / 140
= 0.2
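The three answers for the battery example can be reproduced from the class marks and frequencies. This is a sketch that mirrors the worked solution; it assumes a class width of 2 hours (so each class starts 1 hour below its class mark), and the variable and function names are illustrative.

```python
class_marks = [11.5, 13.5, 15.5, 17.5, 19.5]
frequencies = [16, 20, 34, 42, 28]
half_width = 1.0  # assumed: class width is 2 hours, e.g. 18.5 to 20.5

n = sum(frequencies)


def class_mark_at(position):
    """Class mark of the class containing the datum at a 1-based position."""
    cumulative = 0
    for mark, f in zip(class_marks, frequencies):
        cumulative += f
        if position <= cumulative:
            return mark
    raise ValueError("position exceeds the number of data")


# (a) The middle lies between the 70th and 71st data; as in the worked
# solution, average the class marks of the classes that contain them.
battery_median = (class_mark_at(n // 2) + class_mark_at(n // 2 + 1)) / 2

# (b) Mean = (sum of fx) / (sum of f), correct to 2 decimal places.
sum_fx = sum(f * x for x, f in zip(class_marks, frequencies))
battery_mean = round(sum_fx / n, 2)

# (c) Count the batteries in classes that start at 18.5 hours or later.
at_least = sum(f for x, f in zip(class_marks, frequencies)
               if x - half_width >= 18.5)
probability = at_least / n

print(battery_median, battery_mean, probability)
```

Averaging the two class marks gives the class boundary here only because the middle position falls exactly between two classes; in other cases the median would have to be estimated inside a single class.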
The Arithmetic Mean
1. It requires all the given data in its calculation.
2. It is easily affected by extreme values, and thus may be very misleading.
3. It is often used in further statistical calculations.
Arithmetic Mean = (Sum of ALL data) / (No. of data)
Here, we have two sets of data:
Data 1: {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}
Data 2: {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 100000}
Then,
Arithmetic Mean of Data 1 = 11 / 11 = 1
Arithmetic Mean of Data 2 = 100010 / 11 ≈ 9091.82
The arithmetic mean has many nice properties, and hence many advanced statistical tools are developed based on it.
The Median
1. It requires only the middle datum (or the two middle data) in its calculation.
2. It is not affected by extreme values.
3. It is seldom used in further statistical calculations.
The median is the middle value of a group of data when they are arranged in order of magnitude.
Data 1: {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}
Data 2: {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 100000}
Then,
Median of Data 1 = 1
Median of Data 2 = 1
The topic of order statistics is very complicated.
The Mode
1. It is easy to understand and convenient to use.
2. It is not affected by extreme values.
3. There may be more than one mode in a distribution.
4. It is rarely used in further statistical calculations.
The mode is the value with the highest frequency.
Data 1: {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}
Data 2: {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 100000}
Then,
Mode of Data 1 = 1
Mode of Data 2 = 1
Points to note when using the averages
1. The measures should be used sensibly (i.e., they should be easy to interpret and not misleading).
2. Usually, the MEAN is useful and not misleading. However, when the distribution of the data is highly skewed (i.e., extreme values exist), it is better to use the MEDIAN.
3. We use MODE only when no further statistical analysis is needed.
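The effect of a single extreme value on the three averages can be checked with Python's standard `statistics` module, using Data 1 and Data 2 from this chapter. This is a sketch for illustration; the variable names are assumptions, and both data sets are taken as listed, with eleven values each.

```python
from statistics import mean, median, multimode

data_1 = [1] * 11              # eleven 1s
data_2 = [1] * 10 + [100000]   # ten 1s and one extreme value

for name, data in (("Data 1", data_1), ("Data 2", data_2)):
    print(name,
          "mean =", round(mean(data), 2),
          "median =", median(data),
          "mode(s) =", multimode(data))
```

Only the mean moves when the extreme value is introduced; the median and the mode stay at 1, which is why the median is preferred for highly skewed data.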