patrick-watson
Chapter 1 & 3
The Role of Statistics &
Graphical Methods for Describing Data
Statistics: the science of collecting, analyzing, and drawing conclusions from data
Suppose we wanted to know something about the GPAs of high school graduates in the nation this year.
We could collect data from all high schools in the nation.
What term would be used to describe “all
high school graduates”?
Population: the entire collection of individuals or objects about which information is desired
A census is performed to gather information about the entire population
What do you call it when you collect data about the
entire population?
Why might we not want to use a census here?
If we didn’t perform a census, what would we do?
Sample: a subset of the population, selected for study in some prescribed manner
What would a sample of all high school graduates across the nation look like?
A list of GPAs created by randomly selecting high school graduates from each state.
Suppose we wanted to know something about the GPAs of high school graduates in the nation this year.
We could collect data from a sample of high schools in the nation.
Once we have collected the data, what would we do with it?
Descriptive statistics: the methods of organizing & summarizing data
If the sample of high school GPAs contained 10,000 numbers, how could the data be described or summarized?
• Create a graph
• State the range of GPAs
• Calculate the average GPA
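A minimal sketch of two of these summaries in Python. The GPA values are made up for illustration (an assumption, not data from these slides), and a real sample would have many more values.

```python
import statistics

# Hypothetical sample of GPAs (illustrative values only, not real data)
gpas = [2.1, 3.4, 3.0, 3.8, 2.7, 3.0, 3.3, 2.5]

lowest, highest = min(gpas), max(gpas)
gpa_range = highest - lowest        # range = largest value - smallest value
average = statistics.mean(gpas)     # arithmetic mean of the sample

print(f"GPAs run from {lowest} to {highest} (range {gpa_range:.1f})")
print(f"Average GPA: {average:.3f}")
```

The same two lines of arithmetic work unchanged on a list of 10,000 values, which is the point of descriptive statistics: a few numbers stand in for the whole list.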
Could we use the data from this sample to answer our question?
Inferential statistics: involves making generalizations from a sample to a population
Based on the sample, if the average GPA for high school graduates was 3.0, what generalization could be made?
The average national GPA for this year’s high school graduates is approximately 3.0.
Could someone claim that the average GPA for SHS graduates is 3.0?
No. Generalizations based on the results of a sample can only be made back to the population from which the sample came.
Be sure to sample from the population of interest!!
Variable: any characteristic whose value may change from one individual to another
Is this a variable . . . the number of wrecks per week at the intersection outside?
Data: observations on a single variable or simultaneously on two or more variables
For this variable . . . the number of wrecks per week at the intersection outside . . . what could observations be?
Types of variables
Categorical variables
• or qualitative
• identifies basic differentiating characteristics of the population
Numerical variables
• or quantitative
• observations or measurements take on numerical values
• makes sense to average these values
• two types - discrete & continuous
Discrete (numerical)
• listable set of values
• usually counts of items
Continuous (numerical)
• data can take on any values in the domain of the variable
• usually measurements of something
Classification by the number of variables
Univariate - data that describes a single characteristic of the population
Bivariate - data that describes two characteristics of the population
Multivariate - data that describes more than two characteristics (beyond the scope of this course)
Identify the following variables:
1. the appraised value of homes in Fort Smith: Discrete numerical
2. the color of cars in the teacher’s lot: Categorical
3. the number of calculators owned by students at your school: Discrete numerical
4. the zip code of an individual: Categorical
5. the amount of time it takes students to drive to school: Continuous numerical
Is money a measurement or a count?
Graphs for categorical data
Bar Graph
• Used for categorical data
• Bars do not touch
• Categorical variable is typically on the horizontal axis
• To describe – comment on which occurred the most often or least often
• May make a double bar graph or segmented bar graph for bivariate categorical data sets
Using class survey data:
graph birth month
graph gender & favorite ice cream
Pie (Circle) graph
• Used for categorical data
• To make:
– Proportion × 360°
– Using a protractor, mark off each part
• To describe – comment on which occurred the most often or least often
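The slice angles for a pie graph can be computed as in the short sketch below: each category's proportion is multiplied by 360°. The category counts are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical category counts (illustrative only, not class survey data)
counts = {"Chocolate": 12, "Vanilla": 9, "Strawberry": 3}

total = sum(counts.values())

# Central angle of each slice = proportion of the whole * 360 degrees
angles = {label: count / total * 360 for label, count in counts.items()}

for label, angle in angles.items():
    print(f"{label}: {angle:.1f} degrees")
```

The angles always add up to 360°, which is a quick check before marking them off with a protractor.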
Graphs for numerical data
Dotplot
Used with numerical data (either discrete or continuous)
Made by putting dots (or X’s) on a number line
Can make comparative dotplots by using the same axis for multiple groups
Distribution Activity . . .
Types (shapes) of Distributions
Symmetrical
• refers to data in which both sides are (more or less) the same when the graph is folded vertically down the middle
• bell-shaped is a special type
– has a center mound with two sloping tails
Uniform
• refers to data in which every class has equal or approximately equal frequency
Skewed (left or right)
• refers to data in which one side (tail) is longer than the other side
• the direction of skewness is on the side of the longer tail
Bimodal (multi-modal)
• refers to data in which two (or more) classes have the largest frequency & are separated by at least one other class
How to describe a numerical,
univariate graph
What strikes you as the most distinctive difference among the distributions of exam scores in classes A, B, & C ?
1. Center
• discuss where the middle of the data falls
• three types of central tendency – mean, median, & mode
What strikes you as the most distinctive difference among the distributions of scores in
classes D, E, & F?
2. Spread
• discuss how spread out the data is
• refers to the variability of the data
– Range, standard deviation, IQR
What strikes you as the most distinctive difference among the distributions of exam scores in classes G, H, & I ?
3. Shape
• refers to the overall shape of the distribution
• symmetrical, uniform, skewed, or bimodal
What strikes you as the most distinctive feature of the distribution of exam scores in class K?
4. Unusual occurrences
• outliers - values that lie away from the rest of the data
• gaps
• clusters
• anything else unusual
5. In context
• You must write your answer in reference to the specifics in the problem, using correct statistical vocabulary and using complete sentences!
Describing Quantitative Data
Histograms, Stem and Leaf Plots, Dotplots, Boxplots
Just CUSS and BS!
Center: “the typical value”
• Median
• Mean
Unusual Features
• Gaps
• Outliers
Shape
• single vs. multiple modes (unimodal, bimodal)
• symmetry vs. skewness
Illustrated Distribution Shapes: Unimodal, Bimodal, Multimodal, Symmetric, Skew positively (right), Skew negatively (left)
Spread: “how tightly values cluster around the center”
• Range
• Standard deviation
• 5-number summary
• IQR
And Be Specific!
More graphs for numerical data
Stemplots (stem & leaf plots)
• Used with univariate, numerical data
• Must have a key so that we know how to read the numbers
• Can split stems when you have a long list of leaves
• Can have a comparative stemplot with two groups
Would a stemplot be a good graph for the number of pieces of gum chewed per day by
AP Stat students? Why or why not?
Would a stemplot be a good graph for the number of pairs of shoes owned by AP Stat
students? Why or why not?
Basic Stemplots
A stemplot is quite similar to a dotplot. Like the dotplot, the stemplot is arranged
along a type of number line (stems). Instead of plotting dots above
corresponding points, you place (in order) leaves above the points.
Let’s consider how it looks in an example.
Consider the dataset comprised of people’s ages at a family reunion.
Their ages are:
4, 6, 7, 13, 16, 17, 23, 31, 36, 40, 42, 44, 53, 57, 58, 62, 84
A dotplot would be cumbersome and too spread out to be informative (consider numbering off the number line from 4 to 84).
A stemplot will group the data together, thus compacting the graph and making it easier to visualize the data.
To create the stemplot, first determine the stems.
The stems in this case should be the tens place (the values in the tens place range from 0 to 8).
Create your “number line” from 0 to 8.
– This is typically done vertically.
0
1
2
3
4
5
6
7
8
Then, place the leaves next to the corresponding stems in sequential order.
Each digit placed as a leaf should take up the same amount of space to provide a visual sense of how many values fall in that area of the number line.
0 4 6 7
1 3 6 7
2 3
3 1 6
4 0 2 4
5 3 7 8
6 2
7
8 4
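The construction above can be sketched in a few lines of Python. The function name `stemplot` and the `|` separator are choices made for this illustration; the sketch assumes whole-number data with the tens place as the stem and the units digit as the leaf.

```python
from collections import defaultdict

ages = [4, 6, 7, 13, 16, 17, 23, 31, 36, 40, 42, 44, 53, 57, 58, 62, 84]

def stemplot(values):
    """Return stemplot rows for whole numbers: stem = tens, leaf = units digit."""
    leaves = defaultdict(list)
    for value in sorted(values):          # sorting keeps leaves in sequential order
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    rows = []
    # List every stem from smallest to largest, even empty ones (like 7),
    # so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | " + " ".join(str(leaf) for leaf in leaves[stem])
        rows.append(row.rstrip())
    return rows

print("\n".join(stemplot(ages)))
```

Keeping the empty stem 7 in the output matters: dropping it would hide the gap between 62 and 84.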
Things to note include describing the center, shape, spread, and extreme values (outliers) of the distribution.
– The center seems to be in the mid 30s (the median age is 36).
– The shape is a bit odd; certainly not bell-shaped, as it seems to have two peaks. This is typically referred to as bimodal.
– The spread is what you might expect for data representing ages of humans.
– It seems like the 84-year-old may be an outlier.
0 4 6 7
1 3 6 7
2 3
3 1 6
4 0 2 4
5 3 7 8
6 2
7
8 4
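The center and spread of the family reunion ages can also be checked numerically. This is a sketch using Python's standard `statistics` module (Python 3.8 or later for `quantiles`); the quartile method is the module's default and other textbooks or calculators may give slightly different quartiles.

```python
import statistics

ages = [4, 6, 7, 13, 16, 17, 23, 31, 36, 40, 42, 44, 53, 57, 58, 62, 84]

# Center
median = statistics.median(ages)
mean = statistics.mean(ages)

# Spread
data_range = max(ages) - min(ages)
std_dev = statistics.stdev(ages)               # sample standard deviation
q1, _, q3 = statistics.quantiles(ages, n=4)    # quartiles, default "exclusive" method
iqr = q3 - q1

print(f"median = {median}, mean = {mean:.1f}")
print(f"range = {data_range}, standard deviation = {std_dev:.1f}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

For this dataset the median is 36 and the mean is about 34.9, with Q1 = 14.5 and Q3 = 55, so the IQR is 40.5.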
Advanced Stemplots
Many times, the data does not lend itself to tens digits and units digits (in fact, it rarely does).
– In most cases you will need to do some rounding or truncating (cutting off the excess digits for the purpose of the graph) of the figures (see example 1).
– Also, you may need to split the stems.
• This means you may have two stems for each leading digit. The first stem will contain leaves from 0 to 4 and the second stem will contain leaves from 5 to 9.
• See example 2 below for clarification.
– Finally, you may want to construct back-to-back stemplots in order to compare two distributions.
• See example 3.
Example 1 - Rounding
Original Data: 2.234, 3.23525, 3.76447, 3.794, 4.252, 4.8886
Revised Data (after rounding): 2.2, 3.2, 3.8, 3.8, 4.3, 4.9
Stemplot (done in StatCrunch – plus it rounds for you!)
2 : 2
3 : 288
4 : 39
Example 2 – Split Stems
Original Data: 1.1, 1.2, 1.4, 1.6, 1.6, 1.7, 1.9, 1.9, 2.0, 2.0, 2.3, 2.5, 2.7, 3.5
1 : 12466799
2 : 00357
3 : 5
OR
1L : 124
1H : 66799
2L : 003
2H : 57
3L :
3H : 5
Which of the above plots is more telling with regards to the shape, center, spread, and extremes of the distribution?
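A sketch of the split-stem version in Python, assuming data recorded to one decimal place. The function name `split_stemplot` is made up for this illustration; the `L`/`H` labels follow the example above (leaves 0 to 4 go on the L stem, 5 to 9 on the H stem).

```python
data = [1.1, 1.2, 1.4, 1.6, 1.6, 1.7, 1.9, 1.9, 2.0, 2.0, 2.3, 2.5, 2.7, 3.5]

def split_stemplot(values):
    """Stemplot rows with each stem split into L (leaves 0-4) and H (leaves 5-9)."""
    rows = {}
    for value in sorted(values):
        tenths = round(value * 10)        # work in whole tenths to avoid float noise
        stem, leaf = divmod(tenths, 10)
        half = "L" if leaf <= 4 else "H"
        rows.setdefault((stem, half), []).append(str(leaf))
    stems = [stem for stem, _ in rows]
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        for half in ("L", "H"):
            # Empty halves (like 3L) are still listed so gaps remain visible
            lines.append(f"{stem}{half} : " + "".join(rows.get((stem, half), [])))
    return lines

print("\n".join(split_stemplot(data)))
```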
Example 3 – Back-to-Back Stemplots
Original data consists of two datasets. Consider the number of home runs hit by Barry Bonds and Mark McGwire from the years 1987 to 2001.
Year Bonds McGwire
1987 25 49
1988 24 32
1989 19 33
1990 33 39
1991 25 22
1992 34 42
1993 46 9
1994 37 9
1995 33 39
1996 42 52
1997 40 58
1998 37 70
1999 34 65
2000 49 32
2001 73 29
To construct the back-to-back stemplot, the stems go down the middle.
On the left-hand side, plot the data points of one of your distributions (see McGwire’s data on the left).
Notice the ordering of the leaves on McGwire’s side of the stemplot: the leaves increase moving away from the stems.
On the right-hand side, plot your regular stemplot.
This plot is useful in comparing distributions. It can quickly give the educated reader a sense of how the center, shape, extremes and spread of two distributions compare.
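A back-to-back stemplot for the home run table can be sketched as follows. The function name `back_to_back` and the `|` separators are choices made for this illustration; the data are the Bonds and McGwire columns from the table above, with McGwire on the left.

```python
bonds = [25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73]
mcgwire = [49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70, 65, 32, 29]

def back_to_back(left, right):
    """Return rows of a back-to-back stemplot (stem = tens, leaf = units digit)."""
    def leaves_by_stem(values):
        table = {}
        for value in sorted(values):
            stem, leaf = divmod(value, 10)
            table.setdefault(stem, []).append(str(leaf))
        return table

    left_t, right_t = leaves_by_stem(left), leaves_by_stem(right)
    low = min(min(left_t), min(right_t))
    high = max(max(left_t), max(right_t))
    width = max(len(leaves) for leaves in left_t.values())
    lines = []
    for stem in range(low, high + 1):
        # Left-hand leaves are reversed so they increase moving away from the stem
        left_leaves = "".join(reversed(left_t.get(stem, [])))
        right_leaves = "".join(right_t.get(stem, []))
        lines.append(f"{left_leaves:>{width}} | {stem} | {right_leaves}")
    return lines

print("\n".join(back_to_back(mcgwire, bonds)))
```

Both players share the same stems down the middle, so differences in center and spread (for example, McGwire's two seasons with 9 home runs) can be read across a single row.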
Example:
The following data are price per ounce for various brands of dandruff shampoo at a local grocery store.
0.32 0.21 0.29 0.54 0.17 0.28 0.36 0.23
Can you make a stemplot with this data?
Example: Tobacco use in G-rated Movies
Total tobacco exposure time (in seconds) for Disney movies: 223 176 548 37 158 51 299 37 11 165 74 9 2 6 23 206 9
Total tobacco exposure time (in seconds) for other studios’ movies: 205 162 6 1 117 5 91 155 24 55 17
Make a comparative stemplot.
Histograms
• Used with numerical data
• Bars touch on histograms
• Two types
– Discrete: bars are centered over discrete values
– Continuous: bars cover a class (interval) of values
• For comparative histograms – use two separate graphs with the same scale on the horizontal axis
Would a histogram be a good graph for the fastest speed driven by AP Stat students?
Why or why not?
Would a histogram be a good graph for the number of pieces of gum chewed per day by
AP Stat students? Why or why not?
50 students were asked the question, “How many textbooks did you purchase last
term?”
Number of Textbooks    Frequency    Relative Frequency
1 or 2                 4            0.08
3 or 4                 16           0.32
5 or 6                 24           0.48
7 or 8                 6            0.12
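The relative frequency column is each frequency divided by the total number of students. A short sketch that reproduces it from the frequencies in the table:

```python
# Frequencies from the textbook survey table
frequencies = {"1 or 2": 4, "3 or 4": 16, "5 or 6": 24, "7 or 8": 6}

total = sum(frequencies.values())   # number of students surveyed

# Relative frequency = frequency / total
relative = {group: count / total for group, count in frequencies.items()}

for group, proportion in relative.items():
    print(f"{group}: {proportion:.2f}")
```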
[Histogram: “How many textbooks did you purchase last term?” Horizontal axis: # of Textbooks (1 or 2, 3 or 4, 5 or 6, 7 or 8). Vertical axis: Proportion of Students (0.00 to 0.60).]
The largest group of students bought 5 or 6 textbooks with 3 or 4 being the next largest frequency.
Another version with the scales produced differently.
When working with continuous data, the steps to construct a histogram are
1. Decide into how many groups or “classes” you want to break up the data. Typically this is somewhere between 5 and 20. A good rule of thumb is to aim for an average of more than 5 observations per group.*
2. Use your answer to help decide the “width” of each group.
3. Determine the “starting point” for the lowest group.
*A quick estimate for a reasonable number of intervals is √(number of observations).
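A sketch of steps 1 and 2 in Python, assuming the quick estimate is the square root of the number of observations. The function name `suggest_classes` and the example inputs (59 observations between 0.2 and 15) are illustrative assumptions.

```python
import math

def suggest_classes(n_observations, smallest, largest):
    """Suggest a number of classes and a raw class width for a histogram."""
    k = round(math.sqrt(n_observations))   # quick estimate: about sqrt(n) classes
    width = (largest - smallest) / k       # raw width, before rounding to a convenient value
    return k, width

k, width = suggest_classes(59, 0.2, 15)
print(f"about {k} classes, raw width {width:.2f}")
```

Here the suggestion is about 8 classes with a raw width of 1.85. In practice the width would be rounded up to a convenient value such as 2, and the starting point (step 3) chosen at or below the smallest value, for example 0.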
The table below lists the number of siblings for this year’s statistics students.
2 0 3 2 2 3 1 2
3 3 2 1 2 2 1 1
3 3 1 2 2 2 5 5
2 1 0 0 1 2 3 7
3 3 5 2 1 0 1 1
1 1 1 6 1 2 1 3
1 4 8 2 0 1 2 2
2 1 2 2
Interval    Frequency
0           5
1           18
2           20
3           10
4           1
5           3
6           1
7           1
8           1
This table lists the distances statistics students drive to school.
8 5 4.2 2.89 4.9 1.5 6 6
8 4.6 8 3 4 7 15 0.5
10 1 5.38 2.6 1 2 9 5
10 6.4 3.3 15 5 3.6 5.5
5 8 1 5.7 2 4.3 3
8 10 6.24 7 3 6.2 3.13
7 15 10 13 5 15 13.2
10 7 6.6 6 0.2 15 10
Interval    Frequency
0 to 2      6
2 to 4      10
4 to 6      13
6 to 8      11
8 to 10     6
10 to 12    6
12 to 14    2
14 to 16    5
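The tallying for the distance data can be sketched in Python as below. The sketch assumes each class includes its lower endpoint but not its upper endpoint (so a distance of exactly 8 is counted in 8 to 10), which is the convention that reproduces the frequencies in the table.

```python
distances = [
    8, 5, 4.2, 2.89, 4.9, 1.5, 6, 6,
    8, 4.6, 8, 3, 4, 7, 15, 0.5,
    10, 1, 5.38, 2.6, 1, 2, 9, 5,
    10, 6.4, 3.3, 15, 5, 3.6, 5.5,
    5, 8, 1, 5.7, 2, 4.3, 3,
    8, 10, 6.24, 7, 3, 6.2, 3.13,
    7, 15, 10, 13, 5, 15, 13.2,
    10, 7, 6.6, 6, 0.2, 15, 10,
]

start, width, n_classes = 0, 2, 8   # classes 0 to 2, 2 to 4, ..., 14 to 16

counts = [0] * n_classes
for d in distances:
    index = int((d - start) // width)   # which class the value falls in
    counts[index] += 1

for i, count in enumerate(counts):
    lower = start + i * width
    print(f"{lower} to {lower + width}: {count}")
```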
The two histograms below display the distribution of heights of gymnasts and the distribution of heights of female basketball players. Which is which? Why?
Heights – Figure A
Heights – Figure B
Suppose you found a pair of size 6 shoes left outside the locker room. Which team would you go to first to find the owner of the shoes? Why?
Suppose a tall woman (5 ft 11 in) tells you she is looking for her sister who is practicing in the gym. To which team would you send her? Why?
Cumulative Relative Frequency Plot (Ogive)
• . . . is used to answer questions about percentiles.
• Percentiles are the percent of individuals that are at or below a certain value.
• Quartiles are located every 25% of the data. The first quartile (Q1) is the 25th percentile, while the third quartile (Q3) is the 75th percentile. What is the special name for Q2?
• Interquartile Range (IQR) is the range of the middle half (50%) of the data.
IQR = Q3 – Q1
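The definition of a percentile (the percent of individuals at or below a value) translates directly into code. In this sketch the function name `percent_at_or_below` and the list of scores are hypothetical, used only to illustrate the definition.

```python
def percent_at_or_below(values, x):
    """Percent of the values that are at or below x."""
    return 100 * sum(v <= x for v in values) / len(values)

# Hypothetical exam scores (illustrative only)
scores = [55, 62, 70, 70, 78, 81, 85, 90, 93, 99]

print(percent_at_or_below(scores, 78))   # 5 of the 10 scores are at or below 78
```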
This table shows weights of 79 randomly selected students.
Class Interval    Frequency    Relative Frequency
100 to <115       2            0.025
115 to <130       10           0.127
130 to <145       21           0.266
145 to <160       15           0.190
160 to <175       15           0.190
175 to <190       8            0.101
190 to <205       3            0.038
205 to <220       1            0.013
220 to <235       2            0.025
235 to <250       2            0.025
Total             79           1.000
Mark the boundaries of the class intervals on a horizontal axis
Use frequency or relative frequency on the vertical scale.
Another version of a frequency table and histogram for the weight data with a class width
of 20.
Class Interval    Frequency    Relative Frequency
100 to <120       3            0.038
120 to <140       21           0.266
140 to <160       24           0.304
160 to <180       19           0.241
180 to <200       5            0.063
200 to <220       3            0.038
220 to <240       4            0.051
Total             79           1.001
The resulting histogram.
Yet another version of a frequency table and histogram for the weight data with a class width of 20.
Class Interval    Frequency    Relative Frequency
95 to <115        2            0.025
115 to <135       17           0.215
135 to <155       23           0.291
155 to <175       21           0.266
175 to <195       8            0.101
195 to <215       4            0.051
215 to <235       2            0.025
235 to <255       2            0.025
Total             79           0.999
The corresponding histogram.
Cumulative Relative Frequency Table
If we keep track of the proportion of the data that falls below the upper boundaries of the classes, we have a cumulative relative frequency table.
Class Interval    Relative Frequency    Cumulative Relative Frequency
100 to <115       0.025                 0.025
115 to <130       0.127                 0.152
130 to <145       0.266                 0.418
145 to <160       0.190                 0.608
160 to <175       0.190                 0.797
175 to <190       0.101                 0.899
190 to <205       0.038                 0.937
205 to <220       0.013                 0.949
220 to <235       0.025                 0.975
235 to <250       0.025                 1.000
If we graph the cumulative relative frequencies against the upper endpoint of the corresponding interval, we have a cumulative relative frequency plot.
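The cumulative relative frequencies can be reproduced from the class frequencies of the weight data with a running total. This sketch uses `itertools.accumulate` from the standard library; the pairing with upper endpoints gives the points that would be plotted.

```python
from itertools import accumulate

# Upper endpoints and frequencies of the weight classes (class width 15)
upper_endpoints = [115, 130, 145, 160, 175, 190, 205, 220, 235, 250]
frequencies = [2, 10, 21, 15, 15, 8, 3, 1, 2, 2]

total = sum(frequencies)   # 79 students

# Running total of frequencies divided by the overall total
cumulative = [round(running / total, 3) for running in accumulate(frequencies)]

# Points of the cumulative relative frequency plot: (upper endpoint, cumulative proportion)
points = list(zip(upper_endpoints, cumulative))
for endpoint, proportion in points:
    print(f"below {endpoint}: {proportion:.3f}")
```

Computing the running total from the counts, rather than adding the already rounded relative frequencies, avoids accumulating rounding error.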
Histograms with uneven class widths
For many reasons, either for convenience or because that is the way data was obtained, the data may be broken up in groups of uneven width as in the following example referring to the student ages.
Class Interval    Frequency    Relative Frequency
18 to <20         26           0.329
20 to <22         24           0.304
22 to <24         17           0.215
24 to <26         4            0.051
26 to <28         1            0.013
28 to <40         5            0.063
40 to <50         2            0.025
If a frequency (or relative frequency) histogram is drawn with the heights of the bars being the frequencies (relative frequencies), the result is distorted. Notice that it appears that there are a lot of people over 28 when there are only a few.
To correct the distortion, we create a density histogram. The vertical scale is called the density and the density of a class is calculated by
density = rectangle height = (relative frequency of class) / (class width)
This choice for the density makes the area of the rectangle equal to the relative frequency.
Continuing this example we have
Class Interval    Frequency    Relative Frequency    Density
18 to <20         26           0.329                 0.165
20 to <22         24           0.304                 0.152
22 to <24         17           0.215                 0.108
24 to <26         4            0.051                 0.026
26 to <28         1            0.013                 0.007
28 to <40         5            0.063                 0.005
40 to <50         2            0.025                 0.003
For example, the first density is .329/2 and the density for 28 to <40 is .063/12.
The resulting histogram is now a reasonable representation of the data.
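The density column can be reproduced with the sketch below. It assumes, as the table appears to, that each density is the rounded relative frequency divided by the class width and then rounded half up to three decimal places; `decimal` is used so that the rounding behaves exactly as it would by hand.

```python
from decimal import Decimal, ROUND_HALF_UP

# (lower bound, upper bound, relative frequency as printed in the table)
classes = [
    (18, 20, "0.329"), (20, 22, "0.304"), (22, 24, "0.215"), (24, 26, "0.051"),
    (26, 28, "0.013"), (28, 40, "0.063"), (40, 50, "0.025"),
]

densities = []
for lower, upper, rel_freq in classes:
    width = upper - lower
    density = Decimal(rel_freq) / width    # density = relative frequency / class width
    densities.append(density.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP))

for (lower, upper, _), density in zip(classes, densities):
    print(f"{lower} to <{upper}: density {density}")
```

Because density times class width gives back the relative frequency, the area of each rectangle (not its height) now represents the share of students in that class.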