M140: Sampling, Relationships and Plotting Data
Dr Jason [email protected]
07311 188800
This tutorial will begin at 10am and will last for approximately an hour.
This tutorial will be recorded. Please let me know if you have any questions or concerns about this.
Things you might need for this tutorial:
• M140 Computer Book & Book 2
• Pen, paper & calculator
• Drink of your choice
Don’t forget to set up your audio using the Audio Wizard (in the ‘Meeting Menu’). Some headsets have independent volume controls so you may need to adjust these too.
You will also need to set up your mic if you plan on using it. Clicking the Mic symbol at the top of your Adobe Connect window will toggle it on/off.
Not connected Connected and live Connected and muted
iCMA41 due on 2 December!
Good morning!
Mics will be muted until towards the end of the tutorial, when I will also stop recording. Do use the Chat Box if you have a question during the tutorial! I will email slides out after the tutorial.
Tutorials are enhanced by your interaction. Please vote in the polls, ask questions and work through the exercises.
Feel free to ask any questions or provide feedback by email afterwards, or use the Private Chat function during the tutorial if you prefer.
Sampling, Relationships and Plotting Data
• Minitab – generating lists of random numbers
  • Uniform, Normal distributions
• Sampling Methods
  • Simple random sampling – Minitab
  • Systematic random sampling
  • Stratified sampling – Minitab
  • Cluster sampling
• Exploring Relationships
  • Visual – scatterplots in Minitab
  • Least Squares Regression in Minitab
Computer Book has detailed instructions!
Generating Random Numbers
True or false? Computers cannot generate true random numbers
• Computers generate pseudo-random numbers
• Given a long enough sequence, patterns will emerge and the distribution will depart from that of a truly random source
• This is bad for strong encryption
• You can force Minitab to generate the same ‘random’ numbers each time, by specifying a base or a seed value
• This is useful when you want someone else to get the same random values as you
• True random number sources
  • Random number tables
  • Dice or well-shuffled cards
  • Physical phenomena such as radioactive decay
  • Lava lamps!
• For our purposes and most scientific uses, computer-generated numbers are fine
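The effect of a seed can be sketched outside Minitab. The Python snippet below is an illustration only: it uses the standard library's `random` module rather than Minitab's generator, and the helper name `draw` and the seed values 140 and 141 are arbitrary choices made for this example.

```python
import random

def draw(seed, count=5):
    """Return `count` pseudo-random integers from 1 to 100 for a given seed."""
    rng = random.Random(seed)  # a private generator, so other code is unaffected
    return [rng.randint(1, 100) for _ in range(count)]

first = draw(seed=140)
second = draw(seed=140)   # same seed, so the same 'random' numbers
third = draw(seed=141)    # a different seed starts a different sequence

print(first)
print(second)
print(first == second)
print(third)
```

Sharing the seed alongside the results is what lets someone else reproduce the same list.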
Random Numbers With Minitab 1
Which is the Uniform distribution and which is the Normal distribution?
• Uniform: every number in the range has an equal chance of occurring, which makes it ideal for selecting samples
• Normal (or Gaussian): values are drawn from a Normal distribution, so they cluster around the mean (0 here) and become rarer further away from it
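The difference between the two distributions can also be seen numerically. The sketch below is a Python illustration rather than Minitab output; the seed, the sample size of 10,000 and the helper `share_within` are assumptions made for the example.

```python
import random
import statistics

rng = random.Random(1)  # seed chosen arbitrarily so the run is repeatable

# Uniform: every value between 0 and 1 is equally likely
uniform_values = [rng.uniform(0, 1) for _ in range(10_000)]

# Normal (Gaussian): values cluster around the mean, here 0 with standard deviation 1
normal_values = [rng.gauss(0, 1) for _ in range(10_000)]

def share_within(values, low, high):
    """Fraction of values lying between low and high (inclusive)."""
    return sum(low <= v <= high for v in values) / len(values)

print("Uniform mean:", round(statistics.mean(uniform_values), 2))
print("Normal mean:", round(statistics.mean(normal_values), 2))
print("Uniform values in the middle half:", share_within(uniform_values, 0.25, 0.75))
print("Normal values within 1 sd of the mean:", share_within(normal_values, -1, 1))
```

Roughly half of the uniform values should fall in the middle half of the range, whereas about two thirds of the normal values should lie within one standard deviation of the mean.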
Random Numbers With Minitab 2-5
1. Create a column (or columns) to receive the random numbers
2. Select Calc -> Random Data and select your distribution
   • Uniform for regular random numbers
3. Specify the receiving column, number of rows and parameters
4. Format the receiving column if necessary, e.g. specify the number of decimal places
Sampling Theory & Minitab Practice
How Much Is This Forest Worth?
• A farmer wants to know how much the trees in his forest are worth
• We can’t measure every tree to determine its value… so how can we answer?
• Sampling!
Exploratory Data Analysis (EDA): Tally
• Minitab function to count different values in a column
• Numeric or nominal data
• It’s tempting here to make a back-of-envelope estimate
  • This will be very rough and not suitable for our purposes
  • But it does provide an indication
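For readers who want to see what a tally does, here is a minimal Python sketch using `collections.Counter`. The short `species` list is hypothetical and is not the tutorial's worksheet; it only shows the counting idea.

```python
from collections import Counter

# Hypothetical column of nominal data (not the tutorial's worksheet)
species = ["Oak", "Birch", "Beech", "Birch", "Elm",
           "Oak", "Birch", "Yew", "Beech", "Birch"]

tally = Counter(species)        # counts each distinct value in the column
total = sum(tally.values())

# Print each value with its count and percentage of the column
for name, count in sorted(tally.items()):
    print(f"{name:<6} {count:>3} {100 * count / total:5.1f}%")
```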
Exploratory Data Analysis (EDA): Graphical Summary
• Look at age initially to understand the spread
• Older trees should be larger and more valuable
What might account for the skew in ages?
• Look at age by tree species
  • Are trees planted in rotation or in groups?
• The list of available columns changes because nominal data is valid for categorising a numeric variable
Species Median (yr) Range (yr)
Beech 55 22-80
Birch 20 15-23
Elm 80 0.9-98
Oak 102 80-150
Yew 124 110-150
Sampling Methods 1
• Simple Random Sampling
  • Select n at random from a list
  • With or without replacement
Sampling 1 - Minitab: Simple Random Sampling
• Each member is equally likely to be sampled
• Selecting one member does not affect the chance of selecting any other member
• Replacement
  • Without: no member can be selected more than once
  • With: may select the same datum multiple times, but the draws are fully independent and this may be better for small datasets
1. Create a column to accept the sample list
2. Open the Sample From Columns dialogue box
3. Complete the required fields
   • From Column will be the serial number or index of the tree to be measured
4. Click OK
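The same idea can be sketched in Python for comparison. This is not what Minitab runs internally; the population of 200 serial numbers and the sample size of 30 are assumptions taken from the totals of the species table used later in the tutorial, and the seed is arbitrary.

```python
import random

rng = random.Random(2020)          # arbitrary seed, for a repeatable example
tree_ids = list(range(1, 201))     # assumed serial numbers 1..200
sample_size = 30                   # assumed sample size

# Without replacement: no tree can appear twice
without_replacement = rng.sample(tree_ids, k=sample_size)

# With replacement: the same tree may be drawn more than once
with_replacement = rng.choices(tree_ids, k=sample_size)

print(sorted(without_replacement))
print(sorted(with_replacement))
```

Any repeated serial numbers can only appear in the second list.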
Sampling Methods 2
• Systematic Random Sampling
  • Select a random start
  • Then select every kth member, where k is the sampling interval
• Often used in industrial processes
  • Can be more representative than simple random sampling
  • Can be less representative if the sampling list is structured or ordered
Sampling 2 - Minitab: Systematic Sampling
• Sadly we can’t do this in Minitab!
  • Paper, Excel or another spreadsheet is easy
1. Calculate the sampling interval:
   • Interval = Population size / Sample size
2. Select a random number, between 1 and the interval, as the first sample datum
   • Use a table or generate a Uniform Distribution random number list
3. Iteratively add the interval to the prior sample index until n samples have been selected
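The three steps above can be sketched in a few lines of Python. This is a minimal illustration: the function name `systematic_sample`, the population of 200, the sample of 20 and the seed are all assumptions chosen for the example.

```python
import random

def systematic_sample(population_size, sample_size, rng):
    """Return 1-based indices chosen by systematic sampling."""
    interval = population_size // sample_size   # step 1: whole-number sampling interval
    start = rng.randint(1, interval)            # step 2: random start within the first interval
    # step 3: keep adding the interval until sample_size indices are chosen
    return [start + i * interval for i in range(sample_size)]

rng = random.Random(7)  # arbitrary seed
indices = systematic_sample(population_size=200, sample_size=20, rng=rng)
print(indices)
```

When the population size is not an exact multiple of the sample size, rounding the interval down means the last few members can never be selected, so the interval is worth checking by hand.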
Sampling Methods 3: Stratified Sampling
• There are different methods for selecting stratum size
  • Distribution-matched – reflects the composition of the population (A)
  • Equal size – approximately the same number of members in each stratum (B)
• Select stratum members randomly
• Can be more representative than random sampling
• Useful method if differences between strata are important

Species   Tally   Percent   Stratum A    Stratum B
Beech     52      26%       7.8 ≈ 8      6
Birch     66      33%       9.9 ≈ 10     6
Elm       41      20.5%     6.15 ≈ 6     6
Oak       36      18%       5.4 ≈ 5      6
Yew       5       2.5%      0.75 ≈ 1     6

Minitab will create a stratified sample but it is fiddly. See the end of the slide pack for a Minitab Blog article and some screenshots.
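The stratum sizes in the table can be reproduced with a short calculation. The sketch below assumes a total sample size of 30, which is inferred from the fact that both stratum columns sum to 30; the variable names are illustrative.

```python
tally = {"Beech": 52, "Birch": 66, "Elm": 41, "Oak": 36, "Yew": 5}
sample_size = 30  # assumed: the Stratum A and Stratum B columns each sum to 30

population = sum(tally.values())

# Stratum A: distribution-matched (proportional) allocation, rounded to whole trees
stratum_a = {s: round(sample_size * n / population) for s, n in tally.items()}

# Stratum B: equal-size allocation
stratum_b = {s: sample_size // len(tally) for s in tally}

for s in tally:
    percent = 100 * tally[s] / population
    print(f"{s:<6} {tally[s]:>3} {percent:5.1f}%  A={stratum_a[s]:>2}  B={stratum_b[s]}")
```

Rounding each stratum separately happens to give exactly 30 here, but in general the rounded sizes may need a small manual adjustment to hit the intended total.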
Sampling Methods 4: Cluster Sampling
• Geographic method, best suited to sampling from multiple locations
• Use a random method to select a small number of locations
  • Divide locations into clusters if needed
• Choose a subsample from each of these sample locations
  • Randomly!
• Combine the subsamples into a single sample
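As a rough sketch of these steps in Python: the five locations, their member IDs, the choice of 2 locations and 5 members per location are all hypothetical values invented for this illustration.

```python
import random

rng = random.Random(11)  # arbitrary seed

# Hypothetical locations, each with its own list of member IDs
locations = {
    "North": [f"N{i}" for i in range(1, 41)],
    "South": [f"S{i}" for i in range(1, 41)],
    "East": [f"E{i}" for i in range(1, 41)],
    "West": [f"W{i}" for i in range(1, 41)],
    "Centre": [f"C{i}" for i in range(1, 41)],
}

# Step 1: randomly select a small number of locations (clusters)
chosen = rng.sample(sorted(locations), k=2)

# Step 2: randomly choose a subsample within each chosen location
subsamples = {name: rng.sample(locations[name], k=5) for name in chosen}

# Step 3: combine the subsamples into one sample
combined = [member for name in chosen for member in subsamples[name]]

print(chosen)
print(combined)
```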
Sampling Methods: Golden Rules
1. Avoid the use of judgement or convenience to select samples
2. Use a good source of random values
   1. Tables
   2. Computer
   3. Calculator
   4. Dice, well-shuffled deck of cards
3. Trade off between accuracy and sample size
   1. Sample size may be constrained, e.g. by cost, practicality, access etc.
More to come on sample sizes
Relationships Between Variables 1-3
• Sometimes we have multiple variables in a system
  • Lab experiment, data analysis, machine learning, traffic survey... Endless!
• Scientists are often interested in whether there are relationships between variables
  • Why do we look for relationships between variables?
• Here are a couple of tools to help explore multiple variables
  • Is there a relationship between variable A and variable B?
  • What kind of relationship?
  • How strong?
  • Can I use this to predict variable B behaviour?
Relationships Between Variables 4
• What relationship would you expect the following to have?
  • Positive or negative?
• Petrol price and miles driven
• Salt intake and blood pressure
• Number of completed Unit exercises and TMA scores
• Price of an item and number of that item sold
• Temperature and ice cream sales
Relationships Between Variables - Minitab
• Tool 1: Visual exploration – scatter plot
• Tool 2: Describing & predicting – least squares regression
Scatterplots with Minitab 1
How confident are you with using scatter plots?
[Figure: four example scatterplots (C2 vs C1, C3 vs C1, C5 vs C4 and C6 vs C4). In each plot the horizontal axis shows the explanatory (predictor) variable and the vertical axis shows the response.]
Response and Explanatory Variables 4
• Are TMA01 scores related to the total amount of time spent studying the course in weeks 1-6?
• Which is the response variable?
  • Explanatory variable: Time spent studying
  • Response variable: TMA01 scores
Scatterplots with Minitab 1-3
1. Select Graph -> Scatterplot…
2. Select Simple
3. Select your X and Y variables
   • Explanatory or Predictor on X
   • Response on Y
Line of Best Fit 1
• Sometimes a line can be fitted to a scatterplot, to help explain data more easily
• This line can also be used as a prediction tool
  • Machine learning!
• But which line has the best fit?
[Figure: scatterplot of points with three candidate lines labelled a, b and c]
Line of Best Fit 2
• Graph of achievement in maths against reading; the units are the average scores for 15-year-olds, by country (pisa.mtw)
• Where would you draw the regression line?
Regression 1
• A regression model is systematically fitted to all of the data points
• Different methods are used to calculate the distance from many theoretical lines to each point
  • These distances are the residuals
• The line with the smallest total residuals is selected
• Here we use a linear regression model and the least squares fitting method, which minimises the sum of the squared residuals
Regression 2
• Any straight line can be expressed as:
  • $y = mx + C$
• $m$ is the gradient or slope of the line
• $C$ is the intercept on the vertical axis
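To make the link between the straight-line equation and least squares concrete, here is a minimal Python sketch of the standard least squares formulas for the gradient and intercept. The function name `least_squares` and the five data points are made up for illustration; Minitab carries out an equivalent calculation when it fits a simple linear regression.

```python
def least_squares(xs, ys):
    """Return (m, C) for the least squares line y = m*x + C."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx              # gradient
    C = mean_y - m * mean_x    # intercept: the line passes through (mean_x, mean_y)
    return m, C

# Made-up data lying close to a straight line
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

m, C = least_squares(xs, ys)
residuals = [y - (m * x + C) for x, y in zip(xs, ys)]

print(f"y = {m:.2f}x + {C:.2f}")
print("Sum of squared residuals:", round(sum(r * r for r in residuals), 3))
```

Each residual is the vertical distance from a point to the fitted line; no other straight line gives a smaller sum of squared residuals for these points.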
Regression With Minitab 1-4
1. Select Fit Regression Model
2. Select the variables
   • Predictor = X axis
   • Response = Y axis
3. Click OK
4. Here is our $y = mx + C$
Regression With Minitab 5
Why might this be a poor prediction tool in some cases?
• Negative intercept suggests anything shorter than 3 m has a negative value.
Regression With Minitab 6-8
Adding a regression line
1. Select Graph -> Scatterplot
2. Select With Regression
3. Choose the X and Y variables
Regression With Minitab 8-10
Residuals
1. Select Scatterplot
2. Select X and Y variables
3. Select Graphs…
4. Select parameters as shown
OU Resources
• M140 materials online
  • Course Books & Screencasts
  • https://learn2.open.ac.uk/course/view.php?id=208584&area=resources
• M140 student forums
• OU Library e-books
  • https://pmt-eu.hosted.exlibrisgroup.com/permalink/f/gvehrt/TN_cdi_askewsholts_vlebooks_9781846281686
  • https://pmt-eu.hosted.exlibrisgroup.com/permalink/f/h21g24/44OPN_ALMA_DS51131243990002316
• Contact me:
  • [email protected]
  • 07311 188 800
Online Resources
• Wikipedia
• CrossValidated
  • https://stats.stackexchange.com/
• Minitab channel on YouTube:
  • https://www.youtube.com/user/MinitabInc
• Minitab help
  • https://support.minitab.com/en-us/minitab/19/
Thank you! Any questions?
Recording will be available from the M140-20J Online Tutorial Room:
https://learn2.open.ac.uk/mod/connecthosted/view.php?id=1644077&group=274133
Sampling With Minitab: Stratified Sampling
1. Split Worksheet by tree species
2. Create a random sample on each new worksheet for the stratum size, using the same destination column
3. Stack all the sub-sheets
4. Copy the stratified sample column to a new worksheet using Subset the Data
5. Set this condition
6. Sample will appear in new sheet
https://blog.minitab.com/blog/statistics-and-quality-improvement/taking-a-stratified-sample-in-minitab-statistical-software
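The split, sample and stack procedure can be mirrored in Python as a cross-check. This is a sketch under stated assumptions, not the Minitab workflow itself: the worksheet is rebuilt from the species tally with invented serial numbers, the stratum sizes are the distribution-matched (Stratum A) values, and the seed is arbitrary.

```python
import random

rng = random.Random(3)  # arbitrary seed

# Hypothetical worksheet rebuilt from the species tally: (tree ID, species) rows
tally = {"Beech": 52, "Birch": 66, "Elm": 41, "Oak": 36, "Yew": 5}
stratum_size = {"Beech": 8, "Birch": 10, "Elm": 6, "Oak": 5, "Yew": 1}  # Stratum A

trees = []
for species, count in tally.items():
    for _ in range(count):
        trees.append((len(trees) + 1, species))  # invented serial numbers 1..200

# 1. Split the rows by species
by_species = {}
for tree_id, species in trees:
    by_species.setdefault(species, []).append(tree_id)

# 2. Take a random sample of the required size within each stratum
# 3. Stack the per-species samples into one list
stratified = []
for species, ids in by_species.items():
    for tree_id in rng.sample(ids, k=stratum_size[species]):
        stratified.append((tree_id, species))

print(len(stratified), "trees sampled")
for tree_id, species in sorted(stratified):
    print(tree_id, species)
```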