Statistics 1 - Notes

  • 7/27/2019 Statistics 1 - Notes

    1/22

    Edexcel Notes S1

    1

    Liverpool F.C.

    Statistics 1

    Mathematical Model

    A mathematical model is a simplification of a real world problem.

    1. A real world problem is observed.
    2. A mathematical model is thought up.
    3. The model is used to make predictions, "What happens if...?"
    4. Real world data is collected.
    5. Predicted results are obtained.
    6. These are compared with statistical tests.
    7. Models are refined as required and then it's back to stage 3...

    Advantages of using mathematical models are:

    - They simplify a real world problem.
    - They improve our understanding of a real world problem.
    - They are quicker and cheaper.
    - They can be used to predict future outcomes.

    Disadvantages of using mathematical models are:

    - They only give a partial description of the real problem.
    - They only work for a restricted range of values.

    Stem and Leaf

    One of the simplest ways of ordering data and presenting data is to place it in a stem and leaf diagram.

    For example, take the following data:

    Person        1    2    3    4    5    6    7    8    9    10   11
    Weight (lb)   166  164  143  189  191  178  165  159  189  191  167
    Height (cm)   161  160  160  199  167  178  169  174  172  178  176

    Unordered Stem and Leaf        Ordered Stem and Leaf
    Height in cm                   Height in cm
    3 | 4 represents 34 cm.        3 | 4 represents 34 cm.

    15 |                           15 |
    16 | 1 0 0 7 9                 16 | 0 0 1 7 9
    17 | 8 4 2 8 6                 17 | 2 4 6 8 8
    18 |                           18 |
    19 | 9                         19 | 9

    http://www.thestudentroom.co.uk/member.php?u=307086

    As you can see from the key, the | divides tens from units. Stem and leaf diagrams can also be drawn back to back, if you have two sets of data to display.

    Using the data above:

    Weight in pounds                 Height in cm
    4 | 3 represents 34 lb.          3 | 4 represents 34 cm.

              3 | 14 |
              9 | 15 |
        7 6 5 4 | 16 | 0 0 1 7 9
              8 | 17 | 2 4 6 8 8
            9 9 | 18 |
            1 1 | 19 | 9

    Stem and leaf diagrams can give us an indication of the distribution. In this example, weight is spread much more widely than height. If we were comparing something like scores on two exams, we could compare the medians.
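    As a rough sketch of how an ordered stem and leaf diagram is built, the snippet below splits each value into a stem (tens) and a leaf (units) and sorts the leaves. The function name `stem_and_leaf` is just an illustrative choice, and the heights are the ones read off the stem and leaf diagram above.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (the tens) to its sorted list of leaves (the units)."""
    plot = defaultdict(list)
    for value in sorted(values):
        plot[value // 10].append(value % 10)
    return dict(plot)


# Heights in cm, as they appear in the stem and leaf diagram.
heights = [161, 160, 160, 199, 167, 178, 169, 174, 172, 178, 176]

plot = stem_and_leaf(heights)
# Print every stem in the range, including stems with no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

    Stems with no data (such as 18 here) are still printed, so that gaps in the distribution stay visible.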

    Frequency Tables

    Amount (x)       Frequency (f)
    0 ≤ x < 20       5
    20 ≤ x < 40      9
    40 ≤ x < 60      20
    60 ≤ x < 80      25
    80 ≤ x < 100     9

    Cumulative Frequency

    One way we can interpret the data is by working out the cumulative frequency. This simply means adding up the frequencies as you go along. Cumulative frequency is plotted against the upper class boundary. From the above example, we get:

    Amount (x)       Frequency (f)   Upper class boundary   Cumulative frequency
    0 ≤ x < 20       5               20                     5
    20 ≤ x < 40      9               40                     14 (5+9)
    40 ≤ x < 60      20              60                     34 (5+9+20)
    60 ≤ x < 80      25              80                     59 (5+9+20+25)
    80 ≤ x < 100     9               100                    68 (5+9+20+25+9)
    Total            68

    To check your cumulative frequency, you can add up the frequency column: the total should match the final cumulative frequency. Or the question will probably say something like, "a survey of 68 people..." and that's an even easier check.
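    The running total can be sketched in a few lines of Python. This is only an illustration using the frequencies from the table above; the variable names are arbitrary.

```python
from itertools import accumulate

# Upper class boundaries and frequencies from the table above.
upper_boundaries = [20, 40, 60, 80, 100]
frequencies = [5, 9, 20, 25, 9]

# accumulate() keeps a running total, which is exactly the cumulative frequency.
cumulative = list(accumulate(frequencies))

# These (upper boundary, cumulative frequency) pairs are the points to plot.
for boundary, cf in zip(upper_boundaries, cumulative):
    print(boundary, cf)

# The check described above: the last cumulative frequency equals the total.
print(cumulative[-1] == sum(frequencies))
```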

    When we have our cumulative frequency column, we can draw a cumulative frequency curve.


    Using this, we can also create a box plot. This is done by reading the quartiles off the y-axis and finding the corresponding x-values:

    Box plots are useful because they tell you lots of information: they show the median and the spread of the IQR, whether there are any outliers, and whether the data is symmetrical, positively skewed or negatively skewed.

    The IQR is a measure of spread.

    IQR = Q3 - Q1

    Outliers are extreme values. They are usually represented as a cross:


    They can be either too low or too high and are usually worked out using the formulas:

    Q1 - 1.5 x (IQR) (Anything less than this figure will be an outlier.)

    Q3 + 1.5 x (IQR) (Anything greater than this figure will be an outlier.)

    The exam question will always state how to work out the outliers though, so this is one thing you don't have to worry about remembering (just as long as you know how to use the formula).

    When you've identified the outliers, where does the end of the box plot occur? You can either use the next highest/lowest data value after the outlier, or use the value worked out from the formula.
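    A minimal sketch of the outlier check, assuming the usual 1.5 x IQR rule quoted above. The quartiles and data values here are hypothetical, chosen only to show the calculation, and the function names are illustrative.

```python
def outlier_limits(q1, q3, k=1.5):
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(data, q1, q3):
    """Return the values lying outside the outlier limits."""
    low, high = outlier_limits(q1, q3)
    return [x for x in data if x < low or x > high]


# Hypothetical quartiles and data, not taken from the notes.
q1, q3 = 20, 30
data = [2, 18, 22, 25, 27, 31, 48]

print(outlier_limits(q1, q3))       # (5.0, 45.0)
print(find_outliers(data, q1, q3))  # [2, 48]
```

    If the exam question gives a different multiplier, it can be passed to `outlier_limits` as `k`.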

    Linear Interpolation

    To work out the median, find the ½(n + 1)th value.

    For Q1 find the ¼(n + 1)th value and for Q3 find the ¾(n + 1)th value.

    Percentiles mean a percentage of the way through the cumulative frequency. To work out P12 (the 12th percentile), for example, find the 12/100 × (n + 1)th value.

    For a grouped frequency table, it can be difficult to calculate the median and quartiles exactly. There is a way of estimating an answer, however, and this is called linear interpolation.

    Time (sec)        Frequency   Cumulative Frequency   Class width
    0 ≤ x < 10        0           0                      10
    10 ≤ x < 15       8           8                      5
    15 ≤ x < 17.5     3           11                     2.5
    17.5 ≤ x < 20     7           18                     2.5
    20 ≤ x < 24       12          30                     4

    The first step is to find the ½(n + 1)th value. In this example, n = 30, so it is the 15.5th value, which lies in the 17.5 ≤ x < 20 class.

    We take away 11 (the cumulative frequency up to the start of that class) and then divide by 7 (the frequency of the class the 15.5th value is found in).

    Next we multiply by 2.5 (the class width of that class).

    Finally we add on 17.5 (the lower class boundary of that class) and the answer appears: 19.1.

    The only difference for the percentiles and other quartiles is replacing ½ by whatever fraction you want to find.
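    The steps above amount to one formula: lower class boundary + (position - cumulative frequency before the class) / class frequency × class width. A small sketch, with an illustrative function name and the numbers from the worked example:

```python
def interpolate(position, lower_boundary, cf_before, frequency, class_width):
    """Estimate the value at a given position within a class by linear interpolation."""
    return lower_boundary + (position - cf_before) / frequency * class_width


n = 30                          # total frequency in the table above
median_position = (n + 1) / 2   # 15.5, as used in the worked example

# The 15.5th value lies in the 17.5 <= x < 20 class:
# cumulative frequency before it is 11, class frequency 7, class width 2.5.
median = interpolate(median_position, lower_boundary=17.5, cf_before=11,
                     frequency=7, class_width=2.5)
print(round(median, 1))  # 19.1
```

    For a quartile or percentile, only `position` and the class details change.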

    Mean from frequency table

    It's easy enough to work out the mean from ordinary data, using the simple formula:

    $\bar{x} = \frac{\sum x}{n}$


    (In other words, add them all up and divide by how many there are.)

    Time (sec)   Frequency (f)
    0 - 9        0
    10 - 14      8
    15 - 17      3
    18 - 20      7
    21 - 24      12

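    For a grouped frequency table, you'll need to work out the midpoint of each class to use as the x value.

    Midpoint = (lower class boundary + upper class boundary) / 2

    The formula is:

    $\bar{x} = \frac{\sum fx}{\sum f}$

    Therefore, once you have the midpoint, you need to multiply f and x: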

    Time (sec)   Frequency (f)   Midpoint (x)   fx
    0 - 9        0               4.75           0
    10 - 14      8               12             96
    15 - 17      3               16             48
    18 - 20      7               19             133
    21 - 24      12              22.5           270
    Total        30                             547

    Add the fx column and then divide by the total of the frequency column to find the mean:

    $\bar{x} = \frac{\sum fx}{\sum f} = \frac{547}{30} = 18.2$ (3 s.f.)

    Standard Deviation

    For an ordinary set of data, the standard deviation is found by the following:

    $\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$

    (Variance is the same formula, but without the square root.)

    For a frequency table, or grouped frequency table, though, again we have a slightly different formula:

    $\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}$

    Taking the above as an example, we need to add an fx² column. Be careful with this. Notice only the x is squared: the column is f × x², not (fx)².

    Time (sec)   Frequency (f)   Midpoint (x)   fx     fx²
    0 - 9        0               4.75           0      0
    10 - 14      8               12             96     1152
    15 - 17      3               16             48     768
    18 - 20      7               19             133    2527
    21 - 24      12              22.5           270    6075
    Total        30                             547    10522

    Now add up the fx² and f columns, and write in the mean squared:

    $\sigma = \sqrt{\frac{10522}{30} - \left(\frac{547}{30}\right)^2}$

    Put all that into your calculator and you'll get the answer: 4.28 (3 s.f.)

    Coding

    When the numbers are too large to be reasonably worked with, there is another option for finding the mean. We can use coding. This replaces x (the midpoint) with y, which is connected to x by a formula that makes it a smaller number.

    Use the code y = (x - 45.5)/10 to calculate the mean and standard deviation of the following frequency table:

    x       Frequency (f)
    15.5    8
    25.5    12
    35.5    15
    45.5    16
    55.5    11
    65.5    6
    75.5    2

    We need to add the code column, work out y, and then add columns for fy and fy² rather than fx and fx²:

    x       Frequency (f)   y     fy     fy²
    15.5    8               -3    -24    72
    25.5    12              -2    -24    48
    35.5    15              -1    -15    15
    45.5    16              0     0      0
    55.5    11              1     11     11
    65.5    6               2     12     24
    75.5    2               3     6      18
    Total   70                    -34    188

    Next, work out the mean of y, using the formula:

    $\bar{y} = \frac{\sum fy}{\sum f} = \frac{-34}{70}$ = -0.486 (3 s.f.)

    We think back to the original code:

    $y = \frac{x - 45.5}{10}$

    If we replace y with $\bar{y}$ here, we can replace x with $\bar{x}$:

    $\bar{y} = \frac{\bar{x} - 45.5}{10}$

    Put in the numbers, and rearrange to make $\bar{x}$ the subject of the formula.

    $\bar{x} = 10\bar{y} + 45.5$ = 40.6 (3 s.f.) and that's your answer!

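    For standard deviation it's exactly the same method: find the standard deviation of y first. Now, if we think of the dispersion, adding and subtracting won't affect the standard deviation. Dividing and multiplying will, however, so here the standard deviation of x is 10 times the standard deviation of y.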
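    The coding method can be checked numerically. The sketch below assumes the code y = (x - 45.5)/10, which is what the y column of the table implies, and compares the decoded results with a direct calculation on x. The helper name `mean_and_sd` is illustrative.

```python
from math import sqrt, isclose

# Midpoints and frequencies from the table above.
x_values = [15.5, 25.5, 35.5, 45.5, 55.5, 65.5, 75.5]
freqs = [8, 12, 15, 16, 11, 6, 2]


def mean_and_sd(values, freqs):
    """Mean and standard deviation from a frequency table (sum f*v^2 / sum f - mean^2)."""
    n = sum(freqs)
    mean = sum(f * v for v, f in zip(values, freqs)) / n
    variance = sum(f * v * v for v, f in zip(values, freqs)) / n - mean ** 2
    return mean, sqrt(variance)


# Assumed code, inferred from the y column: y = (x - 45.5) / 10.
y_values = [(x - 45.5) / 10 for x in x_values]
mean_y, sd_y = mean_and_sd(y_values, freqs)

# Decode: the shift and the scale both affect the mean, only the scale affects the spread.
mean_x = 10 * mean_y + 45.5
sd_x = 10 * sd_y

# Direct calculation on x for comparison.
direct_mean, direct_sd = mean_and_sd(x_values, freqs)
print(round(mean_x, 1), round(direct_mean, 1))
print(round(sd_x, 1), round(direct_sd, 1))
```

    Both routes give the same mean and standard deviation, which is the point of coding: the arithmetic is done on the small y values and only decoded at the end.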

    Histograms

    Histograms are used for representing continuous data that has been summarised in a grouped frequency distribution.

    There are no gaps between the bars. The area of the bar is proportional to the frequency.

    Example:

    The height of twenty children (to the nearest cm) was recorded in the following frequency table. Draw a histogram to represent the data.

    Height (cm)   Frequency (f)
    120-124       1
    125-129       5
    130-134       7
    135-139       4
    140-149       3

    There are two columns that we need to add: the class width and the frequency density.


    Class width is the width of each group. Be careful to work it out from the lower class boundary and the upper class boundary. For example, 120-124 actually runs from 119.5 to 124.5, and so the class width is 124.5 - 119.5 = 5. The frequency density is the frequency divided by the class width.

    Height (cm)   Frequency (f)   Class Width   Frequency Density
    120-124       1               5             0.2
    125-129       5               5             1
    130-134       7               5             1.4
    135-139       4               5             0.8
    140-149       3               10            0.3

    When we have these values, we plot the lower class and upper class boundaries on the x axis and the frequency density on the y axis.
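    The two extra columns can be computed as follows. This sketch assumes heights recorded to the nearest cm, so each class boundary sits 0.5 either side of the stated limits; the function name is illustrative.

```python
# (lower limit, upper limit, frequency) for each class in the table above.
classes = [(120, 124, 1), (125, 129, 5), (130, 134, 7), (135, 139, 4), (140, 149, 3)]


def frequency_density(lower, upper, frequency):
    """Return (class width, frequency density) using class boundaries at +/- 0.5."""
    width = (upper + 0.5) - (lower - 0.5)
    return width, frequency / width


rows = [frequency_density(lower, upper, f) for lower, upper, f in classes]
for (lower, upper, f), (width, density) in zip(classes, rows):
    print(f"{lower}-{upper}: width {width}, frequency density {density}")
```

    Because the bar area is width × frequency density, each bar's area equals its frequency, which is why the wider 140-149 class gets a shorter bar.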

    Skewness

    From the histogram above, we see a slight positive skew: most of the values are bunched towards the lower end, with a longer tail towards the higher values. There are three types of distribution shape (positive skew, negative skew and symmetrical) and there are tests to differentiate between them:


    Positive Skew            Symmetrical              Negative Skew
    Mean > Median > Mode     Mean = Median = Mode     Mean < Median < Mode
    Q2 - Q1 < Q3 - Q2        Q2 - Q1 = Q3 - Q2        Q2 - Q1 > Q3 - Q2

    Correlation

    Correlation is a measure of the relationship between two variables. When we have two sets of data, we can draw a scatter diagram to see if there is any correlation between them.

    Data: The marks of 10 candidates in Maths and Physics are shown below:

    Candidate 1 2 3 4 5 6 7 8 9 10

    Physics (x) 18 20 30 40 46 54 60 80 88 92

    Maths (y) 42 54 60 54 62 68 80 66 80 100

    From the data, we can plot each x value against its corresponding y value, as on an ordinary graph. The only difference is that we don't join the crosses with a line:

    We can already see that it's positively correlated. A way to test this is to divide the graph into four quadrants, and then look at where the majority of the points lie:


    If most points lie in the 1st and 3rd quadrants, we have a positive correlation.

    If most points lie in the 2nd and 4th quadrants, we have a negative correlation.

    If points lie in all four quadrants randomly, we have no correlation.

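    However, just looking at the scatter diagram is a bit inaccurate. It's much better to calculate the strength of the correlation. There's a formula for this called the PMCC (Product Moment Correlation Coefficient):

    $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$

    How to calculate Sxy, Sxx and Syy:

    $S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$

    $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$

    $S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$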


    From the above information, we complete the following table:

    x      y      x²      y²       xy
    18     42     324     1764     756
    20     54     400     2916     1080
    30     60     900     3600     1800
    40     54     1600    2916     2160
    46     62     2116    3844     2852
    54     68     2916    4624     3672
    60     80     3600    6400     4800
    80     66     6400    4356     5280
    88     80     7744    6400     7040
    92     100    8464    10000    9200

    Σx = 528    Σy = 666    Σx² = 34464    Σy² = 46820    Σxy = 38640

    If you're lucky the question will already give you these figures, and all you'll be asked to do is use them.

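    Now using the PMCC formula:

    Sxy = 38640 - (528 × 666)/10 = 3475.2

    Sxx = 34464 - 528²/10 = 6585.6

    Syy = 46820 - 666²/10 = 2464.4

    r = 3475.2 / √(6585.6 × 2464.4) = 0.863 (3 s.f.)

    PMCC works so that -1 ≤ r ≤ 1, with -1 being perfect negative correlation, 0 being no correlation and +1 being perfect positive correlation. 0.863 is strong positive.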

    Even if we code the data, the PMCC remains the same.
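    As a check on the arithmetic, here is a short sketch that computes Sxx, Syy, Sxy and r from the raw marks, then repeats the calculation on coded data. The function names and the particular codes used are arbitrary choices for illustration.

```python
from math import sqrt, isclose

# Marks from the table above.
physics = [18, 20, 30, 40, 46, 54, 60, 80, 88, 92]   # x
maths = [42, 54, 60, 54, 62, 68, 80, 66, 80, 100]    # y


def summary_stats(xs, ys):
    """Return (Sxx, Syy, Sxy) using the sum-based formulas."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sxx, syy, sxy


def pmcc(xs, ys):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    sxx, syy, sxy = summary_stats(xs, ys)
    return sxy / sqrt(sxx * syy)


r = pmcc(physics, maths)
print(round(r, 3))  # 0.863

# Coding both variables (an arbitrary linear code) leaves r unchanged.
coded_x = [(x - 50) / 2 for x in physics]
coded_y = [(y - 40) / 10 for y in maths]
print(isclose(pmcc(coded_x, coded_y), r))
```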

    Least squares regression line

    The regression line of y on x has the form y = a + bx, where b = Sxy/Sxx and a = ȳ - b x̄.

    We can work out b easily enough from the data above:

    b = Sxy/Sxx = 3475.2/6585.6 = 0.528 (3 s.f.)

    ȳ = 666/10 = 66.6

    x̄ = 528/10 = 52.8

    a = 66.6 - 0.5277 × 52.8 = 38.7 (3 s.f.)

    So the regression line is y = 38.7 + 0.528x.

    If the question asks you to draw the regression line, an easy way is to plot the point (x̄, ȳ) on the scatter diagram, and then draw the line from the y-axis intercept through this point. The mean point always lies on the line.

    If the data is coded, we need to uncode when finding the mean.

    An independent (explanatory) variable is one that is set independently of the other variable. (Plotted on the x axis.)

    A dependent (response) variable is one whose values are determined by the values of the independent variable. (Plotted on the y axis.)

    Interpolation is when you estimate the value of a dependent variable within the range of the data.

    Extrapolation is when you estimate a value outside the range of the data. Values estimated by extrapolation can be unreliable.
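    The regression line for the Physics and Maths marks can be sketched with the standard least squares formulas b = Sxy/Sxx and a = ȳ - b x̄. The function name and the mark of 50 used for the prediction are illustrative assumptions.

```python
# Marks from the correlation example.
physics = [18, 20, 30, 40, 46, 54, 60, 80, 88, 92]   # x (explanatory)
maths = [42, 54, 60, 54, 62, 68, 80, 66, 80, 100]    # y (response)


def regression_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


a, b = regression_line(physics, maths)
print(round(a, 1), round(b, 3))  # 38.7 0.528

# Interpolation: 50 lies inside the observed Physics range (18 to 92).
predicted_maths = a + b * 50
print(round(predicted_maths, 1))

# The mean point (52.8, 66.6) lies on the line.
print(abs(a + b * 52.8 - 66.6) < 1e-9)
```

    Using the line for a Physics mark well outside 18 to 92 would be extrapolation, so the estimate should be treated with caution.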

    Probability

    If A is an event, the probability of it occurring is the number of ways A can occur, divided by the size of the sample space (the total number of outcomes, S).

    P(A) = n(A)/n(S)

    Probability is always 0 ≤ p ≤ 1.

    If you have a probability, P(A), the probability of not getting A is written as P(A'). To find P(A'), we merely take P(A) away from 1: P(A') = 1 - P(A).

    A ∩ B means A "intersection" B: all elements that are in A and in B. We can see this on a Venn diagram:


    A ∪ B means A "union" B: all elements that are in A or in B. On a Venn diagram this is:

    Addition Rule

    The addition rule for finding P(A ∪ B):

    P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

    We can rearrange this to get:

    P(A ∩ B) = P(A) + P(B) - P(A ∪ B)

    Example:

    There are 15 books on a bookshelf. 10 of these are fiction, 4 of which are hardback. 6, in total, are hardback and the remaining 9 are paperback.

    Find the probability that a hard-back fiction book is chosen at random.

    First stage is to draw a Venn diagram and write in all the numbers:


    We're looking for P(H ∩ F), so where is it both H and F? Where the two circles overlap, so 4/15.

    Find the probability that a hardback is chosen but is not fiction.

    We want P(H ∩ F'), which is 2/15.
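    The bookshelf example can be checked with exact fractions. This is a small sketch using the counts from the question; the variable names are illustrative, and the union is worked out with the addition rule.

```python
from fractions import Fraction

# Counts from the bookshelf example.
total = 15
n_fiction = 10
n_hardback = 6
n_hardback_and_fiction = 4

p_f = Fraction(n_fiction, total)
p_h = Fraction(n_hardback, total)
p_h_and_f = Fraction(n_hardback_and_fiction, total)

# Hardback but not fiction: the part of H outside the overlap.
p_h_not_f = p_h - p_h_and_f

# Addition rule: P(H or F) = P(H) + P(F) - P(H and F).
p_h_or_f = p_h + p_f - p_h_and_f

print(p_h_and_f, p_h_not_f, p_h_or_f)  # 4/15 2/15 4/5
```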

    Conditional Probability

    This occurs when the probability of A is conditional upon B having already occurred. Given B, find the probability of A. It's written as P(A|B).

    We use tree diagrams to solve conditional probability.

    Example:

    A bag contains 6 red and 4 blue balls. 2 balls are picked at random and retained.

    Find the probability that both balls are red.

    First, draw out a tree diagram.

    We want P(R ∩ R), so we just follow the tree diagram along:


    6/10 x 5/9 = 30/90 = 1/3.

    Find the probability that the balls are different colours.

    We want P(R ∩ B) and P(B ∩ R). Multiply along each branch and then add these together:

    P(R ∩ B) = 6/10 x 4/9 = 24/90

    P(B ∩ R) = 4/10 x 6/9 = 24/90

    P(different colours) = 24/90 + 24/90 = 48/90 = 8/15.

    Find the probability that the second ball is red, given the first is blue.

    We want P(R|B), so we use the formula:

    P(R|B) = P(B ∩ R) / P(B) = 24/90 ÷ 4/10 = 2/3.
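    The tree diagram calculations can be reproduced with exact fractions. This sketch follows the bag example (6 red, 4 blue, two balls drawn without replacement); the variable names are illustrative.

```python
from fractions import Fraction

red, blue = 6, 4
total = red + blue

# Multiply along the branches: the second draw is out of one fewer ball.
p_red_red = Fraction(red, total) * Fraction(red - 1, total - 1)
p_red_blue = Fraction(red, total) * Fraction(blue, total - 1)
p_blue_red = Fraction(blue, total) * Fraction(red, total - 1)

# Different colours: add the two branches that give one of each.
p_different = p_red_blue + p_blue_red

# Conditional probability: P(second red | first blue) = P(blue then red) / P(first blue).
p_first_blue = Fraction(blue, total)
p_red_given_blue = p_blue_red / p_first_blue

print(p_red_red, p_different, p_red_given_blue)  # 1/3 8/15 2/3
```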

    Independent Events

    Independent events are the opposite of conditional: one event doesn't affect the next. For example, if balls are taken from a bag and replaced, the probability of a red ball is the same no matter how many times you pick from the bag.

    This means:

    P(A ∩ B) = P(A) x P(B)

    If events are mutually exclusive, they cannot occur at the same time and P(A ∩ B) is 0.

    This means that:

    P(A ∪ B) = P(A) + P(B)

    Sample Space Diagram

    Example: A die is thrown twice and the scores obtained are added together. Find the probability that the total score is 6.


    There are 36 equally likely outcomes.

    5 of the outcomes result in a total of 6, so P(total = 6) = 5/36.

                  6 |  7   8   9  10  11  12
                  5 |  6   7   8   9  10  11
    First Throw   4 |  5   6   7   8   9  10
                  3 |  4   5   6   7   8   9
                  2 |  3   4   5   6   7   8
                  1 |  2   3   4   5   6   7
                    +-----------------------
                       1   2   3   4   5   6
                           Second Throw

    Discrete Random Variables

    A discrete random variable takes separate numerical values, each with its own probability, such as the "number on a fair die".

    The probability for a discrete random variable is written as P(X = x).

    Example: A tetrahedral die has the numbers 1, 2, 3, 4 on its faces. The die is biased in such a way that:

    P(X = x) = k for x = 1, 2, 3

    P(X = x) = 3k for x = 4

    If we draw this out in a probability distribution table we get:

    x          1    2    3    4
    P(X = x)   k    k    k    3k

    All the probabilities added together = 1.

    k(1 + 1 + 1 + 3) = 1

    6k = 1

    k = 1/6

    Therefore, we can write out the probability distribution:

    x          1     2     3     4
    P(X = x)   1/6   1/6   1/6   1/2

    We can also find the cumulative distribution, the F(x):


    x    P(X = x)   F(x)
    1    1/6        1/6
    2    1/6        2/6
    3    1/6        3/6
    4    3/6        1

    The cumulative probability always reaches 1 at the largest value.

    P(X ≤ 2) means the probability of getting an X value less than or equal to 2. We add up the probabilities we have, and so, in the above example, P(X ≤ 2) = 1/6 + 1/6 = 1/3.

    F(x) means P(X ≤ x), so F(2) = 1/3.

    If a question asks you something like F(3.5), in our example 3.5 doesn't exist. Therefore, we do F(3) instead, which would be 1/2.
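    The biased tetrahedral die can be worked through with exact fractions. In this sketch the unknown constant is called k (an assumed name), with the probabilities for 1, 2, 3 equal to k and the probability for 4 equal to 3k, as the sum 1 + 1 + 1 + 3 in the working implies.

```python
from fractions import Fraction
from itertools import accumulate

# Each probability as a multiple of the unknown constant k.
multiples_of_k = {1: 1, 2: 1, 3: 1, 4: 3}

# The probabilities must add up to 1, so k = 1 / (1 + 1 + 1 + 3).
k = Fraction(1, sum(multiples_of_k.values()))
pmf = {x: m * k for x, m in multiples_of_k.items()}

# Cumulative distribution at each value the die can show.
cdf_table = dict(zip(pmf, accumulate(pmf.values())))


def F(value):
    """Cumulative distribution function P(X <= value), valid for any real value."""
    return sum((p for x, p in pmf.items() if x <= value), Fraction(0))


print(k)        # 1/6
print(F(2))     # 1/3
print(F(3.5))   # 1/2, the same as F(3)
print(F(4))     # 1
```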

    Mean and Variance

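    Finding the mean and variance is almost identical to finding the mean and variance from a frequency table.

    The formula for the mean:

    $E(X) = \sum x P(X = x)$

    For variance, we have the formula:

    $Var(X) = E(X^2) - [E(X)]^2$

    To find E(X²):

    $E(X^2) = \sum x^2 P(X = x)$

    Example:

    X is a discrete random variable with the following distribution.

    x       P(X = x)   xP(X = x)   x²P(X = x)
    0       0.4        0           0
    1       0.5        0.5         0.5
    2       0.1        0.2         0.4
    Total              0.7         0.9

    Therefore, E(X) = 0.7, E(X²) = 0.9 and Var(X) = 0.9 - 0.7² = 0.41.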

    Suppose Y is the random variable given by Y = 3X - 2, coding the above table. The table would now look like this:

    y       P(Y = y)   yP(Y = y)   y²P(Y = y)
    -2      0.4        -0.8        1.6
    1       0.5        0.5         0.5
    4       0.1        0.4         1.6
    Total              0.1         3.7

    So E(Y) = 0.1 and Var(Y) = 3.7 - 0.1² = 3.69.

    Remember the code:

    Y = 3X - 2

    To decode back:

    E(X) = (E(Y) + 2)/3 = (0.1 + 2)/3 = 0.7

    Var(X) = Var(Y)/3² = 3.69/9 = 0.41

    In general:

    E(aX + b) = aE(X) + b

    Var(aX + b) = a²Var(X)
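    A quick numerical check of the coding rules, using the distribution from the example. The code Y = 3X - 2 is an assumption inferred from the coded values -2, 1 and 4, and the function names are illustrative. Exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def expectation(dist):
    """E(X) = sum of x * P(X = x)."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var(X) = E(X^2) - [E(X)]^2."""
    mean = expectation(dist)
    return sum(x * x * p for x, p in dist.items()) - mean ** 2


# Distribution of X from the example: P(0) = 0.4, P(1) = 0.5, P(2) = 0.1.
X = {0: Fraction(2, 5), 1: Fraction(1, 2), 2: Fraction(1, 10)}

# Assumed code Y = 3X - 2: the values change, the probabilities stay the same.
a, b = 3, -2
Y = {a * x + b: p for x, p in X.items()}

print(float(expectation(X)), float(variance(X)))   # 0.7 0.41
print(float(expectation(Y)), float(variance(Y)))   # 0.1 3.69

# The general rules: E(aX + b) = aE(X) + b and Var(aX + b) = a^2 Var(X).
print(expectation(Y) == a * expectation(X) + b)
print(variance(Y) == a ** 2 * variance(X))
```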

    A discrete uniform distribution is where each value of the random variable has the same probability. For example, when X is the score on a fair 6-sided die, each probability would be 1/6.

    For a discrete uniform distribution over the values 1, 2, 3, ..., n:

    E(X) = (n + 1)/2

    Var(X) = (n + 1)(n - 1)/12

    Example: A tetrahedral die has its faces numbered 1, 2, 3 and 4. X is the score obtained when the die is rolled.


    X therefore has a uniform distribution, with P(X = x) = 1/4 for x = 1, 2, 3, 4.

    E(X) = (n + 1)/2 = (4 + 1)/2 = 2.5

    The Normal Distribution

    - Symmetrical about the mean.
    - Total area under the curve = 1.
    - Probabilities correspond to the area.
    - A continuous distribution (therefore there is no difference between P(X < a) and P(X ≤ a)).
    - 68% of the distribution lies within 1 standard deviation of the mean.
    - 95% of the distribution lies within 2 standard deviations of the mean.
    - 99.7% of the distribution lies within 3 standard deviations of the mean.
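    The 68%, 95% and 99.7% figures can be checked with Python's standard library, which provides `statistics.NormalDist` (Python 3.8 or later). This is only a sketch of the check, not part of the exam method, where tables are used instead.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area within k standard deviations of the mean is P(-k < Z < k).
areas = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} sd: {area:.4f}")
```

    The printed areas are roughly 0.68, 0.95 and 0.997, matching the percentages quoted above.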

    Examples:

    - The masses of new born babies.
    - IQ of school students.
    - Hand span of adult females.
    - Height of plants growing in a field.


    Working out Probabilities using tables.


    - If P(Z < a) is greater than 0.5, then a will be > 0.
    - If P(Z < a) is less than 0.5, then a will be < 0.
    - If P(Z > a) is less than 0.5, then a will be > 0.
    - If P(Z > a) is more than 0.5, then a will be < 0.


    Standardizing

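    If X ~ N(μ, σ²) and Z = (X - μ)/σ then:

    Z ~ N(0, 1)

    Example: to find a probability such as P(X < x) when X ~ N(μ, σ²):

    The first step is to standardise: z = (x - μ)/σ. Then P(X < x) = P(Z < z), which can be looked up in the tables.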

    Working Backwards

    Example: If X ~ N(μ, σ²), find the value of x if P(X < x) = p.

    To find x, we start by finding the standardised value z such that P(Z < z) = p.

    From tables we read off the value of z.

    We therefore need to find the value x that standardises to make z, by rearranging the formula: x = μ + zσ.
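    Both directions can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8 or later) standing in for the tables. The distribution N(50, 4²), the value 56 and the probability 0.975 are hypothetical numbers chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical distribution, not taken from the notes: X ~ N(50, 4^2).
mu, sigma = 50, 4
standard_normal = NormalDist()  # Z ~ N(0, 1)

# Standardising: P(X < 56) = P(Z < (56 - 50) / 4) = P(Z < 1.5).
x = 56
z = (x - mu) / sigma
p = standard_normal.cdf(z)
print(z, round(p, 4))

# Working backwards: find x such that P(X < x) = 0.975.
target = 0.975
z_back = standard_normal.inv_cdf(target)   # the value read from tables, about 1.96
x_back = mu + z_back * sigma               # rearranged from z = (x - mu) / sigma
print(round(z_back, 2), round(x_back, 2))
```

    Here `inv_cdf` plays the role of reading the table backwards, and the last line is the rearranged standardising formula.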

    Examination style question: A machine is designed to fill jars of coffee so that the contents, X, follow a normal distribution with mean μ grams and standard deviation σ grams.

    Given two probability statements about X, find μ and σ correct to 3 significant figures.


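    Firstly, the first given value standardises to z = 1.96, so:

    μ + 1.96σ = first value

    Secondly, we are told enough to standardise the second value to z = -1.75, so:

    μ - 1.75σ = second value

    The two equations are:

    μ + 1.96σ = first value

    μ - 1.75σ = second value

    Subtract to eliminate μ:

    3.71σ = first value - second value

    This gives σ, and substituting back into either equation gives μ.

    So the solutions are μ and σ, both in grams and quoted to 3 s.f.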
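    The elimination step can be sketched in code. The z-values 1.96 and -1.75 come from the working above, but the two jar contents used here (210 g and 198 g) are made-up values chosen only to illustrate the method, and the function name is hypothetical.

```python
from math import isclose


def solve_mu_sigma(x_upper, z_upper, x_lower, z_lower):
    """Solve mu + z_upper*sigma = x_upper and mu + z_lower*sigma = x_lower."""
    # Subtracting the equations eliminates mu.
    sigma = (x_upper - x_lower) / (z_upper - z_lower)
    # Substitute sigma back into the first equation.
    mu = x_upper - z_upper * sigma
    return mu, sigma


# Hypothetical values: 210 g standardises to 1.96 and 198 g to -1.75.
mu, sigma = solve_mu_sigma(210, 1.96, 198, -1.75)
print(f"mu = {mu:.3g} g, sigma = {sigma:.3g} g")

# Both original equations are satisfied.
print(isclose(mu + 1.96 * sigma, 210), isclose(mu - 1.75 * sigma, 198))
```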