  • 7/29/2019 Data Description.pdf

    Data Description

    MTK3006

    Department of Mathematics

    Faculty of Science and Technology

    Universiti Malaysia Terengganu

    [email protected]

    MTK3006 Statistics for Chemists Data Description

    Part I

    Basic Terms

    Population

    A population is a collection of all subjects or objects of interest.

    Sample

    A sample is a portion or part of the population of interest.

    Variable

    A variable is a characteristic or attribute that can assume different values.

    Data

    The values that a variable can assume are called data.

    Data set

    A collection of data values or measurements forms a data set.

    Types of data

    Quantitative data is a numerical measurement expressed in terms of numbers.

    Qualitative data is a categorical measurement expressed by means of a natural language description.

    Parameter

    A parameter is a characteristic or measure obtained by using all the data values from a population.

    Statistic

    A statistic is a characteristic or measure obtained by using the data values from a sample.

    Statistics

    Statistics is the science of collecting, organizing, summarizing, analyzing, and interpreting data.

    Areas of statistics

    The branch of statistics devoted to the organization, summarization, description, and presentation of data sets is called descriptive statistics.

    The branch of statistics concerned with using sample data to draw conclusions about a population is called inferential statistics.

    Part II

    Describing Data with Tables

    Data collected in original form is called raw data.

    A frequency distribution is the organization of raw data in table form using classes and frequencies.

    There are three types of frequency distributions:

    Categorical frequency distributions

    Ungrouped frequency distributions

    Grouped frequency distributions

    Categorical Frequency Distribution

    Can be used for data that can be placed in specific categories.

    Examples: political affiliation, religious affiliation, blood type, etc.

    Example: Blood Type Data

    A, B, B, AB, O, O, O, B, AB, B, B, B, O, A, O, A, O, O, O, AB, AB, A, O, B, A

    Blood Type Frequency Distribution

    Class   Frequency   Percent
    A       5           20
    B       7           28
    O       9           36
    AB      4           16
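
    The table above can be reproduced by counting each category and dividing by the total. The following is a minimal Python sketch, not part of the original slides; the function name categorical_frequency is an illustrative assumption.

```python
from collections import Counter

# Blood type observations from the example above.
blood_types = "A,B,B,AB,O,O,O,B,AB,B,B,B,O,A,O,A,O,O,O,AB,AB,A,O,B,A".split(",")


def categorical_frequency(values):
    """Return {class: (frequency, percent)} for categorical data."""
    counts = Counter(values)
    total = len(values)
    return {cls: (freq, 100 * freq / total) for cls, freq in counts.items()}


table = categorical_frequency(blood_types)
for cls in ("A", "B", "O", "AB"):
    freq, pct = table[cls]
    print(f"{cls:<4}{freq:>3}{pct:>6.0f}")
```

    The percent column is simply the frequency divided by the number of observations (25 here), multiplied by 100.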

    Ungrouped Frequency Distribution

    Can be used for data that can be enumerated and when the range of values in the data set is not large.

    Examples: number of kilometers your instructors have to travel from home to campus, number of girls in a 4-child family, etc.

    Example: Number of Kilometers Travelled: 8, 5, 6, 5, 5, 7, 7

    Number of Kilometers Travelled

    Class   Frequency
    5       3
    6       1
    7       2
    8       1

    Grouped Frequency Distribution

    Can be used when the range of values in the data set is very large.

    Class limits represent the smallest and largest data values that can be included in a class. The smallest and largest possible data values in a class are the lower and upper class limits.

    Class boundaries separate the classes. To find a class boundary, average the upper class limit of one class and the lower class limit of the next class.

    The class width is found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class.

    The class midpoint can be calculated by averaging the upper and lower class limits.
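
    As a small worked illustration of these definitions, the sketch below uses the adjacent classes 100 - 104 and 105 - 109, which appear in the grouped example of this deck. The variable names are illustrative assumptions.

```python
# Class limits of two adjacent classes.
lower_1, upper_1 = 100, 104
lower_2, upper_2 = 105, 109

# Class boundary: average of one upper limit and the next lower limit.
boundary = (upper_1 + lower_2) / 2

# Class width: difference between consecutive lower limits.
width = lower_2 - lower_1

# Class midpoint: average of the lower and upper limits of one class.
midpoint_1 = (lower_1 + upper_1) / 2

print(boundary, width, midpoint_1)
```

    With these limits the boundary is 104.5, the width is 5, and the midpoint of the first class is 102.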

    Rules for classes

    There should be 5-20 classes.

    The class width should be an odd number.

    The classes must not overlap.

    The classes must not have breaks.

    The classes must include all the data values.

    The classes must be equal in width.

    To construct a grouped frequency distribution:

    Find the highest and lowest values.

    Find the range.

    Choose the number of classes.

    Find the class width by dividing the range by the number of classes and rounding up.

    Choose a starting point (usually the lowest value); add the class width to get all the lower limits.

    Find the upper class limits.

    Find the class boundaries.

    Find the frequencies and the cumulative frequencies.

    Construct a grouped frequency distribution using 7 classes.

    112 100 127 120 134 118 105 110 109 112 110
    118 117 116 118 122 114 114 105 109 107 112
    114 115 118 117 118 122 106 110 116 108 110
    121 113 120 119 111 104 111 120 113 120 117
    105 110 118 112 114 114

    Class Limits   Class Boundaries   Frequency   Cumulative Frequency
    100 - 104      99.5 - 104.5       2           2
    105 - 109      104.5 - 109.5      8           10
    110 - 114      109.5 - 114.5      18          28
    115 - 119      114.5 - 119.5      13          41
    120 - 124      119.5 - 124.5      7           48
    125 - 129      124.5 - 129.5      1           49
    130 - 134      129.5 - 134.5      1           50
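
    The construction steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the course's own code: the data are whole numbers, the starting point is the lowest value, and the range is not an exact multiple of the number of classes (otherwise the largest value would fall outside the last class). The function name grouped_frequency is hypothetical.

```python
import math

temps = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112, 110,
    118, 117, 116, 118, 122, 114, 114, 105, 109, 107, 112,
    114, 115, 118, 117, 118, 122, 106, 110, 116, 108, 110,
    121, 113, 120, 119, 111, 104, 111, 120, 113, 120, 117,
    105, 110, 118, 112, 114, 114,
]


def grouped_frequency(data, num_classes):
    """Return (lower, upper, frequency, cumulative) rows for whole-number data."""
    lowest, highest = min(data), max(data)
    # Class width: range divided by the number of classes, rounded up.
    width = math.ceil((highest - lowest) / num_classes)
    rows, cumulative = [], 0
    for k in range(num_classes):
        lower = lowest + k * width
        upper = lower + width - 1  # upper class limit for whole-number data
        freq = sum(lower <= x <= upper for x in data)
        cumulative += freq
        rows.append((lower, upper, freq, cumulative))
    return rows


rows = grouped_frequency(temps, 7)
for lower, upper, freq, cum in rows:
    # Boundaries lie half a unit outside the class limits.
    print(f"{lower} - {upper}   {lower - 0.5} - {upper + 0.5}   {freq}   {cum}")
```

    Here the range is 134 - 100 = 34, so the width is 34 / 7 rounded up to 5, which matches the class limits in the table.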

    Part III

    Measures of Central Tendency

    Given a set of data, we often would like to have one number that is representative of a population or sample.

    There are several standard ways to measure the center:

    Mean: the average of the data set

    Median: the midpoint of the data set

    Mode: the value that occurs most often in the data set

    The Mean

    Denote by $x_i$ the $i$th observed data value in the population or sample.

    Denote by $N$ and $n$ the population and sample sizes, respectively.

    The population mean is the sum of all the population values divided by the total number of population values:

    \[ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i. \]

    The sample mean is the sum of all the sample values divided by the number of sample values:

    \[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \]

    Find the sample mean of 20, 26, 40, 36, 23, 42, 35, 24, 30.

    Answer: $\bar{x} \approx 30.67$
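
    The exercise can be checked with a few lines of Python. This is a minimal sketch; the function name sample_mean is an illustrative assumption.

```python
def sample_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


data = [20, 26, 40, 36, 23, 42, 35, 24, 30]
x_bar = sample_mean(data)
print(round(x_bar, 2))
```

    The sum of the nine values is 276, so the mean is 276 / 9, which rounds to 30.67.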

    The Median

    The median is the middle value, or the average of the middle two values, of a population or sample, when the data values are arranged from smallest to largest.

    The median will be one of the data values if there is an odd number of values.

    The median will be the average of two data values if there is an even number of values.

    Find the median of 684, 764, 656, 702, 856, 1133, 1132, 1303.

    Answer: Median = 810
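
    A short Python sketch of the two cases (odd and even number of values) follows; the function name median is an illustrative choice rather than anything prescribed by the slides.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [684, 764, 656, 702, 856, 1133, 1132, 1303]
print(median(data))
```

    With eight values the two middle values after sorting are 764 and 856, whose average is 810.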

    The Mode

    The mode is the value in the population or sample that occurs most frequently.

    It is sometimes said to be the most typical case.

    There may be no mode, one mode (unimodal), two modes (bimodal), or many modes (multimodal).

    Find the mode of 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10.

    Find the mode of 104, 104, 104, 104, 104, 107, 109, 109, 109, 110, 109, 111, 112, 111, 109.
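
    Both exercises can be worked by counting occurrences. The sketch below is one possible approach; it assumes the convention that a data set in which every value occurs exactly once has no mode, and the function name modes is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return every value with the highest count, or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # assumed convention: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == top)


first = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
second = [104, 104, 104, 104, 104, 107, 109, 109, 109, 110,
          109, 111, 112, 111, 109]

print(modes(first))
print(modes(second))
```

    The first data set is unimodal (10 occurs three times); the second is bimodal, since 104 and 109 each occur five times.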

    Properties of the Mean

    Uses all data values.

    The sample mean varies less from sample to sample than the sample median or mode.

    Used in computing other statistics, such as the variance.

    Unique, and usually not one of the data values.

    Affected by extremely high or low values, called outliers.

    Properties of the Median

    Gives the midpoint.

    Used when it is necessary to find out whether the data values fall into the upper half or lower half of the data set.

    Affected less than the mean by extremely high or extremely low values.

    Properties of the Mode

    Used when the most typical case is desired.

    Easiest to compute.

    Not always unique, and may not exist.

    Measures of Dispersion

    Dispersion refers to the spread or variability in a data set.

    Measures of dispersion include range, variance, standard deviation, etc.

    The Range

    The range is the difference between the highest and lowest values of a population or sample.

    Two experimental brands of outdoor paint are tested to see how long each will last before fading. Six cans of each brand constitute a small population. The results (in months) are:

    Brand A   Brand B
    10        35
    60        45
    50        30
    30        35
    40        40
    20        25

    The population mean for both brands is the same.

    Which brand would you buy?
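
    To make the comparison concrete, the following Python sketch computes the mean and the range for each brand. The helper names are illustrative assumptions.

```python
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]


def mean(values):
    return sum(values) / len(values)


def value_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)


for name, data in (("A", brand_a), ("B", brand_b)):
    print(name, mean(data), value_range(data))
```

    Both brands last 35 months on average, but the range is 50 months for Brand A and 20 months for Brand B, so Brand B performs more consistently.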

    The Variance

    The variance is the average of the squares of the distance each value is from the mean.

    The population variance is

    \[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2. \]

    The sample variance is

    \[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \]

    This formula for $s^2$ makes a better estimator of $\sigma^2$ than if we had divided by $n$.
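
    The two formulas differ only in the divisor. The sketch below applies both to the paint data from the range example; the function names are illustrative assumptions, and treating the same six values as a sample is done purely to show the effect of dividing by n - 1.

```python
def population_variance(values):
    """Average squared distance from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared distances from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

print(population_variance(brand_a), population_variance(brand_b))
print(sample_variance(brand_a), sample_variance(brand_b))
```

    For Brand A the squared deviations sum to 1750, giving a population variance of 1750 / 6; for Brand B they sum to 250, giving 250 / 6.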

    The Standard Deviation

    The standard deviation is the square root of the variance.

    The population standard deviation is $\sigma$.

    The sample standard deviation is $s$.

    The standard deviation is measured in the same unit as the measurements in the population or sample.

    A large standard deviation indicates that the data values are far from the mean, whereas a small standard deviation indicates that they are clustered closely around the mean.

    Alternate Formula for the Sample Standard Deviation

    \[ s = \sqrt{ \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] } \]

    Saves time when calculating by hand.

    Does not use the sample mean.

    Find the sample standard deviation of 11.2, 11.9, 12.0, 12.8, 13.4, 14.3.

    Answer: $s \approx 1.13$
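
    The sketch below computes the sample standard deviation both from the definition and from the alternate formula, to show that they agree on the exercise data. The function names are illustrative assumptions.

```python
import math


def sample_sd_definition(values):
    """Square root of the sample variance, using deviations from the mean."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


def sample_sd_shortcut(values):
    """Alternate formula: needs only the sum and the sum of squares."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))


data = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
s_def = sample_sd_definition(data)
s_short = sample_sd_shortcut(data)
print(round(s_def, 2), round(s_short, 2))
```

    By hand: the sum is 75.6 and the sum of squares is 958.94, so the bracketed term is 958.94 - 75.6^2 / 6 = 6.38, and s is the square root of 6.38 / 5. In floating-point arithmetic the alternate formula can lose precision when the values are large relative to their spread, so the definition is usually the safer choice in software.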

    Measures of Position

    Measures of position or location are used to locate the relative position of a data value in the data set.

    These measures include:

    z-score

    quartiles

    outlier

    The z-score

    A z-score or standard score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation.

    The formula for the population (or sample) z-score is

    \[ z = \frac{x - \mu}{\sigma} \quad \text{or} \quad z = \frac{x - \bar{x}}{s}. \]

    A z-score represents the number of standard deviations a value is above or below the mean.
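
    As an illustration, the sketch below computes the sample z-score of the largest value in the standard deviation exercise (11.2, 11.9, 12.0, 12.8, 13.4, 14.3). The choice of value and the function name z_score are assumptions made for this example only.

```python
import math


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


sample = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
n = len(sample)
x_bar = sum(sample) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))

z = z_score(14.3, x_bar, s)
print(round(z, 2))
```

    The sample mean is 12.6 and s is about 1.13, so 14.3 lies roughly one and a half standard deviations above the mean.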

    The Quartiles

    Quartiles separate the data set into 4 equal groups.

    The first quartile ($Q_1$) is the value that lies 25% of the way up from the smallest value.

    The second quartile ($Q_2$) is the value that lies 50% of the way up from the smallest value, and is equivalent to the median.

    The third quartile ($Q_3$) is the value that lies 75% of the way up from the smallest value.

    The interquartile range (IQR) is the difference between the upper and lower quartiles, i.e., $\mathrm{IQR} = Q_3 - Q_1$.

    The Outlier

    An outlier is an extremely high or low data value when compared with the rest of the data values.

    A data value less than $Q_1 - 1.5 \times \mathrm{IQR}$ or greater than $Q_3 + 1.5 \times \mathrm{IQR}$ can be considered an outlier.
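
    The rule can be sketched in Python as below, applied to the 50 values from the grouped frequency example. Several quartile conventions exist; this sketch assumes the median-of-halves convention (Q1 and Q3 are the medians of the lower and upper halves of the sorted data), and the function names are hypothetical.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2  # the middle value is excluded when n is odd
    return median(ordered[:half]), median(ordered), median(ordered[-half:])


def outliers(values):
    """Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


temps = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112, 110,
    118, 117, 116, 118, 122, 114, 114, 105, 109, 107, 112,
    114, 115, 118, 117, 118, 122, 106, 110, 116, 108, 110,
    121, 113, 120, 119, 111, 104, 111, 120, 113, 120, 117,
    105, 110, 118, 112, 114, 114,
]

print(quartiles(temps))
print(outliers(temps))
```

    Under this convention Q1 = 110 and Q3 = 118, so IQR = 8 and the fences are 98 and 130; the value 134 lies above the upper fence.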

    Part IV

    Describing Data with Graphs

    Graphs used for qualitative data

    Bar charts

    Pareto charts

    Graphs used for quantitative data

    Histograms

    Frequency polygons

    Stem and leaf plots

    Box plots

    Time series plots

    Bar Chart

    A bar chart is a chart with rectangular bars.

    The bars can be plotted vertically or horizontally.

    Example: Modes of Transportation to Work

    The vertical scale shows frequencies.

    The horizontal scale shows categories.

    [Figure: bar chart "How people get to work"; vertical axis: People, 0 to 30; horizontal axis: Car, Bus, Train, Walk]

    Pareto Chart

    A Pareto chart can be used to represent a categorical frequency distribution. It is a bar chart arranged in descending order of height from left to right.

    [Figure: Pareto chart "How people get to work"; vertical axis: People, 0 to 30; horizontal axis in descending order of frequency: Car, Train, Bus, Walk]

    Histogram

    The histogram is a graph that displays quantitative data by using vertical bars of various heights to represent the frequencies of the classes.

    The histogram is similar to the bar chart, but it is drawn without gaps between the bars.

    The class boundaries are represented on the horizontal axis.

    [Figure: histogram "Record High Temperatures"; horizontal axis: Temperature (°F), class boundaries 99.5, 104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5; vertical axis: Frequency, 0 to 18]

    Frequency Polygon

    The frequency polygon is a graph that displays quantitative data by using lines that connect points plotted for the frequencies at the class midpoints.

    The frequencies are represented by the heights of the points.

    The class midpoints are represented on the horizontal axis.

    [Figure: frequency polygon "Record High Temperatures"; horizontal axis: Temperature (°F), class midpoints 102, 107, 112, 117, 122, 127, 132; vertical axis: Frequency, 0 to 18]

    Stem and Leaf Plot

    A stem and leaf plot is a data plot that uses part of a data value as the stem and part of the data value as the leaf to form groups or classes.

    In a stem and leaf plot, each data value is split into a stem and a leaf.

    The leaf is usually the last digit of the data value and the other digits to the left of the leaf form the stem.

    For example, the number 123 would be split as:

    stem 12, leaf 3

    The stems are listed on the left and the corresponding leaves on the right.

    Construct a stem and leaf plot.

    25 31 20 32 13
    14 43  2 57 23
    36 32 33 32 44
    32 52 44 51 45

    0 | 2
    1 | 3 4
    2 | 0 3 5
    3 | 1 2 2 2 2 3 6
    4 | 3 4 4 5
    5 | 1 2 7
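
    The plot can be generated by splitting each value into its tens part (stem) and units digit (leaf). The following is a minimal sketch that assumes non-negative whole numbers; the function name stem_and_leaf is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted list of leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 123 -> stem 12, leaf 3
        plot[stem].append(leaf)
    return dict(plot)


data = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
        36, 32, 33, 32, 44, 32, 52, 44, 51, 45]

plot = stem_and_leaf(data)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

    Sorting the data first means the leaves in each row come out in increasing order, as in the plot above.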

    Box Plot

    A box plot is a graph that presents information from a five-number summary.

    The five-number summary is composed of the minimum, $Q_1$, median, $Q_3$, and maximum.

    The five-number summary can be graphically represented by using a box plot.

    To construct a box plot:

    Find the five-number summary.

    Draw a horizontal axis with a scale that includes the maximum and minimum data values.

    Draw a box with vertical sides through $Q_1$ and $Q_3$, and draw a vertical line through the median.

    Draw a line from the minimum data value to the left side of the box and a line from the maximum data value to the right side of the box.

    Construct a box plot for the data: 89, 47, 164, 296, 30, 215, 138, 78, 48, 39

    Five-number summary: 30, 47, 83.5, 164, 296

    [Figure: box plot on a horizontal axis from 0 to 300; whiskers end at 30 and 296; the box runs from 47 to 164 with the median line at 83.5]
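
    The five-number summary for this exercise can be checked with the sketch below. It assumes the median-of-halves convention for the quartiles, which reproduces the values quoted above; other quartile conventions can give slightly different Q1 and Q3. The function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    half = len(ordered) // 2  # the middle value is excluded when n is odd
    return (
        ordered[0],
        median(ordered[:half]),
        median(ordered),
        median(ordered[-half:]),
        ordered[-1],
    )


data = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
summary = five_number_summary(data)
print(summary)
```

    Sorted, the data are 30, 39, 47, 48, 78, 89, 138, 164, 215, 296, so the median is the average of 78 and 89, and the quartiles are the middle values of the lower and upper five.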

    Time Series Plot

    A time series plot represents data that occur over a specific period of time.

    It is a line graph where the time is represented on the horizontal axis and the quantity that varies over time is represented on the vertical axis.

    [Figure: time series plot "Temperature over a 9-Hour Period"; horizontal axis: Time; vertical axis: Temperature (°F), 35 to 60]

    Part V

    R and R Commander

    R

    A language and environment for statistical computing and graphics

    Available as free software at http://www.r-project.org/

    A command-driven statistical program

    R Commander

    A graphical user interface for R

    Its interface includes menus, buttons and a few other elements
