
SS EN 21 Descriptive Statistics - Amazon Web Services…


School of Six Sigma
Descriptive Statistics

Overview

In this module we’re going to learn about descriptive statistics. By the end of this module you’ll know what descriptive statistics are, as well as the different measures of central tendency and dispersion.

Definition of Descriptive Statistics

Let’s get started by explaining what descriptive statistics are and when they’re used.

If someone were to ask you to describe each of these people, you’d likely talk about things such as their height, eye color, hair color, and so on.

When we’re describing someone’s appearance we’re actually using a form of descriptive statistics, which help us to describe our data.

 


Measures of Central Tendency

When it comes to describing our data we generally focus on two characteristics: central tendency and dispersion. Let’s explore the different measures of central tendency.

Mean  

The first measure of central tendency is the mean, which is the arithmetic balance point or average of a distribution. Calculating the mean is straightforward. Here we see a data set consisting of 9 numbers: 3, 2, 6, 6, 8, 10, 6, 1, and 4. To calculate the mean, we first add the numbers up, which in our example equates to 46. We then divide this figure by the total number of data points, which is 9. When we divide 46 by 9 we learn that the mean, or the average, is approximately 5.1.

We’ll use the mean to describe the central tendency when our data are normally distributed.

We can usually tell when our data are normally distributed by looking at them in a graph known as a histogram, which we’ll cover later in the course. Additionally, there are statistical tests we can run to determine whether our data are normal or not.
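The mean calculation described in this section can be sketched in a few lines of Python. This is only an illustration using the nine numbers from the example; the variable names are our own.

```python
# Data set from the worked example in this section.
data = [3, 2, 6, 6, 8, 10, 6, 1, 4]

total = sum(data)      # add the numbers up
count = len(data)      # total number of data points
mean = total / count   # divide the sum by the count

print(total, count, round(mean, 1))  # 46 9 5.1
```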

Median  

The second measure of central tendency is the median, which is the midpoint of a data set. Let’s use the same data set to learn how to calculate the median. The first thing we must do is arrange the numbers in ascending order, or smallest to largest. We then locate the midpoint of the data set which, in our example, is 6 since it lands in the middle of the ordered data. If we had an even number of data points we would simply average the two middle figures in order to arrive at the median.
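As a minimal sketch of the steps just described (sort, then take the middle value, or average the two middle values when the count is even), the median could be computed like this. The function name `median` and the four-number list are illustrative assumptions, not part of the course material.

```python
def median(values):
    """Return the midpoint of a list of numbers."""
    ordered = sorted(values)   # arrange smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value is the median.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [3, 2, 6, 6, 8, 10, 6, 1, 4]
print(sorted(data))          # [1, 2, 3, 4, 6, 6, 6, 8, 10]
print(median(data))          # 6
print(median([1, 2, 3, 4]))  # 2.5 (hypothetical even-length list)
```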

We’ll use the median to describe the central tendency when our data are not normally distributed. For example, in this histogram we can see that the data are skewed to the right; as such, the mean may not be reliable.

We often see the median used within the real estate industry to describe home prices, since most neighborhoods have a few extremely expensive homes that artificially drive the average home price up. Using the median, which is largely unaffected by a few outlier data points, makes the most sense. If you ever have a realtor speaking to you about the average home price you should explain to them why using the median is more appropriate.
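To see how much a single expensive home can pull the mean, consider the small sketch below. The five prices are made up purely for illustration; only `statistics.mean` and `statistics.median` from the Python standard library are used.

```python
import statistics

# Hypothetical home prices: four typical homes and one very expensive outlier.
prices = [250_000, 260_000, 270_000, 280_000, 2_500_000]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

print(f"mean:   {mean_price:,.0f}")    # mean:   712,000
print(f"median: {median_price:,.0f}")  # median: 270,000
```

In this made-up neighborhood the mean is higher than four of the five prices, while the median still sits among the typical homes.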

Mode  

The last measure of central tendency is the mode, which is the most frequently occurring value in a list. The mode is useful when dealing with attribute data and is actually the statistic used to create things like Pareto charts.

Let’s learn how to determine the mode using the same data set as before. While not mandatory, it’s helpful to once again order the data in ascending order. We then note which value occurs the most, which in our example is 6. So those are the 3 primary measures of central tendency.
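Counting how often each value occurs is easy to sketch with `collections.Counter` from the Python standard library. This is one possible approach, shown on the same nine numbers; the variable names are our own.

```python
from collections import Counter

data = [3, 2, 6, 6, 8, 10, 6, 1, 4]

counts = Counter(data)                         # value -> number of occurrences
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 6 3
```

Note that `most_common(1)` reports only one value, so if two values tie for most frequent this sketch would show just one of them.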

Measures of Dispersion

Let’s now turn our attention to measures of dispersion. The 3 primary measures of dispersion are the range, variance, and standard deviation. These statistics help us to describe the variation or spread in our data.


Range  

Let’s start with the range, which is the difference between the largest and smallest observation in a data set. Let’s calculate the range using the same data we’ve been working with. While it’s not mandatory, it’s easier if you order the data from smallest to largest. You then simply subtract the smallest value from the largest value. In this case our range is 10 − 1 = 9.
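In code, the range is a one-line calculation. The sketch below uses Python's built-in `max` and `min`, so no sorting is needed; `data_range` is simply our own name for the result.

```python
data = [3, 2, 6, 6, 8, 10, 6, 1, 4]

# Largest observation minus smallest observation.
data_range = max(data) - min(data)

print(max(data), min(data), data_range)  # 10 1 9
```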

We typically use the range when our data are not normally distributed. In other words, when we decide to use the median as our measure of central tendency because our data are skewed or we have outliers that seem to be driving the average up, we’ll also use the range as the measure of dispersion or spread.

Sample Variance

Next we have the sample variance, which is the average squared distance between an observation and the mean. You’ll notice that, since we’re speaking about a sample statistic, we’re using the Roman letter s. The math looks much worse than it really is, so let’s work through an example. For this example our data set consists of the following numbers: 3.8, 4.1, 3.9, and 4.4. When we add these numbers together and divide by 4 we learn that the sample mean is 4.05.

Believe it or not, this is all we need to calculate the sample variance, which is written as a lowercase s squared. To calculate the sample variance we subtract the mean from each data point and square the result. We then add all these squared values together and divide the sum by the number of samples minus 1. The reason we subtract 1 from the number of samples is something called Bessel’s correction, which is meant to help us correct any potential bias in the estimation of the population variance.
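Written out, the calculation just described is the standard sample variance formula. The symbols are our own notation, since the original slide is not reproduced here: n is the number of samples, x_i is the i-th data point, and x̄ is the sample mean.

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
```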


This is what it looks like when we plug our values into the formula. For example, we take 3.8 minus 4.05 and square that and then add that to 4.1 minus 4.05 squared and so on. Once we work out the math we learn that our sample variance is 0.07.
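The same arithmetic can be checked step by step in Python. This is a sketch with variable names of our choosing; the last line compares the hand-rolled result with `statistics.variance`, the standard library function that also divides by n − 1.

```python
import math
import statistics

data = [3.8, 4.1, 3.9, 4.4]

n = len(data)
mean = sum(data) / n                                  # sample mean
squared_deviations = [(x - mean) ** 2 for x in data]  # (x - mean)^2 for each point
sample_variance = sum(squared_deviations) / (n - 1)   # divide by n - 1 (Bessel's correction)

print(round(mean, 2), round(sample_variance, 2))  # 4.05 0.07

# Cross-check against the standard library's sample variance.
print(math.isclose(sample_variance, statistics.variance(data)))  # True
```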

Sample Standard Deviation

And last, we come to the sample standard deviation, which is simply the square root of the sample variance. Staying with the example we just worked with, when we take the square root of 0.07 we learn that our sample standard deviation is approximately 0.265.
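As a quick check of that square root, here is a short Python sketch. It takes the square root of the 0.07 variance from the example and, as a second route, asks `statistics.stdev` (the standard library's sample standard deviation) to work from the raw data.

```python
import math
import statistics

data = [3.8, 4.1, 3.9, 4.4]

# Route 1: square root of the sample variance worked out in the example.
std_from_variance = math.sqrt(0.07)

# Route 2: let the standard library compute it from the raw data.
std_from_data = statistics.stdev(data)

print(round(std_from_variance, 3), round(std_from_data, 3))  # 0.265 0.265
```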

You might wonder why we bother calculating the sample standard deviation when we already know the sample variance. When we calculate the sample variance the differences are squared, meaning the units of the sample variance are not the same as the units of the actual data points. By taking the square root of the variance, the units of standard deviation match the original data points.

When our data are normally distributed, we’ll use the sample standard deviation as the measure of dispersion along with the sample mean as the measure of central tendency. But if our data are not normally distributed, we’ll typically use the range as the measure of dispersion along with the median as the measure of central tendency.