1.3.1 Measuring Center: The Mean - The arithmetic average

Preview:

Citation preview

1.3.1MeasuringCenter:TheMeanMean-Thearithmeticaverage.Tofindthemean (pronouncedxbar)ofasetofobservations,addtheirvaluesanddividebythenumberofobservations.Ifthenobservationsarex1,x2,…,xn,theirmeanis:

Or

Actually,thenotation referstothemeanofasample.Mostofthetime,thedatawe’llencountercanbethoughtofasasamplefromsomelargerpopulation.Whenweneedtorefertoapopulationmean,we’llusethesymbolμ(Greeklettermu,pronounced“mew”).Ifyouhavetheentirepopulationofdataavailable,thenyoucalculateμinjustthewayyou’dexpect:addthevaluesofalltheobservations,anddividebythenumberofobservations.Example–TravelTimestoWorkinNorthCarolinaCalculatingthemeanBelowisdataontraveltimesof15NorthCarolinaresidents.1)Findthemeantraveltimeforall15workers2)Calculatethemeanagain,thistimeexcludingthepersonwhoreporteda60-minutetraveltimetowork.Whatdoyounotice?

Thepreviousexampleillustratesanimportantweaknessofthemeanasameasureofcenter:themeanissensitivetotheinfluenceofextremeobservations.Thesemaybeoutliers,butaskeweddistributionthathasnooutlierswillalsopullthemeantowarditslongtail.Becausethemeancannotresisttheinfluenceofextremeobservations,wesaythatitisnotaresistantmeasureofcenter.ResistantMeasure-Astatisticthatisnotaffectedverymuchbyextremeobservations.

1.3.2MeasuringCenter:TheMedianMedian-ThemedianMisthemidpointofadistribution,thenumbersuchthathalftheobservationsaresmallerandtheotherhalfarelarger.Tofindthemedianofadistribution:

1. Arrangeallobservationsinorderofsize,fromsmallesttolargest.2. Ifthenumberofobservationsnisodd,themedianMisthecenterobservationintheordered

list.3. Ifthenumberofobservationsniseven,themedianMistheaverageofthetwocenter

observationsintheorderedlist.Example–TravelTimestoWorkinNorthCarolinaFindingthemedianwhennisoddWhatisthemediantraveltimeforour15NorthCarolinaworkers?Herearethedataarrangedinorder:

51010101012152020253030404060

Thecountofobservationsn=15isodd.Thebold20isthecenterobservationintheorderedlist,with7observationstoitsleftand7toitsright.Thisisthemedian,M=20minutes.

Example–StuckinTrafficFindingthemedianwhennisevenPeoplesaythatittakesalongtimetogettoworkinNewYorkStateduetotheheavytrafficnearbigcities.Whatdothedatasay?Herearethetraveltimesinminutesof20randomlychosenNewYorkworkers:

103052540201015302015208515651560604045

1.Makeastemplotofthedata.Besuretoincludeakey.2.Findaninterpretthemedian.

1.3.3ComparingtheMeanandtheMedianOurdiscussionoftraveltimestoworkinNorthCarolinaillustratesanimportantdifferencebetweenthemeanandthemedian.Themediantraveltime(themidpointofthedistribution)is20minutes.Themeantraveltimeishigher,22.5minutes.Themeanispulledtowardtherighttailofthisright-skeweddistribution.Themedian,unlikethemean,isresistant.Ifthelongesttraveltimewere600minutesratherthan60minutes,themeanwouldincreasetomorethan58minutesbutthemedianwouldnotchangeatall.Theoutlierjustcountsasoneobservationabovethecenter,nomatterhowfarabovethecenteritlies.Themeanusestheactualvalueofeachobservationandsowillchaseasinglelargeobservationupward.Themeanandmedianofaroughlysymmetricdistributionareclosetogether.Ifthedistributionisexactlysymmetric,themeanandmedianareexactlythesame.Inaskeweddistribution,themeanisusuallyfartheroutinthelongtailthanisthemedian.LeftSkewedDistributions RightSkewedDistribution

CheckYourUnderstandingQuestions1through4refertothefollowingsetting.Here,onceagain,isthestemplotoftraveltimestoworkfor20randomlyselectedNewYorkers.Earlier,wefoundthatthemedianwas22.5minutes.1.Basedonlyonthestemplot,wouldyouexpectthemeantraveltimetobelessthan,aboutthesameas,orlargerthanthemedian?Why? 2.Useyourcalculatortofindthemeantraveltime.WasyouranswertoQuestion1correct? 3.InterpretyourresultfromQuestion2incontextwithoutusingthewords“mean”or“average.”4.Wouldthemeanorthemedianbeamoreappropriatesummaryofthecenterofthisdistributionofdrivetimes?Justifyyouranswer.

1.3.4MeasuringSpread:TheInterquartileRange(IQR)Ausefulnumericaldescriptionofadistributionrequiresbothameasureofcenterandameasureofspread.HowtoCalculateQuartilesQ1|M|Q31.ArrangetheobservationsinincreasingorderandlocatethemedianMintheorderedlistofobservations.2.ThefirstquartileQ1isthemedianoftheobservationswhosepositionintheorderedlististotheleftofthemedian.3.ThethirdquartileQ3isthemedianoftheobservationswhosepositionintheorderedlististotherightofthemedian.InterquartileRange–IQR=Q3-Q1Example–TravelTimestoWorkinNorthCarolinaCalculatingquartilesOurNorthCarolinasampleof15workers’traveltimes,arrangedinincreasingorder,isThereisanoddnumberofobservations,sothemedianisthemiddleone,thebold20inthelist.Thefirstquartileisthemedianofthe7observationstotheleftofthemedian.Thisisthe4thofthese7observations,soQ1=10minutes(showninblue).Thethirdquartileisthemedianofthe7observationstotherightofthemedian,Q3=30minutes(showningreen).Sothespreadofthemiddle50%ofthetraveltimesisIQR=Q3−Q1=30−10=20minutes.BesuretoleaveouttheoverallmedianMwhenyoulocatethequartiles.

ThequartilesandtheinterquartilerangeareresistantbecausetheyarenotaffectedbyafewextremeobservationsExample–StuckinTrafficAgainFindingandinterpretingtheIQRFindandinterprettheinterquartilerange(IQR).

1.3.5IdentifyingOutliersInadditiontoservingasameasureofspread,theinterquartilerange(IQR)isusedaspartofaruleofthumbforidentifyingoutliers.1.5*IQR–Callanobservationanoutlierifitfallsmorethan1.5xIQRabovethethirdquartileorbelowthefirstquartileExample–TravelTimestoworkinNewYorkIdentifyingOutliersusingthe1.5*IQRruleIdentifyanyoutliersinthedatafromthestemplot.Q1=15minutesQ3=42.5minutesIQR=27.5minutesExample–TravelTimestoWorkinNorthCarolinaIdentifyingOutliersDetermineifthetraveltimeof60minutesinthesampleof15NorthCarolinaworkersisanoutlier.Q1=10minutesQ3=30minutesIQR=20minutes

1.3.6TheFive-NumberSummaryandBoxplotsFive-NumberSummary–Consistsofthesmallestobservation,thefirstquartile,themedian,thethirdquartile,andthelargestobservation,writteninorderfromsmallesttolargest.Insymbols,thefive-numbersummaryis

MinimumQ1MQ3Maximum

Thesefivenumbersdivideeachdistributionroughlyintoquarters.About25%ofthedatavaluesfallbetweentheminimumandQ1,about25%arebetweenQ1andthemedian,about25%arebetweenthemedianandQ3,andabout25%arebetweenQ3andthemaximum.Thefive-numbersummaryofadistributionleadstoanewgraph,theboxplot(akaboxandwhiskerplot).HowtoMakeaBoxplot1.Acentralboxisdrawnfromthefirstquartile(Q1)tothethirdquartile(Q3).2.Alineintheboxmarksthemedian.3.Lines(calledwhiskers)extendfromtheboxouttothesmallestandlargestobservationsthatarenotoutliers.Example–HomeRunKingMakingaBoxplotBarryBondssetthemajorleaguerecordbyhitting73homerunsinasingleseasonin2001.OnAugust7,2007,Bondshithis756thcareerhomerun,whichbrokeHankAaron’slongstandingrecordof755.Bytheendofthe2007seasonwhenBondsretired,hehadincreasedthetotalto762.HerearedataonthenumberofhomerunsthatBondshitineachofhis21completeseasons:

162524193325344637334240373449734645452628

Makeaboxplotfortheabovedata,theinitialstepshavebeendonetosaveyoutime.

CheckYourUnderstandingThe2009rosteroftheDallasCowboysprofessionalfootballteamincluded10offensivelinemen.Theirweights(inpounds)were

338318353313318326307317311311

1.Findthefive-numbersummaryforthesedatabyhand.Showyourwork.2.CalculatetheIQR.Interpretthisvalueincontext.3.Determinewhetherthereareanyoutliersusingthe1.5×IQRrule.4.Drawaboxplotofthedata.

1.3.7MeasuringSpread:TheStandardDeviationThefive-numbersummaryisnotthemostcommonnumericaldescriptionofadistribution.Thatdistinctionbelongstothecombinationofthemeantomeasurecenterandthestandarddeviationtomeasurespread.Thestandarddeviationanditscloserelative,thevariance,measurespreadbylookingathowfartheobservationsarefromtheirmean.Let’sexplorethisideausingasimplesetofdata.Example–HowManyPets?InvestigatingspreadaroundthemeanBelowlistsdatadetailingthenumberofpetsownedby9children.

134445789

Themeannumberofpetsis5.Let’slookatwheretheobservationsinthedatasetarerelativetothemean.Thefigureabovedisplaysthedatainadotplot,withthemeanclearlymarked.Thedatavalue1is4unitsbelowthemean.Wesaythatitsdeviationfromthemeanis−4.Whataboutthedatavalue7?Itsdeviationis7−5=2(itis2unitsabovethemean).Thearrowsinthefiguremarkthesetwodeviationsfromthemean.Thedeviationsshowhowmuchthedatavaryabouttheirmean.Theyarethestartingpointforcalculatingthevarianceandstandarddeviation.

Thetabletotheleftshowsthedeviationfromthe

mean foreachvalueinthedataset.Sumthedeviationsfromthemean.Youshouldget0,becausethemeanisthebalancepointofthedistribution.Sincethesumofthedeviationsfromthemeanwillbe0foranysetofdata,weneedanotherwaytocalculatespreadaroundthemean.Howcanwefixtheproblemofthepositiveandnegativedeviationscancelingout?Wecouldtaketheabsolutevalueofeachdeviation.Orwecouldsquarethedeviations.Formathematicalreasonsbeyondthescopeofthisbook,statisticianschoosetosquareratherthantouseabsolutevalues.

Wehaveaddedacolumntothetablethatshowsthe

squareofeachdeviation .Addupthesquareddeviations.Didyouget52?Nowwecomputetheaveragesquareddeviation—sortof.Insteadofdividingbythenumberofobservationsn,wedividebyn−1:

Thevalue6.5iscalledthevariance.

Variance- The average squared distance of the observations in a data set from their mean.

In symbols, Becausewesquaredallthedeviations,ourunitsarein“squaredpets.”That’snogood.We’lltakethesquareroottogetbacktothecorrectunits—pets.Theresultingvalueisthestandarddeviation:

This2.55isroughlytheaveragedistanceofthevaluesinthedatasetfromthemean.StandardDeviation-Thestandarddeviationsxmeasurestheaveragedistanceoftheobservationsfromtheirmean.Itiscalculatedbyfindinganaverageofthesquareddistancesandthentakingthe

squareroot.Thisaveragesquareddistanceiscalledthevariance.Insymbols,thevariance isgivenbyHowtoFindtheStandardDeviation

1. Findthedistanceofeachobservationfromthemeanandsquareeachofthesedistances.2. Averagethedistancesbydividingtheirsumbyn−1.3. Thestandarddeviationsxisthesquarerootofthisaveragesquareddistance:

Manycalculatorsreporttwostandarddeviations,givingyouachoiceofdividingbynorbyn−1.Theformerisusuallylabeledσx,thesymbolforthestandarddeviationofapopulation.Ifyourdatasetconsistsoftheentirepopulation,thenit’sappropriatetouseσx.Moreoften,thedatawe’reexaminingcomefromasample.Inthatcase,weshouldusesx.Moreimportantthanthedetailsofcalculatingsxarethepropertiesthatdeterminetheusefulnessofthestandarddeviation:

• sxmeasuresspreadaboutthemeanandshouldbeusedonlywhenthemeanischosenasthemeasureofcenter.

• sxisalwaysgreaterthanorequalto0.sx=0onlywhenthereisnovariability.Thishappensonlywhenallobservationshavethesamevalue.Otherwise,sx>0.Astheobservationsbecomemorespreadoutabouttheirmean,sxgetslarger.

• sxhasthesameunitsofmeasurementastheoriginalobservations.Forexample,ifyoumeasuremetabolicratesincalories,boththemeanXandthestandarddeviationsxarealsoin

calories.Thisisonereasontoprefersxtothevariance ,whichisinsquaredcalories.• LikethemeanX,sxisnotresistant.Afewoutlierscanmakesxverylarge.

TheuseofsquareddeviationsmakessxevenmoresensitivethanXtoafewextremeobservations.

CheckYourUnderstandingTheheights(ininches)ofthefivestartersonabasketballteamare67,72,76,76,and84.1.Findandinterpretthemean.2.Makeatablethatshows,foreachvalue,itsdeviationfromthemeananditssquareddeviationfromthemean.3.Showhowtocalculatethevarianceandstandarddeviationfromthevaluesinyourtable.4.Interpretthemeaningofthestandarddeviationinthissetting.

1.3.9ChoosingMeasureofCenterandSpreadWenowhaveachoicebetweentwodescriptionsofthecenterandspreadofadistribution:themedianandIQR,orXandsx.BecauseXandsxaresensitivetoextremeobservations,theycanbemisleadingwhenadistributionisstronglyskewedorhasoutliers.Inthesecases,themedianandIQR,whicharebothresistanttoextremevalues,provideabettersummary.We’llseeinthenextchapterthatthemeanandstandarddeviationarethenaturalmeasuresofcenterandspreadforaveryimportantclassofsymmetricdistributions,theNormaldistributions.ChoosingMeasuresofCenterandSpreadThemedianandIQRareusuallybetterthanthemeanandstandarddeviationfordescribingaskeweddistributionoradistributionwithstrongoutliers.UseXandsxonlyforreasonablysymmetricdistributionsthatdon’thaveoutliers.Rememberthatagraphgivesthebestoverallpictureofadistribution.Numericalmeasuresofcenterandspreadreportspecificfactsaboutadistribution,buttheydonotdescribeitsentireshape.Numericalsummariesdonothighlightthepresenceofmultiplepeaksorclusters,forexample.Alwaysplotyourdata.

Example-WhoTextsMore—MalesorFemales?PullingitalltogetherFortheirfinalproject,agroupofAPStatisticsstudentsinvestigatedtheirbeliefthatfemalestextmorethanmales.Theyaskedarandomsampleofstudentsfromtheirschooltorecordthenumberoftextmessagessentandreceivedoveratwo-dayperiod.Herearetheirdata:

Recommended