
    R Tutorial

    Input

    Assignment

The most straightforward way to store a list of numbers is through an assignment using the c command. (c stands for "combine.") The idea is that a list of numbers is stored under a given name, and the name is used to refer to the data. A list is specified with the c command, and assignment is specified with the "<-" symbol.
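For example, the following stores four numbers under a name and then prints them back out (the name and the numbers here are chosen only for illustration):

> bubba <- c(3,5,7,9)
> bubba
[1] 3 5 7 9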


We assume that the data file is in the format called "comma separated values" (csv). That is, each line contains a row of values which can be numbers or letters, and each value is separated by a comma. We also assume that the very first row contains a list of labels. The idea is that the labels in the top row are used to refer to the different columns of values.

First we read a very short, somewhat silly, data file. The data file is called simple.csv and has three columns of data and six rows. The three columns are labeled "trial," "mass," and "velocity." We can pretend that each row comes from an observation during one of two trials labeled "A" and "B." A copy of the data file is shown below and is created in defiance of Werner Heisenberg:

    "trial","mass","velcit!""",10,12"",11,14"#",5,$"#",%,10"",10&5,13"#",7,11

The command to read the data file is read.csv. We have to give the command at least one argument, but we will give three different arguments to indicate how the command can be used in different situations. The first argument is the name of the file. The second argument indicates whether or not the first row is a set of labels. The third argument indicates that there is a comma between each number of each line. The following command will read in the data and assign it to a variable called "heisenberg:"

> heisenberg <- read.csv(file="simple.csv",head=TRUE,sep=",")
> summary(heisenberg)
 trial      mass          velocity    
 A:3   Min.   : 5.00   Min.   : 8.00  
 B:3   1st Qu.: 6.25   1st Qu.:10.25  
       Median : 8.50   Median :11.50  
       Mean   : 8.25   Mean   :11.33  
       3rd Qu.:10.38   3rd Qu.:12.75  
       Max.   :11.00   Max.   :14.00  
>

(Note that if you are using a Microsoft system the file naming convention is different from what we use here. If you want to use a backslash it needs to be escaped, i.e. use two backslashes together, "\\." Also you can specify what folder to use by clicking on the "File" option in the main menu and choosing the option to specify your working directory.)
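For example, on a Microsoft Windows system a call to read.csv might look something like the following (the folder shown is only a made-up example):

> heisenberg <- read.csv(file="C:\\Data\\simple.csv",head=TRUE,sep=",")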

    The variable "heisenberg" contains the three columns of data. 6ach column is assigned a name based on theheader (the first line in the file). ou can now access each individual column using a "7" to separate thetwo names&

    > 'eisenbertrial[1] # # #6evels #


> heisenberg$mass
[1] 10.0 11.0  5.0  6.0 10.5  7.0
> heisenberg$velocity
[1] 12 14  8 10 13 11
>

If you are not sure what columns are contained in the variable you can use the names command:

> names(heisenberg)
[1] "trial"    "mass"     "velocity"

We will look at another example which is used throughout this tutorial. We will look at the data found in a spreadsheet located at http://cdiac.ornl.gov/ftp/ndp061a/trees91.wk1. A description of the data file is located at http://cdiac.ornl.gov/ftp/ndp061a/ndp061a.txt. The original data is given in an Excel spreadsheet. It has been converted into a csv file, trees91.csv, by deleting the top set of rows and saving it as a "csv" file. This is an option to save within Excel. (You should save the file on your computer.) It is a good idea to open this file in a spreadsheet and look at it. This will help you make sense of how R stores the data.

The data is used to indicate an estimate of biomass of ponderosa pine in a study performed by Dale W. Johnson, J. Timothy Ball, and Roger F. Walker who are associated with the Biological Sciences Center, Desert Research Institute, P.O. Box 60220, Reno, NV 89506 and the Environmental and Resource Sciences College of Agriculture, University of Nevada, Reno, NV 89512. The data consists of 54 lines, and each line represents an observation. Each observation includes measurements and markers for 28 different measurements of a given tree. For example, the first number in each row is a number, either 1, 2, 3, or 4, which signifies a different level of exposure to carbon dioxide. The sixth number in every row is an estimate of the biomass of the stems of a tree. Note that the very first line in the file is a list of labels used for the different columns of data.

The data can be read into a variable called "tree" using the read.csv command:

> tree <- read.csv(file="trees91.csv",head=TRUE,sep=",")
> attributes(tree)
$names
 [1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC" 
 [9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC" 
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC" 

$class
[1] "data.frame"


$row.names
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
[46] "46" "47" "48" "49" "50" "51" "52" "53" "54"

>

The first thing that R stores is a list of names which refer to each column of the data. For example, the first column is called "C", and the second column is called "N." Tree is of type data.frame. Finally, the rows are numbered consecutively from 1 to 54. Each column has 54 numbers in it.

If you know that a variable is a data frame but are not sure what labels are used to refer to the different columns you can use the names command:

> names(tree)
 [1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC" 
 [9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC" 
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC" 
>

If you want to work with the data in one of the columns you give the name of the data frame, a "$" sign, and the label assigned to the column. For example, the first column in tree can be called using "tree$C:"

> tree$C
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
>

Brief Note on Fixed Width Files

There are many ways to read data using R. We only give two examples, direct assignment and reading csv files. However, another way deserves a brief mention. It is common to come across data that is organized in flat files and delimited at preset locations on each line. This is often called a "fixed width file."

The command to deal with these kinds of files is read.fwf. Examples of how to use this command are not given here, but if you would like more information on how to use this command enter the following command:

> help(read.fwf)
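For orientation only, a call to read.fwf specifies the width of each field on a line; the file name, widths, and column names below are made up purely for illustration:

> cows <- read.fwf(file="cows.txt",widths=c(4,8,3),col.names=c("id","weight","age"))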

Basic Data Types

Numbers

The most basic way to store a number is to make an assignment of a single number:

> a <- 3

The "<-" tells R to take the number to the right of the symbol and store it in the variable whose name is given on the left. You can also use the "=" symbol to make an assignment.

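As a quick illustration of indexing (the particular numbers here are arbitrary), you can store several values in a vector with the c command and then ask for individual entries using square brackets:

> a <- c(1,2,3,4,5)
> a[1]
[1] 1
> a[0]
numeric(0)
> a[6]
[1] NA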

Note that the zero entry is used to indicate how the data is stored. The first entry in the vector is the first number, and if you try to get a number past the last number you get "NA."

    Strings

You are not limited to just storing numbers. You can also store strings. A string is specified by using quotes. Both single and double quotes will work:

> a <- "hello"
> a
[1] "hello"
> b <- c("hello","there")
> b
[1] "hello" "there"
> b[1]
[1] "hello"

    >

Factors

Another important way R can store data is as a factor. Often times an experiment includes trials for different levels of some explanatory variable. For example, when looking at the impact of carbon dioxide on the growth rate of a tree you might try to observe how different trees grow when exposed to different preset concentrations of carbon dioxide. The different levels are also called factors.

Assuming you know how to read in a file, we will look at the data file given in the first chapter. Several of the variables in the file are factors:

> summary(tree$CHBR)
 A1  A2  A3  A4  A5  A6  A7  B1  B2  B3  B4  B5  B6  B7  C1  C2  C3  C4  C5  C6 
  3   1   1   3   1   3   1   1   3   3   3   3   3   3   1   3   1   3   1   1 
 C7 CL6 CL7  D1  D2  D3  D4  D5  D6  D7 
  1   1   1   1   1   3   1   1   1   1 
>

Because the set of options given in the data file corresponding to the "CHBR" column are not all numbers R automatically assumes that it is a factor. When you use summary on a factor it does not print out the five point summary, rather it prints out the possible values and the frequency that they occur.

In this data set several of the columns are factors, but the researchers used numbers to indicate the different levels. For example, the first column, labeled "C," is a factor. Each tree was grown in an environment with one of four different possible levels of carbon dioxide. The researchers quite sensibly labeled these four environments as 1, 2, 3, and 4. Unfortunately, R cannot determine that these are factors and must assume that they are regular numbers.

This is a common problem and there is a way to tell R to treat the "C" column as a set of factors. You specify that a variable is a factor using the factor command. In the following example we convert tree$C into a factor:


> tree$C
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
> summary(tree$C)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   2.000   2.519   3.000   4.000 
> tree$C <- factor(tree$C)
> tree$C
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4
> summary(tree$C)
 1  2  3  4 
 8 23 10 13 
> levels(tree$C)
[1] "1" "2" "3" "4"
>

Once a vector is converted into a set of factors then R treats it in a different manner than when it is a set of numbers. A set of factors has a discrete set of possible values, and it does not make sense to try to find averages or other numerical descriptions. One thing that is important is the number of times that each factor appears, called their "frequencies," which is printed using the summary command.

Data Frames

Another way that information is stored is in data frames. This is a way to take many vectors of different types and store them in the same variable. The vectors can be of all different types. For example, a data frame may contain many lists, and each list might be a list of factors, strings, or numbers.

There are different ways to create and manipulate data frames. Most are beyond the scope of this introduction. They are only mentioned here to offer a more complete description. Please see the first chapter for more information on data frames.

One example of how to create a data frame is given below:

> a <- c(1,2,3,4)
> b <- c(2,4,6,8)
> levels <- factor(c("A","B","A","B"))
> bubba <- data.frame(first=a, second=b, f=levels)
> bubba
  first second f
1     1      2 A
2     2      4 B
3     3      6 A
4     4      8 B
> summary(bubba)
     first          second      f    
 Min.   :1.00   Min.   :2.0   A:2  
 1st Qu.:1.75   1st Qu.:3.5   B:2  
 Median :2.50   Median :5.0        
 Mean   :2.50   Mean   :5.0        


 3rd Qu.:3.25   3rd Qu.:6.5        
 Max.   :4.00   Max.   :8.0        

> bubba$first
[1] 1 2 3 4
> bubba$second
[1] 2 4 6 8
> bubba$f
[1] A B A B
Levels: A B
>

Tables

Another common way to store information is in a table. Here we look at how to define both one way and two way tables. We only look at how to create and define tables; the functions used in the analysis of proportions are examined in another chapter.

One Way Tables

The first example is for a one way table. One way tables are not the most interesting example, but it is a good place to start. One way to create a table is using the table command. The argument it takes is a vector of factors, and it calculates the frequency that each factor occurs. Here is an example of how to create a one way table:

> a <- factor(c("A","A","B","A","B","B","C","A","C"))
> results <- table(a)
> results
a
A B C 
4 3 2 
> attributes(results)
$dim
[1] 3

$dimnames
$dimnames$a
[1] "A" "B" "C"

$class
[1] "table"

> summary(results)
Number of cases in table: 9 
Number of factors: 1 
>

If you know the number of occurrences for each factor then it is possible to create the table directly, but the process is, unfortunately, a bit more convoluted. There is an easier way to define one-way tables (a table with one row), but it does not extend easily to two-way tables (tables with more than one row). You must first create a matrix of numbers. A matrix is like a vector in that it is a list of numbers, but it is different in that you can have both rows and columns of numbers. For example, in our example above the number of occurrences of "A" is 4, the number of occurrences of "B" is 3, and the number of occurrences of "C" is 2. We will create one row of numbers. The first column contains a 4, the second column contains a 3, and the third column contains a 2:


> occur <- matrix(c(4,3,2),ncol=3,byrow=TRUE)
> occur
     [,1] [,2] [,3]
[1,]    4    3    2

At this point the variable "occur" is a matrix with one row and three columns of numbers. To dress it up and use it as a table we would like to give it labels for each column just like in the previous example. Once that is done we convert the matrix to a table using the as.table command:

> colnames(occur) <- c("A","B","C")
> occur
     A B C
[1,] 4 3 2
> occur <- as.table(occur)
> occur
  A B C
A 4 3 2
> attributes(occur)
$dim
[1] 1 3

$dimnames
$dimnames[[1]]
[1] "A"

$dimnames[[2]]
[1] "A" "B" "C"

$class
[1] "table"

>

Two Way Tables

If you want to add rows to your table just add another vector to the argument of the table command. In the example below we have two questions. In the first question the responses are labeled "Never," "Sometimes," or "Always." In the second question the responses are labeled "Yes," "No," or "Maybe." The sets of vectors "a" and "b" contain the response for each measurement. The third item in "a" is how the third person responded to the first question, and the third item in "b" is how the third person responded to the second question.

> a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
> b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
> results <- table(a,b)
> results
           b
a           Maybe No Yes
  Always        2  0   0
  Never         0  1   1
  Sometimes     2  1   1
>


The table command allows us to do a very quick calculation, and we can immediately see that two people who said "Sometimes" to the first question also said "Maybe" to the second question.

Just as in the case with one-way tables it is possible to manually enter two way tables. The procedure is exactly the same as above except that we now have more than one row. We give a brief example below to demonstrate how to enter a two-way table that includes a breakdown of a group of people by both their gender and whether or not they smoke. You enter all of the data as one long list but tell R to break it up into some number of columns:

> sexsmoke <- matrix(c(70,120,65,140),ncol=2,byrow=TRUE)
> rownames(sexsmoke) <- c("male","female")
> colnames(sexsmoke) <- c("smoke","nosmoke")
> sexsmoke <- as.table(sexsmoke)
> sexsmoke
       smoke nosmoke
male      70     120
female    65     140
>

Basic Operations and Numerical Descriptions

Basic Operations

Once you have a vector (or a list of numbers) in memory most basic operations are available. Most of the basic operations will act on a whole vector and can be used to quickly perform a large number of calculations with a single command. There is one thing to note: if you perform an operation on more than one vector it is often necessary that the vectors all contain the same number of entries.

Here we first define a vector which we will call "a" and will look at how to add and subtract constant numbers from all of the numbers in the vector. First, the vector will contain the numbers 1, 2, 3, and 4. We then see how to add 5 to each of the numbers, subtract 10 from each of the numbers, multiply each number by 4, and divide each number by 5.

> a <- c(1,2,3,4)
> a
[1] 1 2 3 4
> a + 5
[1] 6 7 8 9
> a - 10
[1] -9 -8 -7 -6
> a*4
[1]  4  8 12 16
> a/5
[1] 0.2 0.4 0.6 0.8
>

We can save the results in another vector called "b:"

> b <- a - 10
> b
[1] -9 -8 -7 -6
>

If you want to take the square root, find e raised to each number, the logarithm, etc., then the usual commands can be used:


> sqrt(a)
[1] 1.000000 1.414214 1.732051 2.000000
> exp(a)
[1]  2.718282  7.389056 20.085537 54.598150
> log(a)
[1] 0.0000000 0.6931472 1.0986123 1.3862944
> exp(log(a))
[1] 1 2 3 4
>

By combining operations and using parentheses you can make more complicated expressions:

> c <- (a + sqrt(a))/(exp(2)+1)
> c
[1] 0.2384058 0.4069842 0.5640743 0.7152175
>

Note that you can do the same operations with vector arguments. For example to add the elements in vector a to the elements in vector b use the following command:

> a + b
[1] -8 -6 -4 -2
>

The operation is performed on an element by element basis. Note this is true for almost all of the basic functions. So you can bring together all kinds of complicated expressions:

> a*b
[1]  -9 -16 -21 -24
> a/b
[1] -0.1111111 -0.2500000 -0.4285714 -0.6666667
> (a+3)/(sqrt(1-b)*2-1)
[1] 0.7512364 1.0000000 1.2884234 1.6311303
>

You need to be careful of one thing. When you do operations on vectors they are performed on an element by element basis. One ramification of this is that all of the vectors in an expression must be the same length. If the lengths of the vectors differ then you may get an error message, or worse, a warning message and unpredictable results:

> a <- c(1,2,3)
> b <- c(10,11,12,13)
> a+b
[1] 11 13 15 14
Warning message:
longer object length
        is not a multiple of shorter object length in: a + b 
>

As you work in R and create new vectors it can be easy to lose track of what variables you have defined. To get a list of all of the variables that have been defined use the ls() command:

> ls()
[1] "a"            "b"            "bubba"        "c"            "last.warning"
[6] "tree"         "trees"
>

Finally, you should keep in mind that the basic operations almost always work on an element by element basis. There are rare exceptions to this general rule. For example, if you look at the minimum of two vectors using the min command you will get the minimum of all of the numbers. There is a special command, called pmin, that may be the command you want in some circumstances:

> a <- c(1,-2,3,-4)
> b <- c(-1,2,-3,4)
> min(a,b)
[1] -4
> pmin(a,b)
[1] -1 -2 -3 -4
>

Basic Numerical Descriptions

Given a vector of numbers there are some basic commands to make it easier to get some of the basic numerical descriptions of a set of numbers. Here we assume that you can read in the tree data that was discussed in a previous chapter. It is assumed that it is stored in a variable called "tree:"

> tree <- read.csv(file="trees91.csv",head=TRUE,sep=",")
> mean(tree$LFBM)
[1] 0.7649074
> median(tree$LFBM)
[1] 0.72
> quantile(tree$LFBM)
    0%    25%    50%    75%   100% 
0.1300 0.4800 0.7200 1.0075 1.7600 
> min(tree$LFBM)
[1] 0.13
> max(tree$LFBM)
[1] 1.76
> var(tree$LFBM)
[1] 0.1429382
> sd(tree$LFBM)
[1] 0.3780717
>

Finally, there is one command that will print out the min, max, mean, median, and quantiles:

> summary(tree$LFBM)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1300  0.4800  0.7200  0.7649  1.0080  1.7600 
>

The summary command is especially nice because if you give it a data frame it will print out the summary for every vector in the data frame:

> summary(tree)
       C               N              CHBR         REP             LFBM       
 Min.   :1.000   Min.   :1.000   A1     : 3   Min.   : 1.00   Min.   :0.1300  
 1st Qu.:2.000   1st Qu.:1.000   A4     : 3   1st Qu.: 9.00   1st Qu.:0.4800  
 Median :2.000   Median :2.000   A6     : 3   Median :14.00   Median :0.7200  
 Mean   :2.519   Mean   :1.926   B2     : 3   Mean   :13.05   Mean   :0.7649  
 3rd Qu.:3.000   3rd Qu.:3.000   B3     : 3   3rd Qu.:20.00   3rd Qu.:1.0075  
 Max.   :4.000   Max.   :3.000   B4     : 3   Max.   :20.00   Max.   :1.7600  
                                 (Other):36   NA's   :11.00                   
      STBM             RTBM            LFNCC           STNCC       
 Min.   :0.0300   Min.   :0.1200   Min.   :0.880   Min.   :0.3700  
 1st Qu.:0.1900   1st Qu.:0.2825   1st Qu.:1.312   1st Qu.:0.6400  
 Median :0.2450   Median :0.4450   Median :1.550   Median :0.7850  
 Mean   :0.2883   Mean   :0.4662   Mean   :1.560   Mean   :0.7872  
 3rd Qu.:0.3800   3rd Qu.:0.5500   3rd Qu.:1.788   3rd Qu.:0.9350  
 Max.   :0.7200   Max.   :1.5100   Max.   :2.760   Max.   :1.2900  
      RTNCC            LFBCC           STBCC           RTBCC      
 Min.   :0.4700   Min.   :25.00   Min.   :14.00   Min.   :15.00  
 1st Qu.:0.6000   1st Qu.:34.00   1st Qu.:17.00   1st Qu.:19.00  
 Median :0.7500   Median :37.00   Median :18.00   Median :20.00  
 Mean   :0.7394   Mean   :36.96   Mean   :18.80   Mean   :21.43  
 3rd Qu.:0.8100   3rd Qu.:41.00   3rd Qu.:20.00   3rd Qu.:23.00  
 Max.   :1.5500   Max.   :48.00   Max.   :27.00   Max.   :41.00  
     LFCACC           STCACC           RTCACC           LFKCC       
 Min.   :0.2100   Min.   :0.1300   Min.   :0.1100   Min.   :0.6500  
 1st Qu.:0.2600   1st Qu.:0.1600   1st Qu.:0.1600   1st Qu.:0.8100  
 Median :0.2900   Median :0.1700   Median :0.1650   Median :0.9000  
 Mean   :0.2869   Mean   :0.1774   Mean   :0.1654   Mean   :0.9053  
 3rd Qu.:0.3100   3rd Qu.:0.1875   3rd Qu.:0.1700   3rd Qu.:0.9900  
 Max.   :0.3600   Max.   :0.2400   Max.   :0.2400   Max.   :1.1800  
                                                    NA's   :1.0000  
     STKCC           RTKCC           LFMGCC           STMGCC     
 Min.   :0.870   Min.   :0.330   Min.   :0.0700   Min.   :0.100  
 1st Qu.:0.940   1st Qu.:0.400   1st Qu.:0.1000   1st Qu.:0.110  
 Median :1.055   Median :0.475   Median :0.1200   Median :0.130  
 Mean   :1.105   Mean   :0.473   Mean   :0.1109   Mean   :0.135  
 3rd Qu.:1.210   3rd Qu.:0.520   3rd Qu.:0.1300   3rd Qu.:0.150  
 Max.   :1.520   Max.   :0.640   Max.   :0.1400   Max.   :0.190  
     RTMGCC            LFPCC            STPCC            RTPCC       
 Min.   :0.04000   Min.   :0.1500   Min.   :0.1500   Min.   :0.1000  
 1st Qu.:0.06000   1st Qu.:0.2000   1st Qu.:0.2200   1st Qu.:0.1300  
 Median :0.07000   Median :0.2400   Median :0.2800   Median :0.1450  
 Mean   :0.06648   Mean   :0.2381   Mean   :0.2707   Mean   :0.1465  
 3rd Qu.:0.07000   3rd Qu.:0.2700   3rd Qu.:0.3175   3rd Qu.:0.1600  
 Max.   :0.09000   Max.   :0.3100   Max.   :0.4100   Max.   :0.2100  
     LFSCC            STSCC            RTSCC       
 Min.   :0.0900   Min.   :0.1400   Min.   :0.0900  
 1st Qu.:0.1325   1st Qu.:0.1600   1st Qu.:0.1200  
 Median :0.1600   Median :0.1800   Median :0.1300  
 Mean   :0.1661   Mean   :0.1817   Mean   :0.1298  
 3rd Qu.:0.1875   3rd Qu.:0.2000   3rd Qu.:0.1475  
 Max.   :0.2600   Max.   :0.2800   Max.   :0.1700  
>

Basic Probability Distributions

We look at some of the basic operations associated with probability distributions. There are a large number of probability distributions available, but we only look at a few. If you would like to know what distributions are available you can do a search using the command help.search("distribution").

Here we give details about the commands associated with the normal distribution and briefly mention the commands for other distributions. The functions for different distributions are very similar where the differences are noted below.

The Normal Distribution

There are four functions that can be used to generate the values associated with the normal distribution. You can get a full list of them and their options using the help command:

> help(Normal)

The first function we look at is dnorm. Given a set of values it returns the height of the probability distribution at each point. If you only give the points it assumes you want to use a mean of zero and a standard deviation of one.
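As a short illustration (the points, mean, and standard deviation below are arbitrary choices), dnorm returns the height of the density at a point, and the companion function pnorm returns the cumulative probability to the left of a point:

> dnorm(0)
[1] 0.3989423
> dnorm(0,mean=4,sd=10)
[1] 0.03682701
> pnorm(0)
[1] 0.5
> pnorm(1,mean=1,sd=2)
[1] 0.5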


> qnorm(0.5,mean=2)
[1] 2
> qnorm(0.25,mean=2,sd=2)
[1] 0.6510205
> qnorm(0.333)
[1] -0.4316442
> qnorm(0.333,sd=3)
[1] -1.294933
> qnorm(0.75,mean=5,sd=2)
[1] 6.34898
> v <- c(0.1,0.3,0.75)
> qnorm(v)
[1] -1.2815516 -0.5244005  0.6744898
> x <- seq(0,1,by=0.05)
> y <- qnorm(x)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=2)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=0.1)
> plot(x,y)

The last function we examine is the rnorm function which can generate random numbers whose distribution is normal. The argument that you give it is the number of random numbers that you want, and it has optional arguments to specify the mean and standard deviation:

> rnorm(4)
[1]  1.2387271 -0.2323259 -1.2003081 -1.6718483
> rnorm(4,mean=3)
[1] 2.633080 3.617486 2.038861 2.601933
> rnorm(4,mean=3,sd=3)
[1] 4.580556 2.974903 4.756097 6.395894
> rnorm(4,mean=3,sd=3)
[1]  3.000852  3.714180 10.032021  3.295667
> y <- rnorm(200)
> hist(y)
> y <- rnorm(200,mean=-2)
> hist(y)
> y <- rnorm(200,mean=-2,sd=4)
> hist(y)
> qqnorm(y)
> qqline(y)

The t Distribution

There are four functions that can be used to generate the values associated with the t distribution. You can get a full list of them and their options using the help command:

> help(TDist)

These commands work just like the commands for the normal distribution. One difference is that the commands assume that the values are normalized to mean zero and standard deviation one, so you have to use a little algebra to use these functions in practice. The other difference is that you have to specify the number of degrees of freedom. The commands follow the same kind of naming convention, and the names of the commands are dt, pt, qt, and rt.


A few examples are given below to show how to use the different commands. First we have the distribution function, dt:

> x <- seq(-20,20,by=.5)
> y <- dt(x,df=10)
> plot(x,y)

> y <- dt(x,df=50)
> plot(x,y)

Next we have the cumulative probability distribution function:

> pt(-3,df=10)
[1] 0.006671828
> pt(3,df=10)
[1] 0.9933282
> 1-pt(3,df=10)
[1] 0.006671828
> pt(3,df=20)
[1] 0.996462

> x <- c(-3,-4,-2,-1)
> pt((mean(x)-2)/sd(x),df=20)
[1] 0.001165548
> pt((mean(x)-2)/sd(x),df=40)
[1] 0.000603064

Next we have the inverse cumulative probability distribution function:

> qt(0.05,df=10)
[1] -1.812461
> qt(0.95,df=10)
[1] 1.812461
> qt(0.05,df=20)
[1] -1.724718
> qt(0.95,df=20)
[1] 1.724718
> v <- c(0.005,.025,.05)
> qt(v,df=253)
[1] -2.595401 -1.969385 -1.650899
> qt(v,df=25)
[1] -2.787436 -2.059539 -1.708141
>

Finally random numbers can be generated according to the t distribution:

> rt(3,df=10)
[1] 0.9440930 2.1734365 0.6785262
> rt(3,df=20)
[1]  0.1043300 -1.4682198  0.0715013
> rt(3,df=20)
[1]  0.8023832 -0.4759780 -1.0546125

The Binomial Distribution


There are four functions that can be used to generate the values associated with the binomial distribution. You can get a full list of them and their options using the help command:

> help(Binomial)

These commands work just like the commands for the normal distribution. The binomial distribution requires two extra parameters, the number of trials and the probability of success for a single trial. The commands follow the same kind of naming convention, and the names of the commands are dbinom, pbinom, qbinom, and rbinom.

A few examples are given below to show how to use the different commands. First we have the distribution function, dbinom:

> x <- seq(0,50,by=1)
> y <- dbinom(x,50,0.2)
> plot(x,y)

> y <- dbinom(x,50,0.6)
> plot(x,y)

> x <- seq(0,100,by=1)
> y <- dbinom(x,100,0.6)
> plot(x,y)

Next we have the cumulative probability distribution function:

> pbinom(24,50,0.5)
[1] 0.4438624
> pbinom(25,50,0.5)
[1] 0.5561376
> pbinom(25,51,0.5)
[1] 0.5
> pbinom(26,51,0.5)
[1] 0.610116

> pbinom(25,50,0.5)
[1] 0.5561376
> pbinom(25,50,0.25)
[1] 0.999962
> pbinom(25,500,0.25)
[1] 4.955658e-33

Next we have the inverse cumulative probability distribution function:

> qbinom(0.5,51,1/2)
[1] 25
> qbinom(0.25,51,1/2)
[1] 23

> pbinom(23,51,1/2)
[1] 0.2879247
> pbinom(22,51,1/2)
[1] 0.200531

Finally random numbers can be generated according to the binomial distribution:

> rbinom(5,100,.2)
[1] 30 23 21 19 18


> rbinom(5,100,.7)
[1] 66 66 58 68 63
>

The Chi-Squared Distribution

There are four functions that can be used to generate the values associated with the Chi-Squared distribution. You can get a full list of them and their options using the help command:

> help(Chisquare)

These commands work just like the commands for the normal distribution. The first difference is that it is assumed that you have normalized the value so no mean can be specified. The other difference is that you have to specify the number of degrees of freedom. The commands follow the same kind of naming convention, and the names of the commands are dchisq, pchisq, qchisq, and rchisq.

A few examples are given below to show how to use the different commands. First we have the distribution function, dchisq:

> x <- seq(-20,20,by=.5)
> y <- dchisq(x,df=10)
> plot(x,y)

Next we have the cumulative probability distribution function:

> pchisq(2,df=10)
[1] 0.003659847
> pchisq(3,df=10)
[1] 0.01857594
> 1-pchisq(3,df=10)
[1] 0.981424
> pchisq(3,df=20)
[1] 4.097501e-06
> x <- c(2,4,5,6)
> pchisq(x,df=20)
[1] 1.114255e-07 4.649808e-05 2.773521e-04 1.102488e-03

Next we have the inverse cumulative probability distribution function:

> qchisq(0.05,df=10)
[1] 3.940299
> qchisq(0.95,df=10)
[1] 18.30704
> qchisq(0.05,df=20)
[1] 10.85081
> qchisq(0.95,df=20)
[1] 31.41043
> v <- c(0.005,.025,.05)
> qchisq(v,df=253)
[1] 198.8161 210.8355 217.1713
> qchisq(v,df=25)
[1] 10.51965 13.11972 14.61141


Finally random numbers can be generated according to the Chi-Squared distribution:

> rchisq(3,df=10)
[1] 16.80075 20.28412 12.39099
> rchisq(3,df=20)
[1] 17.838878  8.591936 17.486372

> rchisq(3,df=20)
[1] 11.19279 23.86907 24.81251


Basic Plots

We look at some of the ways R can display information graphically. This is a basic introduction to some of the basic plotting commands.

In each of the topics that follow it is assumed that two different data sets, w1.dat and trees91.csv, have been read and defined using the same variables as in the first chapter. Both of these data sets come from the study discussed on the web site given in the first chapter. We assume that they are read using "read.csv" into variables "w1" and "tree:"

> w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
> names(w1)
[1] "vals"
> tree <- read.csv(file="trees91.csv",sep=",",head=TRUE)
> names(tree)
 [1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC" 
 [9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC" 
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC" 
>

    Strip Charts

A strip chart is the most basic type of plot available. It plots the data in order along a line with each data point represented as a box. Here we provide examples using the "w1" data frame mentioned at the top of this page, and the one column of the data is "w1$vals."

To create a strip chart of this data use the stripchart command:

> stripchart(w1$vals)

As you can see this is about as bare bones as you can get. There is no title nor axes labels. It only shows how the data looks if you were to put it all along one line and mark out a box at each point. If you would prefer to see which points are repeated you can specify that repeated points be stacked:

> stripchart(w1$vals,method="stack")

A variation on this is to have the boxes moved up and down so that there is more separation between them:


> stripchart(w1$vals,method="jitter")

If you do not want the boxes plotted in the horizontal direction you can plot them in the vertical direction:

> stripchart(w1$vals,vertical=TRUE)
> stripchart(w1$vals,vertical=TRUE,method="jitter")

Since you should always annotate your plots there are many different ways to add titles and labels. One way is within the stripchart command itself:

> stripchart(w1$vals,method="stack",
             main='Leaf BioMass in High CO2 Environment',
             xlab='BioMass of Leaves')

If you have a plot already and want to add a title, you can use the title command:

> title('Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves')

Note that this simply adds the title and labels and will write over the top of any titles or labels you already have.

Histograms

A histogram is a very common plot. It plots the frequencies that data appears within certain ranges. Here we provide examples using the "w1" data frame mentioned at the top of this page, and the one column of data is "w1$vals."

To plot a histogram of the data use the "hist" command:

> hist(w1$vals)

As you can see R will automatically calculate the intervals to use. There are many options to determine how to break up the intervals. Here we look at just one way, varying the domain size and number of breaks. If you would like to know more about the other options check out the help page:

> help(hist)

You can specify the number of breaks to use using the breaks option. Here we look at the histogram for various numbers of breaks:

> hist(w1$vals,breaks=2)
> hist(w1$vals,breaks=4)
> hist(w1$vals,breaks=6)
> hist(w1$vals,breaks=8)
> hist(w1$vals,breaks=12)
>

You can also vary the size of the domain using the xlim option. This option takes a vector with two entries in it, the left value and the right value:

> hist(w1$vals,breaks=12,xlim=c(0,10))
> hist(w1$vals,breaks=12,xlim=c(-1,2))


> hist(w1$vals,breaks=12,xlim=c(0,2))
> hist(w1$vals,breaks=12,xlim=c(1,1.3))
> hist(w1$vals,breaks=12,xlim=c(0.9,1.3))
>

The options for adding titles and labels are exactly the same as for strip charts. You should always annotate your plots and there are many different ways to add titles and labels. One way is within the hist command itself:

> hist(w1$vals,
       main='Leaf BioMass in High CO2 Environment',
       xlab='BioMass of Leaves')

If you have a plot already and want to change or add a title, you can use the title command:

> title('Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves')

Note that this simply adds the title and labels and will write over the top of any titles or labels you already have.

It is not uncommon to add other kinds of plots to a histogram. For example, one of the options to the stripchart command is to add it to a plot that has already been drawn. For example, you might want to have a histogram with the strip chart drawn across the top. The addition of the strip chart might give you a better idea of the density of the data:

> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',
       xlab='BioMass of Leaves',ylim=c(0,16))
> stripchart(w1$vals,add=TRUE,at=15.5)

Boxplots

A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set. Here we provide examples using two different data sets. The first is the "w1" data frame mentioned at the top of this page, and the one column of data is "w1$vals." The second is the "tree" data frame from the "trees91.csv" data file which is also mentioned at the top of the page.

We first use the "w1" data set and look at the boxplot of this data set:

> boxplot(w1$vals)

Again, this is a very plain graph, and the title and labels can be specified in exactly the same way as in the stripchart and hist commands:

> boxplot(w1$vals,
          main='Leaf BioMass in High CO2 Environment',
          ylab='BioMass of Leaves')

Note that the default orientation is to plot the boxplot vertically. Because of this we used the ylab option to specify the axis label. There are a large number of options for this command. To see more of the options see the help page:

> help(boxplot)


As an example you can specify that the boxplot be plotted horizontally by specifying the horizontal option:

> boxplot(w1$vals,
          main='Leaf BioMass in High CO2 Environment',
          xlab='BioMass of Leaves',
          horizontal=TRUE)

The option to plot the box plot horizontally can be put to good use to display a box plot on the same image as a histogram. You need to specify the add option, specify where to put the box plot using the at option, and turn off the addition of axes using the axes option:

> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',
       xlab='BioMass of Leaves',ylim=c(0,16))
> boxplot(w1$vals,horizontal=TRUE,at=15.5,add=TRUE,axes=FALSE)

If you are feeling really crazy you can take a histogram and add a box plot and a strip chart:

> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',
       xlab='BioMass of Leaves',ylim=c(0,16))
> boxplot(w1$vals,horizontal=TRUE,at=16,add=TRUE,axes=FALSE)
> stripchart(w1$vals,add=TRUE,at=15)

Some people shell out good money to have this much fun.

For the second part on boxplots we will look at the second data frame, "tree," which comes from the "trees91.csv" file. To reiterate the discussion at the top of this page and the discussion in the data types chapter, we need to specify which columns are factors:

> tree <- read.csv(file="trees91.csv",sep=",",head=TRUE)
> tree$C <- factor(tree$C)
> tree$N <- factor(tree$N)
> boxplot(tree$STBM,
          main='Stem BioMass in Different CO2 Environments',
          ylab='BioMass of Stems')

That plot does not tell the whole story. It is for all of the trees, but the trees were grown in different kinds of environments. The boxplot command can be used to plot a separate box plot for each level. In this case the data is held in "tree$STBM," and the different levels are stored as factors in "tree$C." The command to create different boxplots is the following:

boxplot(tree$STBM~tree$C)

Note that for the level called "2" there are four outliers which are plotted as little circles. There are many options to annotate your plot including different labels for each level. Please use the help(boxplot) command for more information.

Scatter Plots

A scatter plot provides a graphical view of the relationship between two sets of numbers. Here we provide examples using the "tree" data frame from the "trees91.csv" data file which is mentioned at the top of the


page. In particular we look at the relationship between the stem biomass ("tree$STBM") and the leaf biomass ("tree$LFBM").

The command to plot each pair of points as an x-coordinate and a y-coordinate is "plot:"

> plot(tree$STBM,tree$LFBM)

It appears that there is a strong positive association between the biomass in the stems of a tree and the leaves of the tree. It appears to be a linear relationship. In fact, the correlation between these two sets of observations is quite high:

> cor(tree$STBM,tree$LFBM)
[1] 0.911595
>

Getting back to the plot, you should always annotate your graphs. The title and labels can be specified in exactly the same way as with the other plotting commands:

> plot(tree$STBM,tree$LFBM,
       main="Relationship Between Stem and Leaf Biomass",
       xlab="Stem Biomass",
       ylab="Leaf Biomass")

Normal Plots

The final type of plot that we look at is the normal quantile plot. This plot is used to determine if your data is close to being normally distributed. You cannot be sure that the data is normally distributed, but you can rule out if it is not normally distributed. Here we provide examples using the "w1" data frame mentioned at the top of this page, and the one column of data is "w1$vals."

The command to generate a normal quantile plot is qqnorm. You can give it one argument, the univariate data set of interest:

> qqnorm(w1$vals)

You can annotate the plot in exactly the same way as all of the other plotting commands given here:

> qqnorm(w1$vals,
         main="Normal Q-Q Plot of the Leaf Biomass",
         xlab="Theoretical Quantiles of the Leaf Biomass",
         ylab="Sample Quantiles of the Leaf Biomass")

After you create the normal quantile plot you can also add the theoretical line that the data should fall on if they were normally distributed:

> qqline(w1$vals)

In this example you should see that the data is not quite normally distributed. There are a few outliers, and it does not match up at the tails of the distribution.

Linear Least Squares Regression


Here we look at the most basic linear least squares regression. The main purpose is to provide an example of the basic commands. It is assumed that you know how to enter data or read data files which is covered in the first chapter, and it is assumed that you are familiar with the different data types.

We will examine the interest rate for four year car loans, and the data that we use comes from the U.S. Federal Reserve's mean rates. We are looking at and plotting means. This, of course, is a very bad thing because it removes a lot of the variance and is misleading. The only reason that we are working with the data in this way is to provide an example of linear regression that does not use too many data points. Do not try this without a professional near you, and if a professional is not near you do not tell anybody you did this. They will laugh at you. People are mean, especially professionals.

The first thing to do is to specify the data. Here there are only five pairs of numbers so we can enter them in manually. Each of the five pairs consists of a year and the mean interest rate:

> year <- c(2000,2001,2002,2003,2004)
> rate <- c(9.34,8.50,7.62,6.93,6.60)
> fit <- lm(rate ~ year)
> fit

Call:
lm(formula = rate ~ year)

Coefficients:
(Intercept)         year  
   1419.208       -0.705  

When you make the call to lm it returns a variable with a lot of information in it. If you are just learning about least squares regression you are probably only interested in two things at this point, the slope and the y-intercept. If you just type the name of the variable returned by lm it will print out this minimal information to the screen. (See above.)

If you would like to know what else is stored in the variable you can use the attributes command:

> attributes(fit)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        

$class
[1] "lm"

One of the things you should notice is the coefficients variable within fit. You can print out the y-intercept and slope by accessing this part of the variable:

> fit$coefficients[1]
(Intercept) 
   1419.208 
> fit$coefficients[[1]]
[1] 1419.208
> fit$coefficients[2]
  year 
-0.705 
> fit$coefficients[[2]]
[1] -0.705

Note that if you just want to get the number you should use two square braces. So if you want to get an estimate of the interest rate in the year 2015 you can use the formula for a line:

> fit$coefficients[[2]]*2015+fit$coefficients[[1]]
[1] -1.367

So if you just wait long enough, the banks will pay you to take a car!

A better use for this formula would be to calculate the residuals and plot them:

> res <- rate - (fit$coefficients[[2]]*year+fit$coefficients[[1]])
> res
[1]  0.132 -0.003 -0.178 -0.163  0.212
> plot(year,res)

That is a bit messy, but fortunately there is an easier way to get the residuals:


> residuals(fit)
     1      2      3      4      5 
 0.132 -0.003 -0.178 -0.163  0.212 

If you want to plot the regression line on the same plot as your scatter plot you can use the abline function along with your variable fit:

> plot(year,rate,
       main="Commercial Banks Interest Rate for 4 Year Car Loan",
       sub="http://www.federalreserve.gov/releases/g19/20050805/")
> abline(fit)

Finally, as a teaser for the kinds of analyses you might see later, you can get the results of an F-test by asking R for a summary of the fit variable:

> summary(fit)

Call:
lm(formula = rate ~ year)

Residuals:
     1      2      3      4      5 
 0.132 -0.003 -0.178 -0.163  0.212 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) 1419.20800  126.94957   11.18  0.00153 **
year          -0.70500    0.06341  -11.12  0.00156 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.2005 on 3 degrees of freedom
Multiple R-Squared: 0.9763,     Adjusted R-squared: 0.9684 
F-statistic: 123.6 on 1 and 3 DF,  p-value: 0.001559 

    Calculating Confidence Intervals

Here we look at some examples of calculating confidence intervals. The examples are for both normal and t distributions. We assume that you can enter data and know the commands associated with basic probability.

1. Calculating a Confidence Interval From a Normal Distribution
2. Calculating a Confidence Interval From a t Distribution
3. Calculating Many Confidence Intervals From a t Distribution

Calculating a Confidence Interval From a Normal Distribution

Here we will look at a fictitious example. We will make some assumptions for what we might find in an experiment and find the resulting confidence interval using a normal distribution. Here we assume that the sample mean is 5, the standard deviation is 2, and the sample size is 20. In the example below we will use a 95% confidence level and wish to find the confidence interval. The commands to find the confidence interval in R are the following:


> a <- 5
> s <- 2
> n <- 20
> error <- qnorm(0.975)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.123477
> right
[1] 5.876523
>

The true mean has a probability of 95% of being in the interval between 4.12 and 5.88.

Calculating a Confidence Interval From a t Distribution

Calculating the confidence interval when using a t-test is similar to using a normal distribution. The only difference is that we use the command associated with the t distribution rather than the normal distribution. Here we repeat the procedures above, but we will assume that we are working with a sample standard deviation rather than an exact standard deviation.

Again we assume that the sample mean is 5, the sample standard deviation is 2, and the sample size is 20. We use a 95% confidence level and wish to find the confidence interval. The commands to find the confidence interval in R are the following:

> a <- 5
> s <- 2
> n <- 20
> error <- qt(0.975,df=n-1)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.063971
> right
[1] 5.936029
>

The true mean has a probability of 95% of being in the interval between 4.06 and 5.94.

We now look at an example where we have a univariate data set and want to find the 95% confidence interval for the mean. In this example we use one of the data sets given in the data input chapter. We use the w1 data set:

> w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
> mean(w1$vals)
[1] 0.765
> sd(w1$vals)
[1] 0.3781222

We can now calculate an error for the mean:

> error <- qt(0.975,df=length(w1$vals)-1)*sd(w1$vals)/sqrt(length(w1$vals))
> error
[1] 0.1032075

The confidence interval is found by adding and subtracting the error from the mean:

> left <- mean(w1$vals)-error
> right <- mean(w1$vals)+error
> left
[1] 0.6617925
> right
[1] 0.8682075
>

There is a 95% probability that the true mean is between 0.66 and 0.87.

Calculating Many Confidence Intervals From a t Distribution

Suppose that you want to find the confidence intervals for many tests. This is a common task and most software packages will allow you to do this.

We have three different sets of results:

Comparison 1
            Mean   Std. Dev.   Number (pop.)
Group I     10     3           300
Group II    10.5   2.5         230

Comparison 2
            Mean   Std. Dev.   Number (pop.)
Group I     12     4           210
Group II    13     5.3         340

Comparison 3
            Mean   Std. Dev.   Number (pop.)
Group I     30     4.5         420
Group II    28.5   3           400

For each of these comparisons we want to calculate the associated confidence interval for the difference of the means. For each comparison there are two groups. We will refer to group one as the group whose results are in the first row of each comparison above. We will refer to group two as the group whose results are in the second row of each comparison above. Before we can do that we must first compute a standard error and a t-score. We will find general formulae which is necessary in order to do all three calculations at once.

We assume that the means for the first group are defined in a variable called m1. The means for the second group are defined in a variable called m2. The standard deviations for the first group are in a variable called sd1. The standard deviations for the second group are in a variable called sd2. The number of samples for the first group are in a variable called num1. Finally, the number of samples for the second group are in a variable called num2.

With these definitions the standard error is the square root of (sd1^2)/num1+(sd2^2)/num2. The R commands to do this can be found below:

> m1 <- c(10,12,30)
> m2 <- c(10.5,13,28.5)
> sd1 <- c(3,4,4.5)
> sd2 <- c(2.5,5.3,3)
> num1 <- c(300,210,420)
> num2 <- c(230,340,400)
> se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
> error <- qt(0.975,df=pmin(num1,num2)-1)*se
> m1
[1] 10 12 30
> m2
[1] 10.5 13.0 28.5
> sd1
[1] 3.0 4.0 4.5
> sd2
[1] 2.5 5.3 3.0
> num1
[1] 300 210 420
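Following the formulae above, the interval for each difference of means is the difference plus or minus the error term; a minimal sketch of the remaining steps, using the variables just defined, is:

> left <- (m1-m2)-error
> right <- (m1-m2)+error

Printing left and right then gives the lower and upper limits of all three confidence intervals at once.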


Calculating a Single p Value From a Normal Distribution

We first look at how to calculate the p value using the Z-score. The Z-score is found by assuming that the null hypothesis is true, subtracting the assumed mean, and dividing by the theoretical standard deviation. Once the Z-score is found the probability that the value could be less than the Z-score is found using the pnorm command.

This is not enough to get the p value. If the Z-score that is found is positive then we need to take one minus the associated probability. Also, for a two sided test we need to multiply the result by two. Here we avoid these issues and insure that the Z-score is negative by taking the negative of the absolute value.

We now look at a specific example. In the example below we will use a value of a of 5, a standard deviation of 2, and a sample size of 20. We then find the p value for a sample mean of 7:

> a <- 5
> s <- 2
> n <- 20
> xbar <- 7
> z <- (xbar-a)/(s/sqrt(n))
> z
[1] 4.472136
> 2*pnorm(-abs(z))
[1] 7.744216e-06
>

We now look at the same problem only specifying the mean and standard deviation within the pnorm command. Note that for this case we cannot so easily force the use of the left tail. Since the sample mean is more than the assumed mean we have to take two times one minus the probability:

> a <- 5
> s <- 2
> n <- 20
> xbar <- 7
> 2*(1-pnorm(xbar,mean=a,sd=s/sqrt(20)))
[1] 7.744216e-06
>

Calculating a Single p Value From a t Distribution

Finding the p value using a t distribution is very similar to using the Z-score as demonstrated above. The only difference is that you have to specify the number of degrees of freedom. Here we look at the same example as above but use the t distribution instead:

> a <- 5
> s <- 2
> n <- 20
> xbar <- 7
> t <- (xbar-a)/(s/sqrt(n))
> t
[1] 4.472136
> 2*pt(-abs(t),df=n-1)
[1] 0.0002611934
>

We now look at an example where we have a univariate data set and want to find the p value. In this example we use one of the data sets given in the data input chapter. We use the w1 data set:
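As a sketch of the idea (the null value of 0.7 used below is only an assumed number for illustration), the t statistic is built from the sample mean, sample standard deviation, and sample size, and then fed to pt exactly as above:

> w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
> t <- (mean(w1$vals)-0.7)/(sd(w1$vals)/sqrt(length(w1$vals)))
> 2*pt(-abs(t),df=length(w1$vals)-1)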


Calculating Many p Values From a t Distribution

Comparison 1
            Mean   Std. Dev.   Number (pop.)
Group I     10     3           300
Group II    10.5   2.5         230

Comparison 2
            Mean   Std. Dev.   Number (pop.)
Group I     12     4           210
Group II    13     5.3         340

Comparison 3
            Mean   Std. Dev.   Number (pop.)
Group I     30     4.5         420
Group II    28.5   3           400

For each of these comparisons we want to calculate a p value. For each comparison there are two groups. We will refer to group one as the group whose results are in the first row of each comparison above. We will refer to group two as the group whose results are in the second row of each comparison above. Before we can do that we must first compute a standard error and a t-score. We will find general formulae which is necessary in order to do all three calculations at once.

We assume that the means for the first group are defined in a variable called m1. The means for the second group are defined in a variable called m2. The standard deviations for the first group are in a variable called sd1. The standard deviations for the second group are in a variable called sd2. The number of samples for the first group are in a variable called num1. Finally, the number of samples for the second group are in a variable called num2.

With these definitions the standard error is the square root of (sd1^2)/num1+(sd2^2)/num2. The associated t-score is m1 minus m2 all divided by the standard error. The R commands to do this can be found below:

> m1 <- c(10,12,30)
> m2 <- c(10.5,13,28.5)
> sd1 <- c(3,4,4.5)
> sd2 <- c(2.5,5.3,3)
> num1 <- c(300,210,420)
> num2 <- c(230,340,400)
> se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
> t <- (m1-m2)/se
> m1
[1] 10 12 30
> m2
[1] 10.5 13.0 28.5
> sd1
[1] 3.0 4.0 4.5
> sd2
[1] 2.5 5.3 3.0
> num1
[1] 300 210 420
> num2
[1] 230 340 400
> se
[1] 0.2391107 0.3985074 0.2659216
> t
[1] -2.091082 -2.509364  5.640761

To use the pt command we need to specify the number of degrees of freedom. This can be done using the pmin command. Note that there is also a command called min, but it does not work the same way. You need to use pmin to get the correct results. The numbers of degrees of freedom are pmin(num1,num2)-1. So the p values can be found using the following R command:

> pt(t,df=pmin(num1,num2)-1)
[1] 0.01881168 0.00642689 0.99999998

If you enter all of these commands into R you should have noticed that the last p value is not correct. The pt command gives the probability that a score is less than the specified t. The t-score for the last entry is positive, and we want the probability that a t-score is bigger. One way around this is to make sure that all of the t-scores are negative. You can do this by taking the negative of the absolute value of the t-scores:

> pt(-abs(t),df=pmin(num1,num2)-1)
[1] 1.881168e-02 6.426890e-03 1.605968e-08

The results from the command above should give you the p values for a one sided test. It is left as an exercise how to find the p values for a two sided test.

Calculating The Power Of A Test

Here we look at some examples of calculating the power of a test. The examples are for both normal and t distributions. We assume that you can enter data and know the commands associated with basic probability. All of the examples here are for a two sided test, and you can adjust them accordingly for a one sided test.

1. Calculating The Power Using a Normal Distribution
2. Calculating The Power Using a t Distribution
3. Calculating Many Powers From a t Distribution

Calculating The Power Using a Normal Distribution

Here we calculate the power of a test for a normal distribution for a specific example. Suppose that our hypothesis test is the following:

H0: mu = a,

Ha: mu not = a.


The power of a test is the probability that we can reject the null hypothesis at a given mean that is away from the one specified in the null hypothesis. We calculate this probability by first calculating the probability that we accept the null hypothesis when we should not. This is the probability to make a type II error. The power is the probability that we do not make a type II error so we then take one minus the result to get the power.

We can fail to reject the null hypothesis if the sample happens to be within the confidence interval we find when we assume that the null hypothesis is true. To get the confidence interval we find the margin of error and then add and subtract it to the proposed mean, a, to get the confidence interval. We then turn around and assume instead that the true mean is at a different, explicitly specified level, and then find the probability a sample could be found within the original confidence interval.

In the example below the hypothesis test is for

H0: mu = 5,

Ha: mu not = 5.

We will assume that the standard deviation is 2, and the sample size is 20. In the example below we will use a 95% confidence level and wish to find the power to detect a true mean that differs from 5 by an amount of 1.5. (All of these numbers are made up solely for this example.) The commands to find the confidence interval in R are the following:

> a <- 5
> s <- 2
> n <- 20
> error <- qnorm(0.975)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.123477
> right
[1] 5.876523
>

    +e!t we find the Oscores for the left and right values assuming that the true mean is #N.#D>.#&

    > assume Qle*t Qri't [1] 0&0$1%3792

    The probability that we mae a type error if the true mean is >.# is appro!imately B.K. :o the power ofthe test is p&

    > 1-[1] 0&91$3%2

In this example, the power of the test is approximately 91.8%. If the true mean differs from 5 by 1.5 then the probability that we will reject the null hypothesis is approximately 91.8%.
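The steps above can be wrapped into a small helper function so that the calculation is easy to repeat for other shifts. This function is not part of the original example; it is just a sketch that repeats the same qnorm/pnorm steps, assuming a mean of a under the null hypothesis and a true mean of a+shift:

> powerNormal <- function(a,s,n,shift,sig=0.05) {
+    error <- qnorm(1-sig/2)*s/sqrt(n)     # margin of error under the null hypothesis
+    assume <- a + shift                   # the assumed true mean
+    zLeft  <- (a-error-assume)/(s/sqrt(n))
+    zRight <- (a+error-assume)/(s/sqrt(n))
+    1 - (pnorm(zRight)-pnorm(zLeft))      # power = 1 - P(type II error)
+ }
> powerNormal(5,2,20,1.5)                  # should agree with the value found above

Calling it with the numbers used above should reproduce the power of roughly 91.8%.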

Calculating The Power Using a t Distribution


Calculating the power when using a t-test is similar to using a normal distribution. One difference is that we use the command associated with the t distribution rather than the normal distribution. Here we repeat the test above, but we will assume that we are working with a sample standard deviation rather than an exact standard deviation. We will explore three different ways to calculate the power of a test. The first method makes use of the scheme many books recommend if you do not have the non-central distribution available. The second does make use of the non-central distribution, and the third makes use of a single command that will do a lot of the work for us.

In the example the hypothesis test is the same as above,

H0: mu = 5,

Ha: mu not = 5.

Again we assume that the sample standard deviation is 2, and the sample size is 20. We use a 95% confidence level and wish to find the power to detect a true mean that differs from 5 by an amount of 1.5. The commands to find the confidence interval in R are the following:

> a <- 5
> s <- 2
> n <- 20
> error <- qt(0.975,df=n-1)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.063971
> right
[1] 5.936029
>

The number of observations is large enough that the results are quite close to those in the example using the normal distribution. Next we find the t-scores for the left and right values assuming that the true mean is 5+1.5=6.5:

> assume <- a + 1.5
> tleft <- (left-assume)/(s/sqrt(n))
> tright <- (right-assume)/(s/sqrt(n))
> p <- pt(tright,df=n-1) - pt(tleft,df=n-1)
> p
[1] 0.1112583

The probability that we make a type II error if the true mean is 6.5 is approximately 11.1%. So the power of the test is 1-p:

> 1-p
[1] 0.8887417

In this example, the power of the test is approximately 88.9%. If the true mean differs from 5 by 1.5 then the probability that we will reject the null hypothesis is approximately 88.9%. Note that the power calculated for a normal distribution is slightly higher than for this one calculated with the t distribution.

Another way to approximate the power is to make use of the non-centrality parameter. The idea is that you give it the critical t-scores and the amount that the mean would be shifted if the alternate mean were the true mean. This is the method that most books recommend.


> ncp <- 1.5/(s/sqrt(n))
> t <- qt(0.975,df=n-1)
> pt(t,df=n-1,ncp=ncp)-pt(-t,df=n-1,ncp=ncp)
[1] 0.1111522
> 1-(pt(t,df=n-1,ncp=ncp)-pt(-t,df=n-1,ncp=ncp))
[1] 0.8888478

Again, we see that the probability of making a type II error is approximately 11.1%, and the power is approximately 88.9%. Note that this is slightly different than the previous calculation but is still close.

Finally, there is one more command that we explore. This command allows us to do the same power calculation as above but with a single command.

> power.t.test(n=n,delta=1.5,sd=s,sig.level=0.05,
+              type="one.sample",alternative="two.sided",strict = TRUE)

     One-sample t test power calculation 

              n = 20

          delta = 1.5
             sd = 2
      sig.level = 0.05
          power = 0.8888478
    alternative = two.sided

This is a powerful command that can do much more than just calculate the power of a test. For example it can also be used to calculate the number of observations necessary to achieve a given power. For more information check out the help page, help(power.t.test).
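The command can also be run in the other direction: if you give it a target power and leave out n it will solve for the number of observations instead. As a sketch, using the same effect size as above and an illustrative target power of 95%:

> power.t.test(delta=1.5,sd=s,sig.level=0.05,power=0.95,
+              type="one.sample",alternative="two.sided")

The n reported in the output is the (possibly fractional) number of observations needed, which you would round up.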

Calculating Many Powers From a t Distribution

Suppose that you want to find the powers for many tests. This is a common task and most software packages will allow you to do this. Here we see how it can be done in R. We use the exact same cases as in the previous chapter.

Here we assume that we want to do a two-sided hypothesis test for a number of comparisons and want to find the power of the tests to detect a difference in the means. In particular we will look at three hypothesis tests. All are of the following form:

H0: mu1 - mu2 = 0,

Ha: mu1 - mu2 not = 0.

We have three different sets of comparisons to make:

Comparison 1

           Mean   Std. Dev.   Number (pop.)
Group I    10     3           300
Group II   10.5   2.5         230

Comparison 2

           Mean   Std. Dev.   Number (pop.)
Group I    12     4           210
Group II   13     5.3         340

Comparison 3

           Mean   Std. Dev.   Number (pop.)
Group I    30     4.5         420
Group II   28.5   3           400
For each of these comparisons we want to calculate the power of the test. For each comparison there are two groups. We will refer to group one as the group whose results are in the first row of each comparison above. We will refer to group two as the group whose results are in the second row of each comparison above. Before we can do that we must first compute a standard error and a t-score. We will find general formulae, which is necessary in order to do all three calculations at once.

We assume that the means for the first group are defined in a variable called m1. The means for the second group are defined in a variable called m2. The standard deviations for the first group are in a variable called sd1. The standard deviations for the second group are in a variable called sd2. The number of samples for the first group are in a variable called num1. Finally, the number of samples for the second group are in a variable called num2.

With these definitions the standard error is the square root of (sd1^2)/num1+(sd2^2)/num2. The R commands to do this can be found below:

> m1 <- c(10,12,30)
> m2 <- c(10.5,13,28.5)
> sd1 <- c(3,4,4.5)
> sd2 <- c(2.5,5.3,3)
> num1 <- c(300,210,420)
> num2 <- c(230,340,400)
> se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
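A sketch of how the calculation can be finished, using the non-central t approach from the previous section and the pmin(num1,num2)-1 degrees of freedom already used in the p value calculation above; the variable names below are assumptions for illustration:

> deg <- pmin(num1,num2)-1              # conservative degrees of freedom for each comparison
> tcrit <- qt(0.975,df=deg)             # critical t-scores at the 95% level
> ncp <- abs(m1-m2)/se                  # non-centrality parameters
> power <- 1-(pt(tcrit,df=deg,ncp=ncp)-pt(-tcrit,df=deg,ncp=ncp))
> power

Each entry of power is the power of the corresponding comparison to detect the difference between its two group means.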


Sometimes you are given data in the form of a table and would like to create a table. Here we examine how to create the table directly. Unfortunately, this is not as direct a method as might be desired. Here we create an array of numbers, specify the row and column names, and then convert it to a table.

In the example below we will create a table identical to the one given above. In that example we have 3 columns, and the numbers are specified by going across each row from top to bottom. We need to specify the data and the number of rows:

> smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
> colnames(smoke) <- c("High","Low","Middle")
> rownames(smoke) <- c("current","former","never")
> smoke <- as.table(smoke)
> smoke
        High Low Middle
current   51  43     22
former    92  28     21
never     68  22      9

Tools For Working With Tables

Here we look at some of the commands available to help look at the information in a table in different ways. We assume that the data has been entered using one of the methods above, and the table is called "smoke." First, there are a couple of ways to get graphical views of the data:

> barplot(smoke,legend=T,beside=T,main='Smoking Status by SES')
> plot(smoke,main="Smoking Status By Socioeconomic Status")

There are a number of ways to get the marginal distributions using the margin.table command. If you just give the command the table it calculates the total number of observations. You can also calculate the marginal distributions across the rows or columns based on the one optional argument:

> margin.table(smoke)
[1] 356
> margin.table(smoke,1)

current  former   never 
    116     141      99 

> margin.table(smoke,2)

  High    Low Middle 
   211     93     52 

Combining these commands you can get the proportions:

> smoke/margin.table(smoke)

              High        Low     Middle
current 0.14325843 0.12078652 0.06179775
former  0.25842697 0.07865169 0.05898876
never   0.19101124 0.06179775 0.02528090
> margin.table(smoke,1)/margin.table(smoke)

  current    former     never 
0.3258427 0.3960674 0.2780899 


> margin.table(smoke,2)/margin.table(smoke)

     High       Low    Middle 
0.5926966 0.2612360 0.1460674 

That is a little obtuse, so fortunately there is a better way to get the proportions using the prop.table command. You can specify the proportions with respect to the different marginal distributions using the optional argument:

> prop.table(smoke)

              High        Low     Middle
current 0.14325843 0.12078652 0.06179775
former  0.25842697 0.07865169 0.05898876
never   0.19101124 0.06179775 0.02528090
> prop.table(smoke,1)

             High       Low    Middle
current 0.4396552 0.3706897 0.1896552
former  0.6524823 0.1985816 0.1489362
never   0.6868687 0.2222222 0.0909091
> prop.table(smoke,2)

             High       Low    Middle
current 0.2417062 0.4623656 0.4230769
former  0.4360190 0.3010753 0.4038462
never   0.3222749 0.2365591 0.1730769

If you want to do a chi-squared test to determine if the proportions are different, there is an easy way to do this. If we want to test at the 95% confidence level we need only look at a summary of the table:

> summary(smoke)
Number of cases in table: 356 
Number of factors: 2 
Test for independence of all factors:
        Chisq = 18.51, df = 4, p-value = 0.0009808

Since the p-value is less than 5% we can reject the null hypothesis at the 95% confidence level and can say that the proportions vary.

Of course, there is a hard way to do this. This is not for the faint of heart and involves some linear algebra which we will not describe. If you wish to calculate the table of expected values then you need to multiply the vectors of the margins and divide by the total number of observations:

> expected <- as.array(margin.table(smoke,1)) %*% t(as.array(margin.table(smoke,2))) / margin.table(smoke)
> expected

             High      Low   Middle
current 68.75281 30.30337 16.94382
former  83.57022 36.83427 20.59551
never   58.67697 25.86236 14.46067

    (he "t" function takes the transpose of the array.)


The result is in this array and can be directly compared to the existing table. We need the square of the difference between the two tables divided by the expected values. The sum of all these values is the Chi-squared statistic:

> chi <- sum((expected-as.array(smoke))^2/expected)
> chi

[1] 18.50974

We can then get the p-value for this statistic:

> 1-pchisq(chi,df=4)
[1] 0.0009808236
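A quicker cross check is the chisq.test command, which carries out the same chi-squared test of independence directly and also stores the table of expected counts:

> chisq.test(smoke)
> chisq.test(smoke)$expected    # should match the expected table computed by hand above

The statistic, degrees of freedom, and p-value it reports should agree with the summary(smoke) output above.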

Case Study: Working Through a HW Problem

We look at a sample homework problem and the R commands necessary to explore the problem. It is assumed that you are familiar with all of the commands discussed throughout this tutorial.

1. Problem Statement
2. Transforming the Data
3. The Confidence Interval
4. Test of Significance
5. The Power of the test

Problem Statement

This problem comes from the 5th edition of Moore and McCabe's Introduction to the Practice of Statistics and can be found on pp. 466-468. The data consists of the emissions of three different pollutants from 46 different engines. A copy of the data we use here is available. The problem examined here is different from that given in the book but is motivated by the discussion in the book.

In the following examples we will look at the carbon monoxide data, which is one of the columns of this data set. First we will transform the data so that it is close to being normally distributed. We will then find the confidence interval for the mean and then perform a significance test to evaluate whether or not the data is away from a fixed standard. Finally, we will find the power of the test to detect a fixed difference from that standard. We will assume that a confidence level of 95% is used throughout.

Transforming the Data

We first begin a basic examination of the data. The first step is to read in the file and get a summary of the center and spread of the data. In this instance we will focus only on the carbon monoxide data.

> engine <- read.csv(file="engine.csv",head=TRUE,sep=",")   # the file name here is illustrative
> summary(engine)


 1st Qu.:12.75   1st Qu.:0.4375   1st Qu.: 4.388   1st Qu.:1.110  
 Median :24.50   Median :0.5100   Median : 5.905   Median :1.315  
 Mean   :24.00   Mean   :0.5502   Mean   : 7.879   Mean   :1.340  
 3rd Qu.:35.25   3rd Qu.:0.6025   3rd Qu.:10.015   3rd Qu.:1.495  
 Max.   :46.00   Max.   :1.1000   Max.   :23.530   Max.   :2.940  

    >

At first glance the carbon monoxide data appears to be skewed. The spread between the third quartile and the max is five times the spread between the min and the first quartile. A boxplot is shown in Figure 1, showing that the data appears to be skewed. This is further confirmed in the histogram which is shown in Figure 2. Finally, a normal qq plot is given in Figure 3, and the data does not appear to be normal.

> qqnorm(engine$co,main="Carbon Monoxide")
> qqline(engine$co)
> boxplot(engine$co,main="Carbon Monoxide")
> hist(engine$co,main="Carbon Monoxide")
> qqnorm(engine$co,main="Carbon Monoxide")
> qqline(engine$co)
>

Figure 1. Boxplot of the Carbon Monoxide Data.
Figure 2. Histogram of the Carbon Monoxide Data.


Figure 4. Boxplot of the Logarithm of the Carbon Monoxide Data.

Figure 5. Histogram of the Logarithm of the Carbon Monoxide Data.

Figure 6. Normal QQ Plot of the Logarithm of the Carbon Monoxide Data.

There is strong evidence that the logarithm of the carbon monoxide data more closely resembles a normal distribution than does the raw carbon monoxide data. For that reason all of the analysis that follows will be for the logarithm of the data and will make use of the new list "lengine."
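The commands that construct this transformed list are assumed to look like the following sketch, taking the carbon monoxide column to be engine$co as in the plots above:

> lengine <- log(engine$co)     # natural logarithm of the carbon monoxide data
> boxplot(lengine,main="Carbon Monoxide (log)")
> hist(lengine,main="Carbon Monoxide (log)")
> qqnorm(lengine,main="Carbon Monoxide (log)")
> qqline(lengine)

These commands would produce plots like those described in Figures 4 through 6 above.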

The Confidence Interval

We now find the confidence interval for the carbon monoxide data. As stated above, we will work with the logarithm of the data because it appears to be closer to a normal distribution. This data is stored in the list


    called "lengine." :ince we do not now the true standard deviation we will use the sample standarddeviation and will use a tdistribution.

We first find the sample mean, the sample standard deviation, and the number of observations:

> m <- mean(lengine)
> s <- sd(lengine)
> n <- length(lengine)
> se <- s/sqrt(n)
> error <- qt(0.975,df=n-1)*se
> left <- m-error
> right <- m+error
> left
[1] 1.709925
> right
[1] 2.057431
>

The 95% confidence interval is between 1.71 and 2.06. Keep in mind that this is for the logarithm, so the 95% confidence interval for the original data can be found by "undoing" the logarithm:

> exp(left)
[1] 5.528548
> exp(right)
[1] 7.82584
>

So the 95% confidence interval for the carbon monoxide is between 5.5 and 7.8.

Test of Significance

We now perform a test of significance. Here we suppose that ideally the engines should have a mean level of 5.4 and do a two-sided hypothesis test. Here we assume that the true mean is labeled "mu" and state the hypothesis test:


H0: mu = 5.4,

Ha: mu not = 5.4.

To perform the hypothesis test we first assume that the null hypothesis is true and find the confidence interval around the assumed mean. Fortunately, we can use the values from the previous step:

> lNull <- log(5.4)-error
> rNull <- log(5.4)+error
> lNull
[1] 1.512646
> rNull
[1] 1.860152
> m
[1] 1.883678
>

The sample mean lies outside of the assumed confidence interval so we can reject the null hypothesis. There is a low probability that we would have obtained our sample mean if the true mean really were 5.4.

Another way to approach the problem would be to calculate the actual p-value for the sample mean that was found. Since the sample mean is greater than 5.4 it can be found with the following code:

> 2*(1-pt((m-log(5.4))/se,df=n-1))
[1] 0.02692539

Since the p-value is 2.7%, which is less than 5%, we can reject the null hypothesis.

Note that there is yet another way to do this. The function t.test will do a lot of this work for us.

> t.test(lengine,mu = log(5.4),alternative = "two.sided")

        One Sample t-test

data:  lengine 
t = 2.2841, df = 47, p-value = 0.02693
alternative hypothesis: true mean is not equal to 1.686399 
95 percent confidence interval:
 1.709925 2.057431 
sample estimates:
mean of x 
 1.883678 

More information and a more complete list of the options for this command can be found using the help command:

> help(t.test)

The Power of the test

We now find the power of the test. To find the power we need to set a level for the mean and then find the probability that we would accept the null hypothesis if the mean is really at the prescribed level. Here we will find the power to detect a difference if the level were 7. Three different methods are examined.


The first is a method that some books advise to use if you do not have a non-central t-test available. The second does make use of the non-central t-test. Finally, the third method makes use of a customized R command.

We first find the probability of accepting the null hypothesis if the level really were 7. We assume that the true mean is 7 and then find the probability that a sample mean would fall within the confidence interval if the null hypothesis were true. Keep in mind that we have to transform the level of 7 by taking its logarithm.

Also keep in mind that this is a two-sided test:

> tLeft <- (lNull-log(7))/(s/sqrt(n))
> tRight <- (rNull-log(7))/(s/sqrt(n))
> p <- pt(tRight,df=n-1) - pt(tLeft,df=n-1)
> t <- qt(0.975,df=n-1)
> shift <- (log(7)-log(5.4))/(s/sqrt(n))
> 1-(pt(t,df=n-1,ncp=shift)-pt(-t,df=n-1,ncp=shift))
[1] 0.8371421
>

Again, we see that the power of the test is approximately 83.7%. Note that this result is slightly off from the previous answer. This approach is often recommended over the previous approach.

The final approach we examine allows us to do all the calculations in one step. It makes use of the non-centrality parameter as in the previous example, but all of the commands are done for us.

> power.t.test(n=n,delta=log(7)-log(5.4),sd=s,sig.level=0.05,
+              type="one.sample",alternative="two.sided",strict = TRUE)

     One-sample t test power calculation 

              n = 48
          delta = 0.2595112

             sd = 0.5983851
      sig.level = 0.05
          power = 0.8371421
    alternative = two.sided

This is a powerful command that can do much more than just calculate the power of a test. For example it can also be used to calculate the number of observations necessary to achieve a given power. For more information check out the help page, help(power.t.test).
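As in the earlier power example, the same command can be turned around to ask how many engines would be needed to reach a given power. As a sketch, with an illustrative target power of 90% for the same shift:

> power.t.test(delta=log(7)-log(5.4),sd=s,sig.level=0.05,power=0.90,
+              type="one.sample",alternative="two.sided")

Leaving out n and supplying power tells power.t.test to solve for the required number of observations.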


Case Study II: A JAMA Paper on Cholesterol

We look at a paper that appeared in the Journal of the American Medical Association and explore how to use R to confirm the results. It is assumed that you are familiar with all of the commands discussed throughout this tutorial.

1. Overview of the Paper
2. The Tables
3. Confirming the p-values in Table 3
4. Confirming the p-values in Table 4
5. Finding the Power of the Test in Table 3
6. Differences by Race in Table 5
7. Summary

Overview of the Paper

The paper we examine is Trends in Serum Lipids and Lipoproteins of Adults, 1960-2002, Margaret D. Carroll, MSPH; David A. Lacher, MD; Paul D. Sorlie, PhD; James I. Cleeman, MD; David J. Gordon, MD, PhD; Michael Wolz, MS; Scott M. Grundy, MD, PhD; Clifford L. Johnson, MSPH, Journal of the American Medical Association, October 12, 2005, Vol 294, No. 14, pp. 1773-1781. The goal is to confirm the results and explore some of the other results not explicitly addressed in the paper. This paper received a great deal of attention in the media. A partial list of some of the articles is the following:

Fox News
www.medpagetoday.com
Argus Leader
The Globe and Mail

The authors examine the trends of several studies of cholesterol levels of Americans. The studies have been conducted in 1960-1962, 1971-1974, 1976-1980, 1988-1994, and 1999-2002. The studies previous to 1999 have indicated that overall cholesterol levels are declining. The authors of this paper focus on the changes between the two latest studies, 1988-1994 and 1999-2002. They concluded that within certain populations cholesterol levels have decreased over this time.

One of the things that received a great deal of attention is the linkage the authors drew between lowered cholesterol levels and increased use of new drugs to lower cholesterol. Here is a quote from their conclusions:

The increase in the proportion of adults using lipid-lowering medication, particularly in older age groups, likely contributed to the decreases in total and LDL cholesterol levels observed.

Here we focus on confirming the results listed in Tables 3 and 4 of the paper. We confirm the p-values given in the paper and then calculate the power of the test to detect a prescribed difference in cholesterol levels.

The Tables

Links to the tables in the paper are given below. Links are given to verbatim copies of the tables. For each table there are two links. The first is to a text file displaying the table. The second is to a csv file to be loaded into R. It is assumed that you have downloaded each of the csv files and made them available.

    2in"s to the )a+les in the paper.


Table 1: text, csv
Table 2: text, csv
Table 3: text, csv
Table 4: text, csv
Table 5: text, csv
Table 6: text, csv

Confirming the p-values in Table 3

The first thing we do is confirm the p-values. The paper does not explicitly state the hypothesis test, but they use a two-sided test as we shall soon see. We will explicitly define the hypothesis test that the authors are using but first need to define some terms. We need the means for the 1988-1994 and the 1999-2002 studies and will denote them M88 and M99 respectively. We also need the standard errors and will denote them SE88 and SE99 respectively.

In this situation we are trying to compare the means of two experiments and do not have matched pairs. With this in mind we can define our hypothesis test:

H0: M88 - M99 = 0,
Ha: M88 - M99 not = 0.

When we assume that the null hypothesis is true we calculate the p-values using the following values:

Sample Mean = M88 - M99,

SE = sqrt(SE88^2 + SE99^2).

Note that the standard errors are given in the data, and we do not have to use the number of observations to calculate the standard error. However, we do need the number of observations in calculating the p-value. The authors used a t-test. There are complicated formulas used to calculate the degrees of freedom for the comparison of two means, but here we will simply find the minimum of the two numbers of observations and subtract one.

We first need to read in the data from "table3.csv" and will call the variable "t3." Note that we use a new option, row.names="group". This option tells R to use the entries in the "group" column as the row names. Once the table has been read we will need to make use of the means in the 1988-1994 study ("t3$M.88") and the means in the 1999-2002 study ("t3$M.99"). We will also have to make use of the corresponding standard errors ("t3$SE.88" and "t3$SE.99") and the number of observations ("t3$N.88" and "t3$N.99").

> t3 <- read.csv(file="table3.csv",head=TRUE,sep=",",row.names="group")


> t3$M.99
 [1] 203 203 203 202 183 200 212 215 204 195 202 204 183 194 203 216 223 217
> diff <- t3$M.88 - t3$M.99
> diff
 [1]  1  3  1  2 -3  1 -1  1 10 10  3  3  0 -5  1 12 12 14
> se <- sqrt(t3$SE.88^2 + t3$SE.99^2)
> se
 [1] 1.140175 1.063015 1.500000 1.500000 2.195450 2.193171 3.361547 3.041381
 [9] 2.193171 3.328663 1.131371 1.063015 2.140093 1.984943 2.126029 2.483948
[17] 2.126029 2.860070
> deg <- pmin(t3$N.88,t3$N.99)-1
> deg
 [1] 7739 8808 3648 4164  673  672  759  570  970  515 4090 4643  960  860  753
[16]  568  945  552

We can now calculate the t-statistic. From the null hypothesis, the assumed mean of the difference is zero.

We can then use the pt command to get the p-values.

> t <- diff/se
> t
 [1]  0.8770580  2.8221626  0.6666667  1.3333333 -1.3664626  0.4559608
 [7] -0.2974821  0.3287980  4.5596075  3.0042088  2.6516504  2.8221626
[13]  0.0000000 -2.5189636  0.4703604  4.8310181  5.6443252  4.8949852
> pt(t,df=deg)
 [1] 0.809758825 0.997609607 0.747486382 0.908752313 0.086125089 0.675717245
 [7] 0.383089952 0.628785421 0.999997110 0.998603837 0.995979577 0.997604809
[13] 0.500000000 0.005975203 0.680883135 0.999999125 0.999999989 0.999999354

There are two problems with the calculation above. First, some of the t-values are positive, and for positive values we need the area under the curve to the right. There are a couple of ways to fix this, and here we will ensure that the t-scores are negative by taking the negative of the absolute value. The second problem is that this is a two-sided test, and we have to multiply the probability by two:

> pt(-abs(t),df=deg)
 [1] 1.902412e-01 2.390393e-03 2.525136e-01 9.124769e-02 8.612509e-02
 [6] 3.242828e-01 3.830900e-01 3.712146e-01 2.889894e-06 1.396163e-03
[11] 4.020423e-03 2.395191e-03 5.000000e-01 5.975203e-03 3.191169e-01
[16] 8.748656e-07 1.095966e-08 6.462814e-07
> 2*pt(-abs(t),df=deg)

 [1] 3.804823e-01 4.780786e-03 5.050272e-01 1.824954e-01 1.722502e-01
 [6] 6.485655e-01 7.661799e-01 7.424292e-01 5.779788e-06 2.792326e-03
[11] 8.040845e-03 4.790382e-03 1.000000e+00 1.195041e-02 6.382337e-01
[16] 1.749731e-06 2.191933e-08 1.292563e-06
>

These numbers are a close match to the values given in the paper, but the output above is hard to read. We introduce a new command to loop through and print out the results in a format that is easier to read.


The for loop allows you to repeat a command a specified number of times. Here we want to go from 1, 2, 3, ..., to the end of the list of p-values and print out the group and associated p-value:

> p <- 2*pt(-abs(t),df=deg)
> for (j in 1:length(p)) {
+    cat("p-value for ",row.names(t3)[j]," ",p[j],"\n")
+ }

p-value for  all   0.3804823 
p-value for  20   0.004780786 
p-value for  men   0.5050272 
p-value for  m20   0.1824954 
p-value for  m20-29   0.1722502 
p-value for  m30-39   0.6485655 
p-value for  m40-49   0.7661799 
p-value for  m50-59   0.7424292 
p-value for  m60-74   5.779788e-06 
p-value for  m75   0.002792326 
p-value for  women   0.008040845 
p-value for  w20   0.004790382 
p-value for  w20-29   1 
p-value for  w30-39   0.01195041 
p-value for  w40-49   0.6382337 
p-value for  w50-59   1.749731e-06 
p-value for  w60-74   2.191933e-08 
p-value for  w75   1.292563e-06 
>

We can now compare this to Table 3 (given in the link above) and see that we have good agreement. The differences come from round off errors from using the truncated values reported in the paper.