r Programming Life Sciences Aug 2009

  • 8/10/2019 r Programming Life Sciences Aug 2009

    1/120

    R Programming for Life Scientists

    Version 2.0
    Raymond R. Balise, Ph.D.

    Health Research and Policy
    Spectrum

    Roadmap

    What makes R different from the rest?
    Setting up R
    Types of data
    Working with collections of data
    Importing and exporting data
    Writing functions
    Graphics

    When to Use R

    Shoestring budget
    Cutting-edge statistics
    Developing your own or fine-tuning existing methods
    Local expertise

    Programming Languages

    Procedural languages: C, Fortran, Cobol, Basic
    They use a model where the logic flows from the top of the page to the bottom, with gotos or calls to subroutines as needed.
    It is hard to encapsulate the code.

    Object-oriented languages: C++, Visual Basic, Java
    They involve creating objects and then operating on them.

    R is Object Oriented (OO)

    You create objects: a vector of numbers, a graphic, etc.

    You call methods/functions to operate on the objects.

    Working with an OO language requires you to learn about special methods to create, access, modify, or destroy objects and their properties. R hides these processes.

    It helps a lot if you want to write new statistics and methods, and is required for making new packages.

    OO Example

    With R you write code in the editor, which I will show you in a minute.

    You can create an object which holds a bunch of numbers (a vector, if you remember math).

    You can then use (aka "call") a function (aka "method") to operate on the object.

    The summary() function: create and display a numeric summary object.
    The plot() function: create and display a graphic summary object.

    Make the ages object.

    Call the summary function.

    Call the plot function.

    But wait, there's more!

    There is a lot of functionality built into R. It ships with libraries that do many different tasks. And you can download more.

    Map most of the USA.

    Activate the map datasets and functions.

    But hold on. There is MORE!

    You can add options to the function calls to make them do fancy things like color.

    Or you can have one function act on the output of another function.

    And you can save output as objects!
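    The three ideas on this slide can be sketched in a few lines. This is only an illustration: the ages values are borrowed from the vector used later in the deck, the object names ageSummary and roundedMean are made up here, and pdf(NULL) is an assumption added so the sketch also runs as a script without opening a window.

```r
# The ages vector used later in these slides
ages = c(9, 11, 40, 41)

# Save the output of a function as an object, then display it
ageSummary = summary(ages)
ageSummary

# One function acting on the output of another function
roundedMean = round(mean(ages))
roundedMean

# An option (col) added to a function call to get color.
# pdf(NULL) is an assumption for running as a script: it draws to a
# device that writes no file. At the console you can drop it and dev.off().
pdf(NULL)
hist(ages, col = "lightblue", main = "Ages")
dev.off()
```

    At the console, typing the name of a saved object (like ageSummary) displays it again without recomputing it.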

    Important Objects

    Vectors are ordered series of values, such as a list of numbers.
    Data frames are like database tables or spreadsheets.

    Where to Get R

    R has two main websites.

    One describes the project: http://www.r-project.org/

    The other has most of the stuff you want to download: http://cran.r-project.org/

    Because the R project has people working all over the globe, the software download site is mirrored everywhere. The closest mirror is USA CA1 (aka UC Berkeley): http://cran.cnr.berkeley.edu/

    There is an R installer for all the common operating systems:

    http://cran.cnr.berkeley.edu/bin/windows/base/
    http://cran.cnr.berkeley.edu/bin/macosx/
    http://cran.cnr.berkeley.edu/bin/linux/

    Each is basically self-explanatory.

    Installing on Windows

    Double click the installer and just push Next until you get to this screen.

    Specify that you want to do a customized startup.

    This will let you set up R to work nicely with other programs.

    Customize

    Use these options, then hit Next> a bunch.

    Type help.start() and push Enter to start the help.
    Type q() and push Enter to quit, but don't yet.

    GUI

    Use the built-in editor.

    Save or restore all the objects in use.

    Save or reload the code from the console.

    Keep all the text in the console for the session.

    Set the working directory to save objects.

    GUI

    Edit existing data.

    Tweak the appearance of the console.

    Rprofile.site

    If you have instructions that you always want run when R starts up, you can include them in the Rprofile.site file:
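    The slide showed the file itself as a screenshot, which did not survive. As a stand-in, here is a minimal sketch of what an Rprofile.site might contain. The choice of setting is an assumption, not the author's actual file; it simply points R at the Berkeley mirror named earlier so you are not asked to pick a mirror every session.

```r
# Hypothetical Rprofile.site contents (not the file shown on the slide).
# local() keeps the helper variable r out of your workspace.
local({
  r = getOption("repos")
  r["CRAN"] = "http://cran.cnr.berkeley.edu/"   # mirror mentioned in these slides
  options(repos = r)
})
```

    Anything placed in this file runs for every user of that R installation, so keep it short.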

    GUI

    Common commands.

    Show the add-on packages currently accessible.

    Packages in R

    User-supplied packages are typically found at one of three places:
      CRAN for all kinds of stuff
      Omegahat for web-based statistics
      Bioconductor for genomic analysis

    R packages update often.

    Your colleagues will recommend task-specific packages. Rcmdr is my favorite.

    GUI

    Use a previously downloaded package. I type library(name) instead.

    USA (CA1) is closest to Stanford.

    Choose which set of packages to look at.

    See the HUGE list of packages.

    Update often!

    GUI

    This is useful.

    HTML help

    This is useful but not Google.

    This will not find information if you have not installed the packages.

    Rseek.org is Google-driven

    I highly recommend it.

    Mac Quick Help

    Search help for the word "map".

    Search for details on a function if the package is loaded and you know the function's name.

    Windows Quick Help

    Search help for the word "map".

    Search help for the function named "map".

    Load the package.

    Mac Install

    Download and double click the dmg file.

    Click Customize and make sure Tcl/Tk is checked on.

    X11

    Some packages for R on the Mac (like Rcmdr) require X11 to be installed.

    I think it is part of the standard Leopard installation but was an option with Tiger. If you need it, try to install it off of the DVD that came with your machine because people have reported problems using the dmg files from Apple.com.

    X11 and Add-on Packages

    To get add-on packages, use this menu.

    You can click here to make sure X11 works.

    Getting or Updating Packages

    Click Get List, click the package name, be sure "install dependencies" is checked on, then click Install.

    Instead of Point and Click

    You can also run this code to have Mac or Windows R download a list of packages:

        usefulPackages = c("car", "foreign", "hexbin", "gdata", "ggplot2",
                           "gmodels", "gplots", "Hmisc", "reshape", "Rcmdr")
        install.packages(usefulPackages, dependencies = TRUE)

    Be sure to take note of any packages that do not install.

    marray, affy, Biobase, and Rgraphviz were not available.

    Your First Package

    I suggest you install the Rcmdr package first thing.

    Use the Install packages option on the package menu to download Rcmdr.

    To make it available for your R session type:

        library(Rcmdr)

    CAPITALIZATION MATTERS!

    The first time you run it, it will ask you if it can download additional packages.

    If you are on Windows you can directly import Excel.

    On a Mac you cannot directly import from Excel.

    Hate Typing?

    Tab is your friend. It will auto-complete if it can or give you a list of functions that match what you have typed. It works very well on the Mac. In Windows sometimes you need to type tab twice.

    In Windows, if you type tab after a ( it displays the options for the function; on the Mac they just appear.

    Data Set Objects

    Vectors
      A bunch of data in a single row or column
      All of the same type

    Matrix
      A row and column arrangement of data
      All of the same type

    Data frame
      A row and column arrangement of data
      Columns can be of different types
      Like a good spreadsheet or relational database file

    List
      Very free-form structure
      A grouping of different types of data
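    To make the four structures concrete, here is a small sketch that builds one of each. The object names (m, stoogeData, everything) are invented for this illustration; the ages and stooge names come from later slides.

```r
# A vector: values of a single type
ages = c(9, 11, 40, 41)

# A matrix: rows and columns, still a single type
m = matrix(1:6, nrow = 2, ncol = 3)

# A data frame: rows and columns, where columns may differ in type
stoogeData = data.frame(name = c("Larry", "Moe", "Curly", "Shemp"),
                        age  = ages)

# A list: a free-form grouping of different kinds of objects
everything = list(numbers = ages, grid = m, table = stoogeData)

dim(m)                      # rows, then columns
sapply(stoogeData, class)   # one class per column
names(everything)
```

    Notice that the list happily holds the vector, the matrix, and the data frame side by side.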

    Types of Data Vectors

    Numeric
      Integer, real, and complex are different types but you will not need to pay attention to the details
      NA means missing
      NaN means not a number

    String
      Characters of the alphabet

    Logical
      TRUE, FALSE or NA
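    A quick way to see these types, and the difference between NA and NaN, is to ask R directly. The vector x below is a made-up example.

```r
typeof(c(9, 11))      # "double" (a real number)
typeof(c(9L, 11L))    # "integer" (the L suffix asks for integers)
typeof("Moe")         # "character"
typeof(TRUE)          # "logical"

# A made-up vector with one missing value and one not-a-number
x = c(9, NA, 0/0)
is.na(x)     # FALSE  TRUE  TRUE   (NaN also counts as missing)
is.nan(x)    # FALSE FALSE  TRUE   (NA is not NaN)
mean(x, na.rm = TRUE)   # 9, because both NA and NaN are dropped
```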

    Making Vectors With c()

    c stands for concatenate

        ages = c(9, 11, 40, 41); ages
        stooges = c("Larry", "Moe", "Curly", "Shemp"); stooges

    Getting Details

    You can use the is. functions and length() to get details on a vector.

        is.vector(ages)
        is.numeric(ages)
        is.logical(ages)
        length(ages)

    Recycling and Vectorizing

    You can add one to all four ages.

        ages + c(1,1,1,1)

    If you provide the scalar integer, R will temporarily vectorize the 1 by recycling that value to match the length of the ages vector.

        ages + 1

    It will recycle a series also.

        ages
        ages + c(1,2)

    Naming Parts of a Vector

    You can assign names to the elements of a vector. This allows later access to the elements using the names instead of the position.

        names(ages) = stooges
        ages

    To erase them:

        names(ages) = NULL; ages

    Notice what happens when the lengths differ:

        stooges = c("Larry", "Moe", "Curly")
        names(ages) = stooges
        ages

    Attributes

    When you add names to things (objects) they acquire or change their names attribute.

        attributes(ages)

    When you strip off the names, the vector is left with no attributes.

        names(ages) = NULL
        attributes(ages)

    Complex Objects

    A data frame is an object with many attributes.

    R ships with a lot of datasets if you want one:

        help.start()

    Click Packages, then datasets.

        esoph
        ?esoph
        attributes(esoph)

    Getting at Parts of a Vector

    Specify the element number.

        heyMoe = ages[2]; heyMoe

    Specify which element numbers to drop, keeping everything else.

        ages[c(-1, -3, -4)]

    Specify a vector of TRUE and FALSE values.

        ages[c(FALSE, TRUE, FALSE, FALSE)]

    Getting Parts with Names

        ages = c(9, 11, 40, 41); ages
        names(ages) = c("Larry", "Moe", "Curly", "Shemp")
        ages

    Specify the name.

        heyMoe = ages["Moe"]

    Duplicate Names

    That code only returns the first one if there are duplicates.

        names(ages)[4] = "Moe"
        ages
        heyMoe = ages["Moe"]
        heyMoe

    This gives all of them if there are duplicates:

        names(ages) %in% "Moe"
        ages[names(ages) %in% "Moe"]

    Parts of a Data Frame

    You can select columns of a data frame just like you selected elements from a vector.

        booze = esoph["alcgp"]
        is.data.frame(booze)
        esoph[2]
        esoph[c(4,5)]

    Choosing Records

    If you put a single item or series inside of the square brackets, R thinks you are requesting columns.

    If you want to get access to specific rows, you include a comma after the rows: blah[rows, columns]

        esoph[1, ]
        esoph[1:3, ]

    Smarter Access to a Vector

    You can use logic checks to find the record numbers in a vector which meet your criteria.

        ages < 21
        which(ages < 21)

    You can then subset down your data to the records of interest using the [ ] subset operator.

        ages[which(ages < 21)]
        ages[ages < 21]

    Subset a Data Frame

    Recall that you can select rows with frameName[rows, columns] and if you do not include a comma, all records are chosen.

    which(esoph$ncases > 0) gives you a list of records which adhere to that rule. Therefore, the code below gives you a subset:

        esoph[which(esoph$ncases > 0), ]

    or

        esoph[esoph$ncases > 0, ]

    Choosing Values

    If you need specific values, you can use the & (and) or the | (or) operators to get the ordered set of TRUE and FALSE values.

        ages > 21 & ages < 41

    ! means not:

        !(ages > 21 & ages < 41)

    Notice that it is applying the one logic check to the vector of ages. How does it do that?

    Math on Data Frame Columns

    You have seen how to do scalar and vector algebra. Algebra on a data frame is easy.

        names(esoph)
        esoph$total = esoph$ncases + esoph$ncontrols

    To see the end of the data frame, use tail():

        tail(esoph)

    Comparing Against Vectors

    What happens when you try to compare a vector to a set of things?

        gender = c(NA, "Male", "Female", "Blue", "Female")
        gender == "Male" | gender == "Female"
        gender == c("Male", "Female")    # this one uses recycling and gives wrong answers

    R recycles the shorter vector to be the longer length, then does the comparison. Use the %in% operator if you want to compare as if you wrote a series of or statements.

        gender %in% c("Male", "Female")
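    It can help to see the three results side by side. The sketch below reuses the gender vector from this slide; the names byOr, recycled, and byIn are made up here, and suppressWarnings() is only there because R warns that 5 is not a multiple of 2 when it recycles.

```r
gender = c(NA, "Male", "Female", "Blue", "Female")

# A series of "or" statements: the missing value stays missing
byOr = gender == "Male" | gender == "Female"
byOr        # NA TRUE TRUE FALSE TRUE

# Recycling: gender is compared, position by position, against
# "Male", "Female", "Male", "Female", "Male"
recycled = suppressWarnings(gender == c("Male", "Female"))
recycled    # NA FALSE FALSE FALSE FALSE  (not what you wanted)

# %in% asks "is each element a member of this set?"
byIn = gender %in% c("Male", "Female")
byIn        # FALSE TRUE TRUE FALSE TRUE
```

    Note that %in% reports FALSE for the missing value, while the or statements report NA, so pick the one that matches how you want missing data treated.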

    Categorical Variables

    R makes a distinction between variables holding a bunch of characters from the alphabet and variables holding categorical information. If you have a classification/categorical variable, you want R to treat it as a factor or an ordered factor. Typical factors are treatment or gender.

        dose = c("low", "placebo", "high", "low")
        dose
        typeof(dose)

    Factors

    To convert a character variable to a factor, use the as.factor function.

        doseF = as.factor(dose)
        typeof(doseF)
        class(doseF)

    Behind the scenes, the character variable is converted into numbers and the numbers are given character strings to display.

    In modern R the levels of the factor are ordered alphabetically and the first one is represented with the digit 1, the second is 2, etc.

    There are is. and as. functions to check object types (the is. predicates) or convert between types of objects.
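    You can peek behind the scenes to see the numbers and the labels separately. This sketch reuses the dose vector from the previous slide; doseO is a name made up here to show an ordered factor, with a level order I chose for illustration.

```r
dose  = c("low", "placebo", "high", "low")
doseF = as.factor(dose)

levels(doseF)       # "high" "low" "placebo"  (alphabetical)
as.integer(doseF)   # 2 3 1 2  (the position of each value in the levels)
typeof(doseF)       # "integer": how it is stored
class(doseF)        # "factor": how R treats it

# An ordered factor lets you say which level comes first
doseO = factor(dose, levels = c("placebo", "low", "high"), ordered = TRUE)
doseO < "high"      # TRUE TRUE FALSE TRUE
```

    With a plain factor the alphabetical order puts "high" before "low", which is rarely what you want for a dose, so spelling out the levels is usually worth the typing.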

    Comparing Factors

    You can compare a factor vs. a constant value.

        doseF == "high"
        as.integer(doseF) == 1

    Or you can compare vs. vectors (CAREFULLY).

        doseF == c("high", "low")       # notice the wrong answer thanks to recycling
        doseF %in% c("high", "low")

    R will stop you from comparing factors that have different categories.

        doseF2 = as.factor(c("blah", "placebo", "high", "low"))
        doseF == doseF2

    Recoding Factors

    Often you will want to regroup factor levels.

        amount = as.factor(c("placebo", "10mg", "5mg", "10mg"))
        levels(amount)
        regroup = list(none = "placebo", some = c("5mg", "10mg"))
        levels(amount) = regroup
        amount

    New level "none": placebo
    New level "some": 5mg, 10mg

    Numeric Factors

    If you have numeric factors, be careful converting from factors back to numbers.

        ID = c(1000, 1000, 1001, 2)
        IDf = factor(ID)
        as.integer(IDf)
        levels(IDf)
        numbersAgain = as.numeric(levels(IDf))[IDf]
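    Running the slide's code with the results written out shows why the care is needed. Nothing new is assumed here beyond the comments.

```r
ID  = c(1000, 1000, 1001, 2)
IDf = factor(ID)

levels(IDf)        # "2" "1000" "1001"  (the labels, as text)
as.integer(IDf)    # 2 2 3 1  (the internal codes, NOT the original IDs)

# Convert the labels to numbers, then index them by the factor
numbersAgain = as.numeric(levels(IDf))[IDf]
numbersAgain       # 1000 1000 1001 2
```

    The trap is as.integer(IDf): it runs without complaint but hands back the codes 2 2 3 1 instead of the IDs.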

    Easier Recoding

    Other packages like car have functions to recode:

        library(car)
        newAge2 = recode(ages, ' 1:21="Young"; else= "Old" ')
        newAge2
        detach("package:car")

    Attaching Data Frames

    People who really hate typing attach data frames so they can refer to the columns with short names.

        bmi = women$weight / women$height ^2 * 703

    Instead you can attach the women data frame and write an easier formula:

        attach(women)
        search()
        bmi = weight / height ^2 * 703; bmi
        detach(women)

    Keeping Track

    You can see what datasets are in each of the work environments/packages with the list stuff function ls().

        rm(list=ls(all=TRUE))    # careful: this erases every object in the workspace
        ls()
        search()
        attach(women)
        ls()
        search()
        head(women); head(height)

    [Diagram: where R looks for an object. Look 1st: .GlobalEnv. Look 2nd: the attached women data frame (height, weight). Look 3rd: the datasets package (women).]

    Adding a Variable & Making a DF

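        women$bmi = weight / height ^2 * 703; bmi
        head(women)
        ls()

    [Diagram: where R looks for an object now. Look 1st: .GlobalEnv, which holds a new copy of women (the data frame with bmi). Look 2nd: the attached women (height, weight; the data frame without bmi). Look 3rd: the datasets package (women).]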

    Making a Data Frame

    Frequently you will want to make data frames for analysis with Rcmdr. Use the data.frame() command:

        attach(sleep)
        pair = data.frame(extra[group=="A"], extra[group=="B"])

    Using Rcmdr for a paired t-test

    Click here.

    Loading Text Data into R

    Reading text files:

        fakeAlleles = read.table("c:\\blah\\fakeAlleles.txt", header = TRUE)

    See if it worked:

        fakeAlleles
        names(fakeAlleles)
        summary(fakeAlleles)
        fakeAlleles$dude = as.character(fakeAlleles$dude)
        fakeAlleles

    A better option:

        fakeAlleles = read.table("c:\\blah\\fakeAlleles.txt", header = TRUE,
                                 colClasses = c("character", "factor", "factor"))

    Other Text Formats

    Other text reading methods:
      read.csv = comma separated values
      read.csv2 = semicolon delimited files
      read.delim = tab delimited files
      read.fwf = fixed width format files

    They use the same options as read.table.

    If the data has bad or no column headings you may also want to include:

        read.table(stuff, col.names = c("name1", "name2"))

    To prevent characters from coming in as factors:

        options(stringsAsFactors = FALSE)
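    If you want to try these options without a file on disk, the text= argument of read.table and read.csv reads from a string instead. The sketch below is only an illustration: the three records are invented, and only the column names (dude, allele1, allele2) come from the slides.

```r
# Invented data standing in for fakeAlleles.txt
rawText = "dude allele1 allele2
001 A T
002 A A
003 G T"

# colClasses sets the type of each column as it is read
fakeAlleles = read.table(text = rawText, header = TRUE,
                         colClasses = c("character", "factor", "factor"))
sapply(fakeAlleles, class)
fakeAlleles$dude    # the leading zeros survive because dude is character

# No header line in the data, so supply the names with col.names
noHeader = read.csv(text = "001,A,T\n002,A,A", header = FALSE,
                    col.names = c("dude", "allele1", "allele2"),
                    colClasses = "character")
names(noHeader)
```

    Reading the ID column as character is the reason colClasses is the "better option": read as a number, 001 would silently become 1.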

    Data Frames

    The data are imported into a data frame.

        class(fakeAlleles)

    A data frame really is a list of vectors where the vectors are all the same length.

        as.list(fakeAlleles)

    To select a column you specify the data frame, a $, and the variable name.

        theDudes = fakeAlleles$dude

    All the stuff you saw for logic checks on vectors can be used on the parts of a data frame.

        fakeAlleles$allele1 == "A"

    Subsetting Vectors (again)

    Recall that you can subset using the [ ] operator:

        ages = c(9, 11, 40, 41)
        heyMoe = ages[2]
        ages

    Subsetting Data Frames

    Parts (subsets) of data frames are referenced by "row numbers comma column numbers":
      The first record: fakeAlleles[1, ]
      The 2nd and 3rd columns: fakeAlleles[ , c(2,3)]
      The genotype for record 6: fakeAlleles[6, c(2,3)]

    or by names:

        fakeAlleles[, c("allele1", "allele2")]

    Getting Counts with Rcmdr

    Subsetting Using Logic

    You can use logic checks to subset:

        fakeAlleles$allele1 == "A" & fakeAlleles$allele2 == "A"
        fakeAlleles[fakeAlleles$allele1 == "A" & fakeAlleles$allele2 == "A", ]

    Importing From Excel

    If you have Perl on your machine, you can use the read.xls() function in the gdata library to easily get data out of Excel and into a data frame.
      Mac: has Perl
      Windows: http://www.activestate.com/activeperl/

    Using read.xls

    Windows:

        library(gdata)
        sleepy = read.xls("c:\\blah\\sleep.xls")

    Mac:

        library(gdata)
        read.xls("/users/balise/desktop/sleep.xls")

    It's that easy. Behind the scenes it is converting the xls file into a csv so you can use the text importing options.

    Do summary() on the data frame and notice what happens to the missing value.

    RODBC

    ODBC is a language/convention for accessing databases. R allows you to use ODBC connections to burrow directly into databases and other data containers like Excel.

        library(RODBC)
        channel

    SQL

    If you have to learn one programming language, learn SQL.

    With it you can manipulate data stored in nearly every commercial database.
    You can aggregate, subset and modify data.
    It is well implemented inside of both R and SAS.

    SQL with R is nicely documented in Spector's (2008) Data Manipulation with R. It is a must-own for people who want to learn R.

    Exporting Text Files

    R can write objects full of data, including data frames, into text files.

    By default, it will quote the character strings and fill in the letters NA where there were originally missing values.

    This code exports back to the original appearance.

        write.table(sleepy, file = "c:\\blah\\exported.tab", sep = "\t", quote = FALSE, na = "")

    Office 2007 Excel

    ODBC connection
    1. Open Control Panel.
    2. Double click Administrative Tools.
    3. Double click Data Sources (ODBC).
    4. On the User DSN tab choose Add.
    5. Click Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb).
    6. Give the connection a name and browse to the file.

    Jot down the name of the connection for the R code.

    Using an ODBC connection

    Once the ODBC connection is set up, use code like this:

        library(RODBC)
        connection = odbcConnect("sleepODBC")
        dataFromODBC = sqlFetch(connection, "Sheet1")
        odbcClose(connection)

    Creating Programs

    You can write line-by-line instructions in the R console, use the editors built into R, or use a third-party editor (like Tinn-R for Windows, or JGR).

    Console
      Type history() to see the lines you have submitted recently, then save them to a file and re-run them later if needed.

    Built-in Editor
      Mac: Click the blank page at the top of the console
      Windows: File > New Script

    Windows Tinn-R Editor

    http://www.sciviews.org/Tinn-R/index.html

    OO Programming in R

    OO programming requires objects.
      classes: describe specific properties for groups of objects
      inheritance: classes related to each other (derived from other classes) have related properties
      polymorphism: the same function name applied to different classes does different things

    R vs. Java: R typically has separate classes for actions instead of bundling them with the data structures.
      Java: Animal -> domesticated -> dog (walks)
      R: Animal -> domesticated -> dog
         Movement -> Walks
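    One way to picture the R side of that comparison is with R's S3 system, where the action lives in a separate generic function rather than inside the object. This is a sketch under my own assumptions: the generic move(), the objects rex and goldie, and the class names are all invented to mirror the Animal -> domesticated -> dog example.

```r
# The action is its own generic function, separate from the data
move = function(x, ...) UseMethod("move")
move.animal = function(x, ...) paste(x$name, "moves")   # general behavior
move.dog    = function(x, ...) paste(x$name, "walks")   # dog-specific behavior

# Inheritance is just the order of the class names, most specific first
rex    = structure(list(name = "Rex"),
                   class = c("dog", "domesticated", "animal"))
goldie = structure(list(name = "Goldie"),
                   class = c("domesticated", "animal"))

move(rex)       # "Rex walks": move.dog is found first
move(goldie)    # "Goldie moves": falls back to move.animal
inherits(rex, "animal")   # TRUE
```

    The same function name, move, does different things for different classes, which is the polymorphism bullet in action.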

    Polymorphism is Fun

    plot does different things depending on the function arguments:

        plot(sleepy)
        plot(sleepy$extra)
        plot(sleepy$baseline, sleepy$extra, sleepy$group)

    Take a look at how it works:

        isS4(plot)
        methods(plot)
        getAnywhere(plot.factor)

    Writing Functions

    You can easily write functions, but notice that only the last thing calculated is returned:

        MandM = function(x) {mean(x); median(x)}
        MandM(sleepy$extra)    # returns only the median

    Store the values you want into a list:

        MandM = function(x) {
          blah = list(theMean=0, theMedian=0)
          blah$theMean = mean(x)
          blah$theMedian = median(x)
          return (blah)
        }
        MandM(sleepy$extra)

    Other Arguments

        MandM(sleepy$baseline)

    It points out that we need to deal with missing values. Look up mean and median and you will see they allow the na.rm parameter to determine if missing values are dropped.

    Using

        MandM(sleepy$baseline, na.rm=TRUE)

    does not work because the parameter list does not allow it. We want to allow that parameter to be passed along. So rewrite the function.

    M and M Again

    Recall that ... in the argument list means "other stuff".

        MandM = function(x, ...) {
          blah = list(theMean=0, theMedian=0)
          blah$theMean = mean(x, ...)
          blah$theMedian = median(x, ...)
          return (blah)
        }
        MandM(sleepy$extra)
        MandM(sleepy$baseline, na.rm=TRUE)

    Applying Your Function

    R does allow you to write loops to iterate over records or variables, but if you are not writing novel math functions they can generally be avoided.

    R will try to vectorize and process:

        MandM(c(sleepy$baseline, sleepy$extra))

    Use sapply to apply a function to each column of a data frame:

        sapply(sleepy, MandM, na.rm=TRUE)

    Better M and M

        MandM = function(x, ...) {
          blah = list(theMean=0, theMedian=0)
          if (is.numeric(x) == TRUE) {
            blah$theMean = mean(x, ...)
            blah$theMedian = median(x, ...)
          }
          return (blah)
        }
        sapply(sleepy, MandM, na.rm=TRUE)

    Yummy M and M

        MandM = function(x, ...) {
          blah = list(theMean=NaN, theMedian=NaN)
          if (is.numeric(x) == TRUE) {
            blah$theMean = mean(x, ...)
            blah$theMedian = median(x, ...)
          }
          return (blah)
        }
        MandM(sleepy$extra)
        MandM(sleepy$baseline, na.rm=TRUE)
        sapply(sleepy, MandM, na.rm=TRUE)

    Writing Novel Functions

    Look hard on rseek.org before you reinvent the wheel.

    R syntax is very similar to C.

    Select/Case logic is different (R short-circuits).

    The R Book by Crawley is too big to buy for just this topic but it is good for syntax. Get it from the library and read the early chapters.

    The final chapter of Spector has a few wonderful pages.

    Destroying Efficiency

    A matrix of data is really a vector with row and column attributes added to it. This has profound speed issues if you add to the size of a matrix because the data has to be shifted all over the place.

    If you plan on writing your own functions to manipulate matrices, build an empty matrix of the maximum size (or guess bigger) rather than using the functions to add rows or columns.
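    The two approaches look like this side by side. This is a sketch with made-up contents (each row holds i and i squared) and an arbitrary size of 1000 rows; the point is the pattern, not the numbers.

```r
n = 1000

# Slow pattern: grow the matrix one row at a time.
# Every rbind() builds a new, bigger matrix and copies the old data into it.
grown = NULL
for (i in 1:n) {
  grown = rbind(grown, c(i, i^2))
}

# Faster pattern: build the full-size empty matrix once, then fill it in
filled = matrix(NA_real_, nrow = n, ncol = 2)
for (i in 1:n) {
  filled[i, ] = c(i, i^2)
}

dim(grown)
all(grown == filled)    # both patterns end up holding the same values
```

    Wrapping each loop in system.time() is an easy way to see the gap on your own machine, and it widens quickly as n gets larger.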

    Writing Efficient Code

    R has decent tools for profiling code.

    The Rprof and summaryRprof functions will help you figure out what is bogging down your code.

        Rprof()
        MandM(rnorm(1000000))
        Rprof(NULL)
        summaryRprof()

    Debugging in R

    See Chapter 9 in Gentleman's book.

    The browser() function can be put inside a function to pause execution and see what is going on.

    The codetools package is great for tweaking big functions:
      findLocals() and findGlobals() show you if variables and functions originate inside of a function
      checkUsage() and checkUsagePackage() show you what variables are modified or not touched in a function

    Creating Graphs

    Basic plots are easy but tweaking them for publications can be rough because the documentation on the function arguments is appalling.

    Data Analysis and Graphics Using R by John Maindonald and John Braun is extremely useful.

    There are myriad graphics built into the core of R plus more in the packages:
    http://addictedtor.free.fr/graphiques/thumbs.php


    Test Scores

    scores = read.table("c:\\blah\\walkerScores.txt", header = TRUE)
    rapply(scores, class)   # check the class of each column
    scores$CENTER = as.factor(scores$CENTER)
    scores$PAT = as.character(scores$PAT)
    rapply(scores, class)   # confirm the conversions worked
    scores$isSick = ifelse(scores$SCORE > 0, 1, 0)
    library(car)   # for recode()
    (scores$SEV = with(scores, recode(SCORE, '0 = "None"; 1:30 = "Mild"; 31:69 = "Moderate"; 70:100 = "Severe"; else = "BADDATA"')))
    # "BADDATA" is not one of the levels below, so those rows become NA
    (scores$SEV = factor(scores$SEV, levels = c("None", "Mild", "Moderate", "Severe"), ordered = TRUE))
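If you do not have the car package handy, the same banding can be sketched in base R with cut(). This is an alternative to the recode() call above, not the method used in the talk, and the scores below are made up so the example runs on its own. It assumes the scores are whole numbers.

```r
# Made-up scores, including one out-of-range value (120)
scores <- data.frame(SCORE = c(0, 15, 30, 31, 69, 70, 100, 120))

scores$isSick <- ifelse(scores$SCORE > 0, 1, 0)

# Intervals are (-1, 0], (0, 30], (30, 69], (69, 100];
# anything outside them becomes NA
scores$SEV <- cut(scores$SCORE,
                  breaks = c(-1, 0, 30, 69, 100),
                  labels = c("None", "Mild", "Moderate", "Severe"),
                  ordered_result = TRUE)

table(scores$SEV, useNA = "ifany")
```

One difference to keep in mind: a fractional score such as 30.5 falls into "Moderate" here, whereas the recode() specification would send it to the else branch.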


    Common Plots are Easy

    attach(scores)   # to avoid typing scores$
    plot(SEV, main = "MainTitle", xlab = "xlab", ylab = "ylab")
    plot(SCORE)
    hist(SCORE)
    boxplot(SCORE)
    boxplot(SCORE ~ SEX, ylim = c(0, 100))
    detach(scores)


    Graphics Tweaks

    [Figure: four panels on one page. Density = dnorm (probability density against z); Probability = pnorm (probability against z); Quantiles = qnorm (quantile (Z) against p); Random numbers = rnorm (histogram of frequency against z)]

    mfrow is used to set the number of rows and columns of graphics on a page.
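A four-panel page like the one described can be sketched with par(mfrow = ...). The grids, line types, and the sample size of 1000 are assumptions chosen for the illustration; the original slide's code is not shown.

```r
old_par <- par(mfrow = c(2, 2))   # 2 rows x 2 columns of plots on one page

z <- seq(-3, 3, length.out = 200)        # grid of z values
p <- seq(0.01, 0.99, length.out = 200)   # grid of probabilities

plot(z, dnorm(z), type = "l", main = "Density = dnorm",
     xlab = "z", ylab = "Probability density")
plot(z, pnorm(z), type = "l", main = "Probability = pnorm",
     xlab = "z", ylab = "Probability")
plot(p, qnorm(p), type = "l", main = "Quantiles = qnorm",
     xlab = "p", ylab = "Quantile (Z)")
hist(rnorm(1000), main = "Random numbers = rnorm",
     xlab = "z", ylab = "frequency")

par(old_par)   # restore the previous layout
```

Plots fill the page row by row. Use mfcol instead of mfrow if you want them to fill column by column.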


    Strip Charts for Small Datasets

    par(cex = 1.5)   # big font
    with(Gad, stripchart(HAMA ~ DOSEGRP, xlab = "HAMA", pch = 16))

    [Figure: strip chart of HAMA (x-axis, roughly 20 to 35) for the dose groups HI, LO, and PB]


    3 Languages for the Price of 1

    The graphics I have shown use the classic graphic methods.

    There are trellis plots from the lattice package that split the data into multiple panes automatically.

    ggplot2 uses a "grammar of graphics" approach (like SPSS).


    Don't play with pie!

    library(lattice)
    trellis.par.set(list(fontsize = list(points = 20)))
    trellis.par.set(list(fontsize = list(text = 25)))
    dotplot(table(Gad$DOSEGRP), xlim = c(-1, 21))

    [Figure: dot plot of Freq (0 to 20) for the DOSEGRP levels HI, LO, and PB]

    The lattice package makes trellis graphics (I didn't make up these names!).

    [Figure: multi-panel lattice plot of NOx (micrograms/J) against Compression Ratio, one panel per band of EE]

    Typical lattice plot with banding to show subsets
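The axis labels on this figure match the ethanol data set that ships with lattice, so a plot of this kind can be sketched as below. That the slide used exactly this data and these settings is an assumption; the number of bands and the overlap follow the example in the lattice documentation for ethanol.

```r
library(lattice)

# Split the equivalence ratio E into 9 overlapping bands ("shingles")
EE <- equal.count(ethanol$E, number = 9, overlap = 1/4)

# One panel per band of EE, each showing NOx against compression ratio C
p <- xyplot(NOx ~ C | EE, data = ethanol,
            xlab = "Compression Ratio",
            ylab = "NOx (micrograms/J)")
print(p)
```

The vertical bar in the formula is what asks lattice to draw one panel per subset; the shaded strip above each panel shows which band of EE it covers.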


    ggplot2 builds a plot in layers: basic plot + geometric details + adding details + adding more details + yet more details

    qplot(carat, price, data = diamonds, geom = c("point", "smooth"))
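The "basic plot + details" idea is easier to see with ggplot() than with the qplot() shortcut. The sketch below assumes the ggplot2 package is installed; the transparency setting and the axis labels are choices made for the illustration.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price)) +   # basic plot: data and axes
  geom_point(alpha = 0.1) +                          # geometric detail: points
  geom_smooth() +                                    # adding details: smoothed trend
  labs(x = "Carat", y = "Price (US dollars)")        # more details: axis labels

print(p)
```

Each + adds one piece to the plot object, and nothing is drawn until the object is printed.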

    Use Rcmdr (R Commander)


    Rcmdr has A LOT of great graphics built into the point and click interface.

    library(Rcmdr)

    Look up my short course (5 talks) covering basic statistics to see how to code many graphics.

    www.stanford.edu/~balise/HowToDoBiostatistics.htm

    You are Going to Need More Help

    Data Manipulation with R by Spector. A must-have book on how to read and write data with or without SQL, manipulate data with R, aggregate data, and reshape datasets easily.

    R Programming For Bioinformatics by Gentleman. A very good intermediate level book on how R object-oriented programming really works.

    The R Book or Statistical Computing by Crawley. Both have nicely written intermediate level statistics, but the two books are highly redundant with each other.



    Biostatistics

    John Fox, the guy who made Rcmdr, is an excellent author and he provides an R-based supplement for his superb statistics book.


    Spectrum

    If you are doing biomedical research and have questions we are here to help.

      Study design
      Analysis plan
      Power and sample size calculation
      (Limited availability help with SAS and R code)

    med.stanford.edu/spctrm/biostatistician.html
