A very brief introduction to R - Drexel University

A very brief introduction to R

Erjia Yan

January 25, 2010

Outline

• Introduction

• Data input

• Data type

• Functions

• Graphics

• Resources

• Examples

What is R?

• A statistical computing language;

• A wide variety of statistical and numerical methods;

• Easy to build your own functions;

• High-quality visualization and graphics tools;

• Abundant free packages; and

• Extensive help files.

Statistics R can do

• Factorial methods

• Clustering

• Probability Distributions

• Statistical Tests

• Regression

• Generalized Linear Models

• Mixed models, etc.

Factorial methods

• Principal Component Analysis (PCA)

• Distance-based methods

• SOM (Self-Organizing Maps)

• Simple Correspondence Analysis (CA)

• Multiple Correspondence Analysis

• Log-linear model (Poisson Regression)

• Discriminant Analysis

• Canonical analysis

• Kernel methods

• Neural networks

Clustering

• Non-hierarchical clustering (k-means)

• Hierarchical Classification (dendrogram)

• Density estimation

Probability Distributions

• Discrete probability distributions

• Continuous probability distributions

• Extreme value theory

Statistical Tests

• Parametric Tests

• Discrete variables and the Chi^2 test

• Non-parametric tests

Regression

• Linear regression

• Non-linear regression

Generalized Linear Models

• Naive Bayes classifier

• Discriminant Analysis

• Logistic Regression

Reading data into R

• R is not well suited for data preprocessing;

• Preprocess data elsewhere (SPSS, etc.);

• Easiest form of data to input: text file;

• Spreadsheet-like data:
– Small/medium size: read.table()
– Large data: scan()

• Read from other systems:
– Use the package “foreign”: library(foreign)
– Can import from SAS, SPSS, Epi Info
– Can export to Stata
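As a rough illustration of the two text-file readers named on this slide, the sketch below writes a tiny tab-delimited file to a temporary location and reads it back. The file contents and the column names (id, group, wg) are made up for the example.

```r
# Hypothetical sample data written to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tgroup\twg",
             "1\tA\t2.5",
             "2\tB\t-1.0",
             "3\tA\t4.2"), tmp)

# read.table(): returns a data frame; header = TRUE takes column names from line 1.
D <- read.table(tmp, header = TRUE, sep = "\t")
str(D)

# scan(): lower-level reader; 'what' gives the type of each field.
cols <- scan(tmp, what = list(id = 0, group = "", wg = 0),
             sep = "\t", skip = 1, quiet = TRUE)
mean(cols$wg)

unlink(tmp)  # remove the temporary file
```

read.table() is usually the more convenient choice because it returns a data frame directly; scan() returns a plain list or vector and needs the field types spelled out.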

Reading data into R (cont.)

• R Commander
– Package: Rcmdr
– > library(Rcmdr)

Naming conventions

• Names may use any Roman letters, digits, and ‘.’; a name must not begin with a digit (or with ‘.’ followed by a digit);

• Avoid using system names: c, q, s, t, C, D, F, I, T, diff, mean, pi, range, rank, tree, var; and

• These rules hold for variables, data, and functions.

Defining new variables

• Assignment symbol: use “<-” (the old “_” form is no longer accepted by current versions of R)

• Scalars
– scal <- 6
– value <- 7

• Vectors
– vec <- c(0,1,2)
– vec2 <- c(1:10)
– vec3 <- c(8,6,4,2,10,12,14)
– famnames <- c("Kate", "Andrew", "Brian")

• Variable names are case sensitive
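A short sketch of how these definitions behave once entered, reusing the names from the slide. The exists() calls are only there to demonstrate case sensitivity.

```r
scal <- 6                      # scalar (a vector of length 1)
vec  <- c(0, 1, 2)             # numeric vector
vec2 <- 1:10                   # same values as c(1:10)
famnames <- c("Kate", "Andrew", "Brian")

length(vec2)                   # number of elements
vec * scal                     # arithmetic is element-wise
famnames[2]                    # indexing starts at 1

# Names are case sensitive: 'Scal' is a different (undefined) name.
exists("scal")
exists("Scal")
```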

Frequently used operators

<-      Assign
+       Sum
-       Difference
*       Multiplication
/       Division
^       Exponent
%%      Mod
%*%     Matrix (dot) product
%/%     Integer division
%in%    Membership (is each element in a set)

|       Or
&       And
<       Less
>       Greater
<=      Less or =
>=      Greater or =
!       Not
!=      Not equal
==      Is equal
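A few of these operators in use. The values are arbitrary and chosen only to make the results easy to check by hand.

```r
7 %% 3                         # remainder
7 %/% 3                        # integer division
2 ^ 10                         # exponent

A <- matrix(1:4, nrow = 2)
A %*% A                        # matrix product (A * A would be element-wise)

c(2, 5) %in% c(1, 2, 3)        # membership test, element by element

x <- c(1, 5, 10)
x > 2 & x < 8                  # element-wise AND
!(x == 5)                      # negation
```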

Frequently used functions

c                   Concatenate
cbind, rbind        Concatenate vectors
min                 Minimum
max                 Maximum
length              # values
dim                 # rows, cols
floor               Largest integer not exceeding the value
which               TRUE indices
table               Counts
summary             Generic stats
sort, order, rank   Sort, order, rank a vector
print               Show value
cat                 Print as char
paste               c() as char
round               Round
apply               Repeat over rows, cols
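The following sketch exercises several of the functions listed above on a small vector and matrix. The data are arbitrary.

```r
v <- c(8, 6, 4, 2, 10, 12, 14)
m <- rbind(1:3, 4:6)           # 2 x 3 matrix built row by row

length(v)                      # number of values
dim(m)                         # rows, columns
which(v > 9)                   # indices where the condition is TRUE
sort(v)                        # sorted values
order(v)                       # permutation that sorts v
rank(v)                        # rank of each element
table(c("a", "b", "a"))        # counts per distinct value
apply(m, 1, sum)               # row sums (MARGIN = 1); use 2 for columns
paste("n =", length(v))        # build a character string
cat("max:", max(v), "\n")      # print without quotes or index
```

Note the difference between sort() and order(): the first returns the sorted values, the second returns the positions that would sort them, which is what you need to reorder the rows of a data frame.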

Statistical functions

rnorm, dnorm, pnorm, qnorm   Normal distribution random sample, density, cdf and quantiles
lm, glm, anova               Model fitting
loess, lowess                Smooth curve fitting
sample                       Resampling (bootstrap, permutation)
.Random.seed                 Random number generation
mean, median                 Location statistics
var, cor, cov, mad, range    Scale statistics
svd, qr, chol, eigen         Linear algebra
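A minimal sketch of the distribution, model-fitting, and resampling functions from this table. The data are simulated, the slope of 2 is an assumption built into the simulation, and set.seed() is used here as the usual way to fix the random number stream.

```r
set.seed(1)                          # fix the random number stream for reproducibility
x <- rnorm(100, mean = 0, sd = 1)    # random sample
dnorm(0)                             # density at 0
pnorm(1.96)                          # cdf: P(Z <= 1.96)
qnorm(0.975)                         # quantile: inverse of the cdf

# Simulated (hypothetical) response with a known slope of 2.
y <- 1 + 2 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)
coef(fit)
anova(fit)

# Bootstrap resample of the mean using sample().
boot_means <- replicate(200, mean(sample(x, replace = TRUE)))
sd(boot_means)
```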

Graphical functions

plot              Generic plot, e.g. scatter
points            Add points
lines, abline     Add lines
text, mtext       Add text
legend            Add a legend
axis              Add axes
box               Add box around all axes
par               Plotting parameters
colors, palette   Use colors

Writing R code

• Can input lines one at a time into R

• Can write many lines of code in a text editor and run all at once
– Using the Windows version, simply paste the commands into R
– Using the Unix version, save the commands and run in batch mode

Types of commands

• Defining variables

• Inputting data

• Using built-in functions

• Using the help menu and notation
– ?functionname, help.search("functionname")

• Writing your own functions
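To make the last two items concrete, here is a small user-defined function together with the help commands. The function name std_error and its na.rm argument are invented for this sketch.

```r
# A user-defined function (name and arguments are hypothetical).
std_error <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]     # optionally drop missing values
  sd(x) / sqrt(length(x))          # standard error of the mean
}

std_error(c(2, 4, 4, 4, 5, 5, 7, 9))

# Looking things up:
# ?mean                       opens the help page for mean()
# help.search("regression")   searches installed help pages by keyword
args(sd)                      # shows a function's arguments
```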

Language layout

• Three types of statement
– expression: it is evaluated, printed, and the value is lost (3+5)
– assignment: passes the value to a variable but the result is not printed automatically (out<-3+5)
– comment: (#This is a comment)

Loops and conditionals

• Conditional
– if (expr) expr
– if (expr) expr else expr

• Iteration
– repeat expr
– while (expr) expr
– for (name in expr1) expr

• For comparisons use:
– == for equal
– != for not equal
– > for greater than
– && for and
– || for or (& and | are the element-wise forms for vectors)
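The three iteration forms and the conditional can be sketched as follows. The data and variable names are arbitrary.

```r
x <- c(3, -1, 0, 8)

# for loop with if / else
signs <- character(length(x))
for (i in seq_along(x)) {
  if (x[i] > 0) {
    signs[i] <- "positive"
  } else if (x[i] == 0) {
    signs[i] <- "zero"
  } else {
    signs[i] <- "negative"
  }
}
signs

# while loop: smallest n with 2^n >= 100
n <- 0
while (2^n < 100) n <- n + 1
n

# repeat needs an explicit break
k <- 0
repeat {
  k <- k + 1
  if (k >= 3 && k %% 3 == 0) break
}
k
```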

Plot Command

• The basic command-line command for producing a scatter plot or line graph.
– col= set colors,
– lty= set line types,
– lwd= set line widths,
– pch= set the character type,
– type= pick points (type = "p"), lines ("l"),
– cex= set the "character expansion",
– xlab= and ylab= set the labels,
– xlim= and ylim= set the limits of the axes,
– main= put a title on the plot,
– sub= add a sub-title (mtext() is a separate function that adds text in the margins),
– help(par) for details
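A sketch that combines most of these arguments in one call. The data are simulated; note that sub= is an argument of plot(), whereas mtext() is called separately after the plot exists.

```r
# Simulated (hypothetical) data.
set.seed(42)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 3)

plot(x, y,
     type = "b",               # both points and lines
     col  = "blue",            # color
     pch  = 16,                # filled circles
     lty  = 2,                 # dashed line
     lwd  = 2,                 # line width
     cex  = 1.2,               # symbol size
     xlim = c(0, 21), ylim = range(y) + c(-2, 2),
     main = "Main title", sub = "Sub-title",
     xlab = "x", ylab = "y")
mtext("Margin text via mtext()", side = 3, line = 0.3)

usr <- par("usr")              # limits of the plotting region actually used
```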

One-Dimensional Plots

• barplot(height) #simple form

• barplot(height, width, names, space=.2, inside=TRUE, beside=FALSE, horiz=FALSE, legend, angle, density, col, blocks=TRUE)

• boxplot(..., range, width, varwidth=FALSE, notch=FALSE, names, plot=TRUE)

• hist(x, nclass, breaks, plot=TRUE, angle, density, col, inside)
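A minimal sketch of the three one-dimensional plots, using the mtcars and faithful data sets that ship with R. It deliberately uses only a handful of arguments that current R accepts, so it does not mirror every argument in the signatures above.

```r
counts <- table(mtcars$cyl)            # cars per cylinder count
bp <- barplot(counts, names.arg = names(counts), col = "light blue",
              main = "Cars by number of cylinders")

bx <- boxplot(mpg ~ cyl, data = mtcars, varwidth = TRUE,
              xlab = "Cylinders", ylab = "Miles per gallon")

h <- hist(faithful$waiting, breaks = 15, col = "light blue",
          main = "Waiting time between eruptions", xlab = "Minutes")
sum(h$counts)                          # equals the number of observations
```

Each call also returns a value (bar midpoints, box statistics, bin counts), which is handy when you want to annotate the plot afterwards.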

Two-Dimensional Plots

• lines(x, y, type="l")

• points(x, y, type="p")

• matplot(x, y, type="p", lty=1:5, pch=, col=1:4)

• matpoints(x, y, type="p", lty=1:5, pch=, col=1:4)

• matlines(x, y, type="l", lty=1:5, pch=, col=1:4)

• plot(x, y, type="p", log="")

• abline(coef), abline(a, b), abline(reg), abline(h=), abline(v=)

• qqplot(x, y, plot=TRUE)

• qqnorm(x, datax=FALSE, plot=TRUE)
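The sketch below strings several of these functions together on simulated data: a scatter plot, added lines and points, a multi-series matplot, and a normal Q-Q plot. The regression setup is an assumption made for the example.

```r
set.seed(7)
x <- seq(0, 10, length.out = 50)
y <- 1 + 0.5 * x + rnorm(50, sd = 0.8)     # simulated data

plot(x, y, type = "p")                     # scatter plot
reg <- lm(y ~ x)
abline(reg, col = "red")                   # fitted regression line
abline(h = mean(y), lty = 3)               # horizontal reference line
lines(x, 1 + 0.5 * x, lty = 2)             # the true line used in the simulation
points(x[1], y[1], pch = 19, col = "blue") # highlight one point

# Several series at once: each column of the matrix is one series.
matplot(x, cbind(sin(x), cos(x)), type = "l", lty = 1:2, col = 1:2)

qq <- qqnorm(resid(reg))                   # normal Q-Q plot of the residuals
qqline(resid(reg))
```

plot() and matplot() start a new figure, while lines(), points(), abline(), and qqline() draw on top of the current one.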

Three-Dimensional Plots

• contour(x, y, z, v, nint=5, add=FALSE, labex)

• interp(x, y, z, xo, yo, ncp=0, extrap=FALSE)

• persp(z, eye=c(-6,-8,5), ar=1)
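As a hedged illustration with arguments that current base R accepts, the following uses the built-in volcano elevation matrix. It covers contour() and persp() and adds image() as a filled alternative; interp() is left out because, as far as I know, it comes from an add-on package rather than base R.

```r
z <- volcano                               # 87 x 61 matrix of heights
x <- seq_len(nrow(z))
y <- seq_len(ncol(z))

contour(x, y, z, nlevels = 10, main = "Contour plot")

image(x, y, z, col = terrain.colors(20))   # filled version
contour(x, y, z, add = TRUE)               # overlay contour lines

pm <- persp(x, y, z, theta = 30, phi = 35, expand = 0.4,
            col = "light blue", border = NA, shade = 0.5,
            main = "Perspective plot")
```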

Basic Graphics

• Histogram– hist(D$wg)

Basic Graphics

• Add a title…
– The “main” argument will give the plot an overall heading.
– hist(D$wg, main='Weight Gain')

Basic Graphics

• Adding axis labels…

• Use “xlab” and “ylab” to label the X and Y axes, respectively.

• hist(D$wg, main='Weight Gain', xlab='Weight Gain', ylab='Frequency')

Basic Graphics

• Changing colors…

• Use the col argument.
– ?colors will give you help on the colors.
– Common colors may simply be given by name.
– hist(D$wg, main="Weight Gain", xlab="Weight Gain", ylab="Frequency", col="blue")

Basic Graphics – Colors

Scatter Plots

• Suppose we have two variables and we wish to see the relationship between them.

• A scatter plot works very well.

• R code:
– plot(x,y)

• Example
– plot(D$metmin,D$wg)

Scatterplots

plot(D$metmin,D$wg,main='Met Minutes vs. Weight Gain', xlab='Mets (min)',ylab='Weight Gain (lbs)')

Scatterplots

plot(D$metmin,D$wg,main='Met Minutes vs. Weight Gain', xlab='Mets (min)',ylab='Weight Gain (lbs)',pch=2)

Line Plots

• Data are often recorded over time.

• Consider Dell stock
– D2 <- read.csv("H:\\Dell.csv",header=TRUE)

– t1 <- 1:nrow(D2)

– plot(t1,D2$DELL)

Line Plots

plot(t1,D2$DELL,type="l")

Line Plots

plot(t1,D2$DELL,type="l",main='Dell Closing Stock Price',xlab='Time',ylab='Price $')

Overlaying Plots

• Often we have more than one variable measured against the same predictor (X).
– plot(t1,D2$DELL,type="l",main='Dell Closing Stock Price',xlab='Time',ylab='Price $')

– lines(t1,D2$Intel)

Overlaying Graphs

lines(t1,D2$Intel,lty=2)

Adding a Legend

• Adding a legend is a bit tricky in R.

• Syntax: legend(x, y, names, line types)
– x: X coordinate
– y: Y coordinate
– names: names of series in column format
– line types: corresponding line types

Adding a Legend

legend(60,45,c('Dell','Intel'),lty=c(1,2))
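Because the stock data file is not included here, the sketch below reproduces the overlay-and-legend pattern with two simulated series. The series, their names, and the "topleft" keyword position are assumptions for the example. The point to notice is that labels and line types are matched by position, so the first label must belong to the series drawn with lty = 1.

```r
# Simulated (hypothetical) price series standing in for the two stocks.
set.seed(3)
t1 <- 1:100
a  <- 40 + cumsum(rnorm(100))
b  <- 30 + cumsum(rnorm(100))

# ylim covers both series so the second one is not clipped.
plot(t1, a, type = "l", lty = 1, ylim = range(a, b),
     main = "Two series over time", xlab = "Time", ylab = "Price $")
lines(t1, b, lty = 2)

# Labels and line types are matched by position.
lg <- legend("topleft", legend = c("Series A", "Series B"), lty = c(1, 2))
```

Using a keyword such as "topleft" instead of x and y coordinates avoids having to guess a position in data units.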

Good resources

• http://cran.r-project.org/manuals.html

• Statistics with R
– 17 chapters, 1266 pages;
– High definition images;
– All the code is online; and
– Freely available at http://zoonek2.free.fr/UNIX/48_R/all.html

SOME EXAMPLES

Histogram

• library(e1071) # For the "skewness" and "kurtosis" functions

• n <- 1000
• x <- rnorm(n)
• op <- par(mar=c(3,3,4,2)+.1)
• hist(x, col="light blue", probability=TRUE,
    main=paste("skewness =", round(skewness(x), digits=2)),
    xlab="", ylab="")
• lines(density(x), col="red", lwd=3)
• par(op)

Histogram (cumulative)

• op <- par(mfcol=c(2,4), mar=c(2,2,1,1)+.1)
• do.it <- function (x) {
    hist(x, probability=T, col='light blue',
         xlab="", ylab="", main="", axes=F)
    axis(1)
    lines(density(x), col='red', lwd=3)
    x <- sort(x)
    q <- ppoints(length(x))
    plot(q~x, type='l', xlab="", ylab="", main="")
    abline(h=c(.25,.5,.75), lty=3, lwd=3, col='blue')
  }
• n <- 200
• do.it(rnorm(n))
• do.it(rlnorm(n))
• do.it(-rlnorm(n))
• do.it(rnorm(n, c(-5,5)))
• par(op)

Histogram (Old Faithful Geyser Data)

• Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

• hist(faithful$eruptions, probability=TRUE, breaks=20, col="light blue", xlab="", ylab="", main="Histogram and density estimation")

• points(density(faithful$eruptions, bw=1), type='l', lwd=3, col='black')

• points(density(faithful$eruptions, bw=.5), type='l', lwd=3, col='blue')

• points(density(faithful$eruptions, bw=.3), type='l', lwd=3, col='green')

• points(density(faithful$eruptions, bw=.1), type='l', lwd=3, col='red')

Lines

• library(e1071) # For the "skewness" and "kurtosis" functions

• n <- 1000
• x <- rnorm(n)
• qqnorm(x, main=paste("kurtosis =", round(kurtosis(x), digits=2), "(gaussian)"))
• qqline(x, col="red")
• op <- par(fig=c(.02,.5,.5,.98), new=TRUE)
• hist(x, probability=T, col="light blue", xlab="", ylab="",
    main="", axes=F)
• lines(density(x), col="red", lwd=2)
• box()
• par(op)

Dot chart (Areas of the World's Major Landmasses)

• The areas in thousands of square miles of the landmasses which exceed 10,000 square miles.

• data(islands)

• dotchart(islands, main="Island area")

• dotchart(sort(log(islands)), main="Island area (logarithmic scale)")

Scatter plot (Old Faithful Geyser Data)

• op <- par(mar=c(3,4,2,2)+.1)

• plot(sort(faithful$eruptions), xlab="")

• rug(faithful$eruptions, side=2)

• par(op)

Scatter plot (Edgar Anderson's Iris Data)

• This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

• data(iris)
• plot(iris[1:4], pch = 21,
    bg = c("red", "green", "blue")[ as.numeric(iris$Species) ])

Scatter plot (Lawyers' Ratings of State Judges in the US Superior Court)

• Lawyers' ratings of state judges in the US Superior Court.

• pairs(USJudgeRatings, gap=0)

Scatter plot (Longley's Economic Regression Data)

• pairs(longley,
    gap=0,
    diag.panel = function (x, ...) {
      par(new = TRUE)
      hist(x, col = "light blue", probability = TRUE,
           axes = FALSE, main = "")
      lines(density(x), col = "red", lwd = 3)
      rug(x)
    })

Clustering (Lawyers' Ratings of State Judges in the US Superior Court)

• heatmap(as.matrix(USJudgeRatings))

Kernel Density Estimation

• data(faithful)

• x <- faithful$eruptions

• y <- faithful$waiting

• library(MASS)

• library(fields)

• z <- kde2d(x, y, n=300)

• image.plot(z)

Contour

• data(faithful)
• x <- faithful$eruptions
• y <- faithful$waiting
• library(MASS)
• library(fields)
• z <- kde2d(x, y, n=300)
• contour(z, col = "red", main = "Density estimation: contour plot")

KDE+Contour

• data(faithful)
• x <- faithful$eruptions
• y <- faithful$waiting
• library(MASS)
• library(fields)
• z <- kde2d(x, y, n=300)
• image.plot(z)
• contour(z, col = "red", add=T)

3-D KDE

• data(faithful)
• x <- faithful$eruptions
• y <- faithful$waiting
• library(MASS)
• library(fields)
• z <- kde2d(x, y, n=100)
• op <- par(mar=c(0,0,2,0)+.1)
• persp(z, phi = 45, theta = 30, xlab = "eruptions", ylab = "waiting",
    zlab = "density", col = "yellow", shade = .5, border = NA,
    main = "Density estimation: perspective plot")

• par(op)

References

• http://zoonek2.free.fr/UNIX/48_R/all.html

• http://www.pitt.edu/~super7/17011-18001/17641.ppt

• http://isites.harvard.edu/fs/docs/icb.topic154887.files/Intro_to_R.ppt
