Graphics and Data Visualization in R

Overview

Thomas Girke

December 13, 2013

Source: faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Rgraphics/Rgraphics.pdf


Overview

Graphics Environments
  Base Graphics
  Grid Graphics
  lattice
  ggplot2

Specialty Graphics

Genome Graphics
  ggbio
  Additional Genome Graphics

Clustering
  Background
  Hierarchical Clustering Example
  Non-Hierarchical Clustering Examples


Graphics in R

Powerful environment for visualizing scientific data

Integrated graphics and statistics infrastructure

Publication quality graphics

Fully programmable

Highly reproducible

Full LaTeX and Sweave support

Vast number of R packages with graphics utilities


Documentation on Graphics in R

General
  Graphics Task Page
  R Graph Gallery
  R Graphical Manual
  Paul Murrell’s book R (Grid) Graphics

Interactive graphics
  rggobi (GGobi)
  iplots
  Open GL (rgl)


Graphics Environments

Viewing and saving graphics in R
  On-screen graphics
  postscript, pdf, svg
  jpeg/png/wmf/tiff/...

Four major graphics environments
  Low-level infrastructure
    R Base Graphics (low- and high-level)
    grid
  High-level infrastructure
    lattice
    ggplot2


Base Graphics: Overview

Important high-level plotting functions

plot: generic x-y plotting

barplot: bar plots

boxplot: box-and-whisker plot

hist: histograms

pie: pie charts

dotchart: Cleveland dot plots

image, heatmap, contour, persp: functions to generate image-like plots

qqnorm, qqline, qqplot: distribution comparison plots

pairs, coplot: display of multivariate data

Help on these functions

?myfct

?plot

?par


Base Graphics: Preferred Input Data Objects

Matrices and data frames

Vectors

Named vectors
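As a brief illustration of these input types, the following sketch passes a vector, a named vector, a matrix and a data frame to base plotting functions. The object names (v, nv, m, d, bar_mid) are made up for this example and do not appear in the original slides.

```r
# Illustrative inputs; all object names are hypothetical
set.seed(1)
v  <- runif(5)                                   # plain numeric vector
nv <- setNames(v, letters[1:5])                  # named vector
m  <- matrix(runif(15), ncol = 3,
             dimnames = list(letters[1:5], LETTERS[1:3]))  # matrix
d  <- as.data.frame(m)                           # data frame

plot(v)                      # vector: values are plotted against their index
bar_mid <- barplot(nv)       # named vector: the names become the bar labels
barplot(m, beside = TRUE)    # matrix: one group of bars per column
plot(d)                      # data frame with three numeric columns: scatter plot matrix
```

barplot() returns the bar midpoints, which is useful for adding text or error bars at the right x positions later on.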


Scatter Plot: very basic

Sample data set for subsequent plots

> set.seed(1410)

> y <- matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))

> plot(y[,1], y[,2])

[Figure: scatter plot of y[, 2] (y-axis) against y[, 1] (x-axis)]

Scatter Plot: all pairs

> pairs(y)

[Figure: scatter plot matrix of the columns A, B and C]

Scatter Plot: with labels

> plot(y[,1], y[,2], pch=20, col="red", main="Symbols and Labels")

> text(y[,1]+0.03, y[,2], rownames(y))

[Figure: scatter plot titled "Symbols and Labels"; red points labeled with the row names a to j; x-axis y[, 1], y-axis y[, 2]]

Scatter Plots: more examples

Print the row names instead of symbols

> plot(y[,1], y[,2], type="n", main="Plot of Labels")

> text(y[,1], y[,2], rownames(y))

Usage of important plotting parameters

> grid(5, 5, lwd = 2)

> op <- par(mar=c(8,8,8,8), bg="lightblue")

> plot(y[,1], y[,2], type="p", col="red", cex.lab=1.2, cex.axis=1.2,

+ cex.main=1.2, cex.sub=1, lwd=4, pch=20, xlab="x label",

+ ylab="y label", main="My Main", sub="My Sub")

> par(op)

Important arguments

mar: specifies the margin sizes around the plotting area in order: c(bottom, left, top, right)

col: color of symbols

pch: type of symbols, samples: example(points)

lwd: line width; this also affects the outline of symbols that are drawn with a border

cex.*: control font sizes

For details see ?par
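The save-and-restore pattern used with par() above can be sketched on its own as follows. The example also contrasts cex (symbol size) with lwd (outline width). The variable names old_mar and restored_mar and the chosen parameter values are assumptions made for this illustration.

```r
# Sketch: change graphical parameters temporarily and restore them afterwards
old_mar <- par("mar")                            # margins before the change
op <- par(mar = c(5, 5, 3, 1), cex.axis = 0.8)   # par() returns the previous values
plot(1:10, (1:10)^2, pch = 1, cex = 2, lwd = 3, col = "red",
     xlab = "x label", ylab = "y label", main = "cex vs. lwd")
# cex = 2 doubles the symbol size; lwd = 3 thickens the symbol outline
par(op)                                          # restore the previous settings
restored_mar <- par("mar")
```

Restoring with par(op) keeps later plots in the same session from inheriting the modified margins.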


Scatter Plots: more examples

Add a regression line to a plot

> plot(y[,1], y[,2])

> myline <- lm(y[,2]~y[,1]); abline(myline, lwd=2)

> summary(myline)

Same plot as above, but on log scale

> plot(y[,1], y[,2], log="xy")

Add a mathematical expression to a plot

> plot(y[,1], y[,2]); text(y[1,1], y[1,2],

+ expression(sum(frac(1,sqrt(x^2*pi)))), cex=1.3)


Exercise 1: Scatter Plots

Task 1 Generate a scatter plot for the first two columns of the iris data frame and color the dots by its Species column.

Task 2 Use the xlim/ylim arguments to set limits on the x- and y-axes so that all data points are restricted to the bottom left quadrant of the plot.

Structure of iris data set:

> class(iris)

[1] "data.frame"

> iris[1:4,]

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa

> table(iris$Species)

    setosa versicolor  virginica
        50         50         50
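One possible way to approach both tasks is sketched below. It is not the official solution from the course. The color choices, the variable names (species_cols, x_lim, y_lim) and the axis limits are assumptions picked for illustration.

```r
# One possible solution to Exercise 1 (a sketch, not the official answer)
# Task 1: color the points by species; the factor indexes the color vector
species_cols <- c("red", "darkgreen", "blue")[as.integer(iris$Species)]
plot(iris[, 1], iris[, 2], col = species_cols, pch = 20,
     xlab = "Sepal.Length", ylab = "Sepal.Width")
legend("topright", legend = levels(iris$Species),
       col = c("red", "darkgreen", "blue"), pch = 20)

# Task 2: extend both axes well beyond the data so that all points fall
# into the bottom left quadrant of the plotting region
x_lim <- c(4, 16)
y_lim <- c(2, 8)
plot(iris[, 1], iris[, 2], col = species_cols, pch = 20,
     xlim = x_lim, ylim = y_lim,
     xlab = "Sepal.Length", ylab = "Sepal.Width")
```

The limits work because every value lies below the midpoint of its axis range (10 for the x-axis, 5 for the y-axis).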


Line Plot: Single Data Set

> plot(y[,1], type="l", lwd=2, col="blue")

[Figure: blue line plot of y[, 1] (y-axis) against Index (x-axis)]

Line Plots: Many Data Sets

> split.screen(c(1,1))

[1] 1

> plot(y[,1], ylim=c(0,1), xlab="Measurement", ylab="Intensity", type="l", lwd=2, col=1)

> for(i in 2:length(y[1,])) {

+ screen(1, new=FALSE)

+ plot(y[,i], ylim=c(0,1), type="l", lwd=2, col=i, xaxt="n", yaxt="n", ylab="",

+ xlab="", main="", bty="n")

+ }

> close.screen(all=TRUE)

[Figure: three overlaid line plots, one per column of y; x-axis "Measurement", y-axis "Intensity" (0 to 1)]
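A more compact alternative, which is not used in the original slides, is the base R function matplot. It draws one line per matrix column in a single call. The sketch below recreates the sample matrix y so that it runs on its own; line_cols is a name introduced for this example.

```r
# Recreate the sample matrix from the scatter plot slides
set.seed(1410)
y <- matrix(runif(30), ncol = 3, dimnames = list(letters[1:10], LETTERS[1:3]))

# matplot() plots every column of y against the row index
line_cols <- seq_len(ncol(y))
matplot(y, type = "l", lty = 1, lwd = 2, col = line_cols, ylim = c(0, 1),
        xlab = "Measurement", ylab = "Intensity")
legend("topright", legend = colnames(y), col = line_cols, lty = 1, lwd = 2)
```

Compared with the split.screen loop, this avoids overplotting several coordinate systems on top of each other, at the cost of less control over each individual panel.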

Bar Plot Basics

> barplot(y[1:4,], ylim=c(0, max(y[1:4,])+0.3), beside=TRUE,

+ legend=letters[1:4])

> text(labels=round(as.vector(as.matrix(y[1:4,])),2), x=seq(1.5, 13, by=1)

+ +sort(rep(c(0,1,2), 4)), y=as.vector(as.matrix(y[1:4,]))+0.04)

[Figure: grouped bar plot of rows a to d for the columns A, B and C, with each bar's rounded value printed above it]

Bar Plots with Error Bars

> bar <- barplot(m <- rowMeans(y) * 10, ylim=c(0, 10))

> stdev <- apply(y, 1, sd) * 10 # row-wise SDs, scaled like m (sd(t(y)) returns a single value in current R)

> arrows(bar, m, bar, m + stdev, length=0.15, angle = 90)

[Figure: bar plot of the scaled row means for a to j with upper error bars; y-axis from 0 to 10]

Mirrored Bar Plots

> df <- data.frame(group = rep(c("Above", "Below"), each=10), x = rep(1:10, 2), y = c(runif(10, 0, 1), runif(10, -1, 0)))

> plot(c(0,12),range(df$y),type = "n")

> barplot(height = df$y[df$group == 'Above'], add = TRUE,axes = FALSE)

> barplot(height = df$y[df$group == 'Below'], add = TRUE,axes = FALSE)

[Figure: mirrored bar plot with one set of bars above zero and one below; x-axis c(0, 12), y-axis range(df$y)]

Histograms

> hist(y, freq=TRUE, breaks=10)

[Figure: histogram titled "Histogram of y"; x-axis y (0 to 1), y-axis Frequency]

Density Plots

> plot(density(y), col="red")

[Figure: red density curve titled "density.default(x = y)"; x-axis label "N = 30 Bandwidth = 0.136", y-axis Density]

Pie Charts

> pie(y[,1], col=rainbow(length(y[,1]), start=0.1, end=0.8), clockwise=TRUE)

> legend("topright", legend=row.names(y), cex=1.3, bty="n", pch=15, pt.cex=1.8,

+ col=rainbow(length(y[,1]), start=0.1, end=0.8), ncol=1)

[Figure: pie chart of y[, 1] with slices a to j and a matching legend at the top right]

Color Selection Utilities

Default color palette and how to change it

> palette()

[1] "black" "red" "green3" "blue" "cyan" "magenta" "yellow" "gray"

> palette(rainbow(5, start=0.1, end=0.2))

> palette()

[1] "#FF9900" "#FFBF00" "#FFE600" "#F2FF00" "#CCFF00"

> palette("default")

The gray function returns shades of gray for values between 0 (black) and 1 (white)

> gray(seq(0.1, 1, by= 0.2))

[1] "#1A1A1A" "#4D4D4D" "#808080" "#B3B3B3" "#E6E6E6"

Color gradients with the colorpanel function from the gplots package

> library(gplots)

> colorpanel(5, "darkblue", "yellow", "white")

For much more on colors in R, see Earl Glynn’s color chart
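If gplots is not installed, a comparable gradient can be sketched with colorRampPalette from the grDevices package that ships with R. This is an alternative to colorpanel rather than the approach shown on the slide, and the interpolated colors may differ slightly from colorpanel's output. The names blue_yellow_white and grad are made up for this example.

```r
# colorRampPalette() returns a function that interpolates between the given colors
blue_yellow_white <- colorRampPalette(c("darkblue", "yellow", "white"))
grad <- blue_yellow_white(5)     # five colors from dark blue via yellow to white
grad

# Show the gradient as a strip of equally sized colored bars
barplot(rep(1, 5), col = grad, border = NA, space = 0, axes = FALSE,
        main = "Color gradient")
```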


Arranging Several Plots on Single Page

With par(mfrow=c(nrow,ncol)) one can define how several plots are arranged next to each other.

> par(mfrow=c(2,3)); for(i in 1:6) { plot(1:10) }

[Figure: six identical scatter plots of 1:10 against Index, arranged in 2 rows and 3 columns]

Arranging Plots with Variable Width

The layout function divides the plotting device into a variable number of rows and columns, with the column widths and row heights specified in the respective arguments.

> nf <- layout(matrix(c(1,2,3,3), 2, 2, byrow=TRUE), c(3,7), c(5,5),

+ respect=TRUE)

> # layout.show(nf)

> for(i in 1:3) { barplot(1:10) }

[Figure: three bar plots of 1:10; two side by side in the top row with relative widths 3 and 7, and one spanning the bottom row]

Saving Graphics to Files

After the pdf() command, all graphs are redirected to the file test.pdf until dev.off() is called. This works similarly for other common formats: jpeg, png, postscript, tiff, ...

> pdf("test.pdf"); plot(1:10, 1:10); dev.off()

The svg() command generates Scalable Vector Graphics (SVG) files that can be edited in vector graphics programs, such as Inkscape.

> svg("test.svg"); plot(1:10, 1:10); dev.off()
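The page size of the output file can be set when the device is opened. The following sketch uses the width and height arguments of pdf(), given in inches, and writes to a temporary directory. The file name example_plot.pdf and the variable out_file are placeholders chosen for this example.

```r
# Sketch: write a plot to a PDF file with an explicit page size (in inches)
out_file <- file.path(tempdir(), "example_plot.pdf")   # hypothetical output path
pdf(out_file, width = 6, height = 4)    # open the file device
plot(1:10, (1:10)^2, type = "b", main = "Saved to file")
dev.off()                               # close the device; the file is complete only after this
file.exists(out_file)
```

Forgetting dev.off() is a common reason for empty or unreadable output files, because the device keeps the file open until it is closed.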


Exercise 2: Bar Plots

Task 1 Calculate the mean values of the first four columns in the iris data set for each species. Organize the results in a matrix where the row names are the unique values from the iris Species column and the column names are the same as in the first four iris columns.

Task 2 Generate two bar plots: one with stacked bars and one with horizontally arranged bars.

Structure of iris data set:

> class(iris)

[1] "data.frame"

> iris[1:4,]

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa

> table(iris$Species)

    setosa versicolor  virginica
        50         50         50
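One possible way to approach both tasks is sketched below. It is not the official solution from the course. It uses aggregate() for the group means, and "horizontally arranged" is interpreted here as bars placed side by side (beside=TRUE). The names species_means and mean_mat are made up for this example.

```r
# One possible solution to Exercise 2 (a sketch, not the official answer)
# Task 1: mean of the first four columns for each species
species_means <- aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = mean)
mean_mat <- as.matrix(species_means[, -1])
rownames(mean_mat) <- as.character(species_means$Species)
mean_mat

# Task 2: stacked bars (the default) and bars arranged side by side
barplot(mean_mat, legend.text = rownames(mean_mat), main = "Stacked")
barplot(mean_mat, beside = TRUE, legend.text = rownames(mean_mat),
        main = "Side by side")
```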


grid Graphics Environment

What is grid?

  Low-level graphics system
  Highly flexible and controllable system
  Does not provide high-level functions
  Intended as development environment for custom plotting functions
  Pre-installed on new R distributions

Documentation and Help

  Manual
  Book
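The slides give no grid code, so the following is only a minimal sketch of what low-level drawing with grid can look like. It uses a viewport, which is a rectangular drawing region with its own coordinate system, and a few primitive drawing functions. The sizes, colors and label text are arbitrary choices for this illustration.

```r
library(grid)

grid.newpage()                                   # start a clean page
# A viewport is a rectangular region with its own coordinate system
vp <- viewport(x = 0.5, y = 0.5, width = 0.8, height = 0.6)
pushViewport(vp)                                 # subsequent drawing happens inside vp
grid.rect(gp = gpar(col = "blue", lty = "dashed"))
grid.circle(x = 0.3, y = 0.5, r = 0.2, gp = gpar(fill = "lightblue"))
grid.text("drawn inside a viewport", x = 0.7, y = 0.5)
popViewport()                                    # return to the top-level viewport
```

All coordinates inside the viewport are relative to it (0 to 1 by default), which is what makes grid suitable for building custom plotting functions from nested regions.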


lattice Environment

What is lattice?

  High-level graphics system
  Developed by Deepayan Sarkar
  Implements Trellis graphics system from S-Plus
  Simplifies high-level plotting tasks: arranging complex graphical features
  Syntax similar to R’s base graphics

Documentation and Help
  Manual
  Intro
  Book
  library(help=lattice) opens a list of all functions available in the lattice package
  Accessing and changing global parameters: ?lattice.options and ?trellis.device
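As a short sketch of the formula-based syntax and of reading a global setting, the example below conditions a scatter plot on the Species column of iris and then queries the default plot symbol with trellis.par.get(). The object names p_iris and sym are made up for this example.

```r
library(lattice)

# Formula interface: y ~ x | conditioning variable gives one panel per species
p_iris <- xyplot(Sepal.Width ~ Sepal.Length | Species, data = iris,
                 layout = c(3, 1), pch = 20)
print(p_iris)                      # lattice plots are objects; print() draws them

# Global settings can be inspected here and changed with trellis.par.set()
sym <- trellis.par.get("plot.symbol")
sym$col
```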


Scatter Plot Sample

> library(lattice)

> p1 <- xyplot(1:8 ~ 1:8 | rep(LETTERS[1:4], each=2), as.table=TRUE)

> plot(p1)

[Figure: lattice scatter plot of 1:8 against 1:8, split into four panels A, B, C and D]

Line Plot Sample

> library(lattice)

> p2 <- parallelplot(~iris[1:4] | Species, iris, horizontal.axis = FALSE,

+ layout = c(1, 3, 1))

> plot(p2)

[Figure: parallel coordinate plots of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, one panel each for setosa, versicolor and virginica; each axis runs from Min to Max]

ggplot2 Environment

What is ggplot2?

  High-level graphics system
  Implements grammar of graphics from Leland Wilkinson
  Streamlines many graphics workflows for complex plots
  Syntax centered around main ggplot function
  Simpler qplot function provides many shortcuts

Documentation and Help
  Manual
  Intro
  Book
  Cookbook for R


ggplot2 Usage

ggplot function accepts two arguments
  Data set to be plotted
  Aesthetic mappings provided by aes function

Additional parameters such as geometric objects (e.g. points, lines, bars) are passed on by appending them with + as separator.

List of available geom_* functions: see the ggplot2 manual

Settings of plotting theme can be accessed with the command theme_get() and its settings can be changed with theme().

Preferred input data object
  qplot: data.frame (support for vector, matrix, ...)
  ggplot: data.frame

Packages with convenience utilities to create expected inputs
  plyr
  reshape
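In practice, the data.frame requirement often means reshaping a matrix into "long" format with one row per value. The slides point to plyr and reshape for this; the sketch below shows the same idea with base R only, so it runs without extra packages. The column names row_id, sample and value are made up for this illustration.

```r
# Recreate the sample matrix from the base graphics slides
set.seed(1410)
y <- matrix(runif(30), ncol = 3, dimnames = list(letters[1:10], LETTERS[1:3]))

# as.table() followed by as.data.frame() turns the matrix into one row per cell
y_long <- as.data.frame(as.table(y), responseName = "value")
names(y_long)[1:2] <- c("row_id", "sample")   # hypothetical column names
head(y_long, 3)
```

A data frame in this shape can be handed to ggplot with, for example, sample mapped to color and value mapped to the y-axis.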


qplot Function

qplot syntax is similar to R’s basic plot function

Arguments:

  x: x-coordinates (e.g. col1)
  y: y-coordinates (e.g. col2)
  data: data frame with corresponding column names
  xlim, ylim: e.g. xlim=c(0,10)
  log: e.g. log="x" or log="xy"
  main: main title; see ?plotmath for mathematical formula
  xlab, ylab: labels for the x- and y-axes
  color, shape, size
  ...: many arguments accepted by plot function


qplot: Scatter Plots

Create sample data

> library(ggplot2)

> x <- sample(1:10, 10); y <- sample(1:10, 10); cat <- rep(c("A", "B"), 5)

Simple scatter plot

> qplot(x, y, geom="point")

Prints dots with different sizes and colors

> qplot(x, y, geom="point", size=x, color=cat,

+ main="Dot Size and Color Relative to Some Values")

Drops legend

> qplot(x, y, geom="point", size=x, color=cat) +

+ theme(legend.position = "none")

Plot different shapes

> qplot(x, y, geom="point", size=5, shape=cat)


qplot: Scatter Plot with qplot

> p <- qplot(x, y, geom="point", size=x, color=cat,

+ main="Dot Size and Color Relative to Some Values") +

+ theme(legend.position = "none")

> print(p)

[Figure: scatter plot titled "Dot Size and Color Relative to Some Values"; x-axis x, y-axis y, both ranging from 2.5 to 10]

qplot: Scatter Plot with Regression Line

> set.seed(1410)

> dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

> p <- qplot(carat, price, data = dsmall, geom = c("point", "smooth"),

+ method = "lm")

> print(p)

[Figure: scatter plot of price (y-axis) against carat (x-axis) for dsmall with a fitted linear regression line]

qplot: Scatter Plot with Local Regression Curve (loess)

> p <- qplot(carat, price, data=dsmall, geom=c("point", "smooth"), span=0.4)

> print(p) # Setting 'se=FALSE' removes error shade

[Figure: scatter plot of price (y-axis) against carat (x-axis) for dsmall with a loess curve and its shaded error band]

ggplot Function

More important than qplot to access full functionality of ggplot2

Main arguments

data set, usually a data.frame

aesthetic mappings provided by aes function

General ggplot syntax

ggplot(data, aes(...)) + geom_*() + ... + stat_*() + ...

Layer specifications

geom_*(mapping, data, ..., geom, position)

stat_*(mapping, data, ..., stat, position)

Additional components

scales

coordinates

facet

aes() mappings can be passed on to all components (ggplot, geom_*, etc.). Effects are global when passed on to ggplot() and local for other components.
  x, y
  color: grouping vector (factor)
  group: grouping vector (factor)
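The difference between global and local mappings can be sketched with the iris data set. The example assumes that the ggplot2 package is installed and is skipped otherwise; the object names p_global and p_local are made up for this illustration.

```r
# Sketch; assumes the ggplot2 package is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Global mapping: color is inherited by both layers,
  # so geom_smooth() fits one regression line per species
  p_global <- ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

  # Local mapping: color applies to the points only,
  # so geom_smooth() fits a single line through all points
  p_local <- ggplot(iris, aes(Petal.Length, Petal.Width)) +
    geom_point(aes(color = Species)) +
    geom_smooth(method = "lm", se = FALSE)

  print(p_global)
  print(p_local)
}
```

Placing a mapping in ggplot() is convenient when every layer should share it; placing it in a single geom keeps the grouping from leaking into other layers.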


Changing Plotting Themes with ggplot

Theme settings can be accessed with theme_get()

Their settings can be changed with theme()

Some examples

Change background color to white

... + theme(panel.background=element_rect(fill = "white", colour = "black"))


Storing ggplot Specifications

Plots and layers can be stored in variables

> p <- ggplot(dsmall, aes(carat, price)) + geom_point()

> p # or print(p)

Returns information about data and aesthetic mappings followed by each layer

> summary(p)

Stores a layer with a linear regression line in a variable and adds it to the plot

> bestfit <- geom_smooth(method = "lm", se = FALSE, color = alpha("steelblue", 0.5), size = 2)

> p + bestfit # Plot with custom regression line

Syntax to pass on other data sets

> p %+% diamonds[sample(nrow(diamonds), 100),]

Saves plot stored in variable p to file

> ggsave("myplot.pdf", plot=p)


ggplot: Scatter Plot

> p <- ggplot(dsmall, aes(carat, price, color=color)) +

+ geom_point(size=4)

> print(p)

[Figure: scatter plot of price (y-axis) against carat (x-axis) for dsmall, points colored by diamond color; legend "color" with levels D, E, F, G, H, I, J]

ggplot: Scatter Plot with Regression Line

> p <- ggplot(dsmall, aes(carat, price)) + geom_point() +

+ geom_smooth(method="lm", se=FALSE) +

+ theme(panel.background=element_rect(fill = "white", colour = "black"))

> print(p)

[Figure: scatter plot of price (y-axis) against carat (x-axis) with a linear regression line on a white panel background]

ggplot: Scatter Plot with Several Regression Lines

> p <- ggplot(dsmall, aes(carat, price, group=color)) +

+ geom_point(aes(color=color), size=2) +

+ geom_smooth(aes(color=color), method = "lm", se=FALSE)

> print(p)

[Figure: scatter plot of price vs. carat with one regression line per color level (D to J)]
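With a grouping aesthetic such as group=color, geom_smooth(method="lm") fits a separate least-squares line within each group. The following is a minimal sketch of those per-group fits using lm() directly. It uses the built-in iris data set instead of dsmall (an assumption, so that the block runs without extra packages), and the object names fits and coefs are illustrative only.

```r
# Per-group least-squares fits: the kind of model drawn by
# geom_smooth(method = "lm") when a grouping aesthetic is set.
# iris is used here as a stand-in data set (assumption).
fits <- lapply(split(iris, iris$Species), function(d) {
  lm(Petal.Width ~ Petal.Length, data = d)
})

# One row per group: intercept and slope of each regression line
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
print(coefs)
```

Each row of coefs describes one of the lines that would be drawn for the corresponding group.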

ggplot: Scatter Plot with Local Regression Curve (loess)

> p <- ggplot(dsmall, aes(carat, price)) + geom_point() + geom_smooth()

> print(p) # Setting 'se=FALSE' removes error shade

[Figure: scatter plot of price vs. carat with local regression curve and error shade]
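A local regression curve of this kind can also be computed outside ggplot2 with loess() from the stats package. The sketch below is an illustration under stated assumptions: it uses the built-in cars data set rather than dsmall, the names grid, pred and curve_df are made up for the example, and the band uses a plain 1.96 multiplier, so it only approximates the shaded region ggplot2 would draw.

```r
# Local regression (loess) fitted directly with stats::loess().
# cars stands in for dsmall (assumption) so no extra packages are needed.
fit <- loess(dist ~ speed, data = cars)   # defaults: span = 0.75, degree = 2

# Evaluate the smooth curve and its standard error on a regular grid
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 50))
pred <- predict(fit, newdata = grid, se = TRUE)

# Approximate 95% band around the fitted curve
curve_df <- data.frame(grid,
                       fit   = pred$fit,
                       lower = pred$fit - 1.96 * pred$se.fit,
                       upper = pred$fit + 1.96 * pred$se.fit)
head(curve_df)
```

Dropping the lower and upper columns corresponds to the se=FALSE setting mentioned in the comment above.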

ggplot: Line Plot

> p <- ggplot(iris, aes(Petal.Length, Petal.Width, group=Species,

+ color=Species)) + geom_line()

> print(p)

[Figure: line plot of Petal.Width vs. Petal.Length, colored by Species (setosa, versicolor, virginica)]

ggplot: Faceting

> p <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +

+ geom_line(aes(color=Species), size=1) +

+ facet_wrap(~Species, ncol=1)

> print(p)

[Figure: line plots of Sepal.Width vs. Sepal.Length in three stacked panels, one per Species (setosa, versicolor, virginica)]

Exercise 3: Scatter Plots

Task 1 Generate a scatter plot for the first two columns of the iris data frame and color the dots by its Species column.

Task 2 Use the xlim and ylim functions to set limits on the x- and y-axes so that all data points are restricted to the bottom left quadrant of the plot.

Task 3 Generate the corresponding line plot with faceting to show the individual data sets in separate plots.

Structure of iris data set:

> class(iris)

[1] "data.frame"

> iris[1:4,]

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

> table(iris$Species)

setosa versicolor virginica

50 50 50


ggplot: Bar Plots

Sample Set: the following transforms the iris data set into a ggplot2-friendly format.

Calculate mean values for aggregates given by Species column in iris data set

> iris_mean <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=mean)

Calculate standard deviations for aggregates given by Species column in iris data set

> iris_sd <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=sd)

Convert iris_mean with melt

> library(reshape2) # Defines melt function

> df_mean <- melt(iris_mean, id.vars=c("Species"), variable.name = "Samples", value.name="Values")

Convert iris_sd with melt

> df_sd <- melt(iris_sd, id.vars=c("Species"), variable.name = "Samples", value.name="Values")

Define standard deviation limits

> limits <- aes(ymax = df_mean[,"Values"] + df_sd[,"Values"], ymin=df_mean[,"Values"] - df_sd[,"Values"])
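The limits object above pairs means and standard deviations by row position, so it is only correct while df_mean and df_sd keep the same row order. One possible alternative, sketched here with base R only, is to join both statistics into a single long data frame and store the limits as columns. The helper to_long and the names df_stats, ymin and ymax are hypothetical and not part of the original slides.

```r
# Mean and SD per species, as in the steps above
iris_mean <- aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = mean)
iris_sd   <- aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = sd)

# Hypothetical helper: wide -> long without reshape2,
# one row per Species/Samples combination
to_long <- function(wide, value_name) {
  long <- data.frame(Species = rep(wide$Species, times = ncol(wide) - 1),
                     Samples = rep(names(wide)[-1], each = nrow(wide)),
                     stringsAsFactors = FALSE)
  long[[value_name]] <- unlist(wide[, -1], use.names = FALSE)
  long
}

# Join on the keys so every mean is paired with its own SD,
# independent of row order
df_stats <- merge(to_long(iris_mean, "Mean"), to_long(iris_sd, "SD"),
                  by = c("Species", "Samples"))
df_stats$ymin <- df_stats$Mean - df_stats$SD
df_stats$ymax <- df_stats$Mean + df_stats$SD
head(df_stats)
```

With ggplot2 loaded, these columns could then be mapped directly, for example with geom_errorbar(aes(ymin = ymin, ymax = ymax), position = "dodge").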


ggplot: Bar Plot

> p <- ggplot(df_mean, aes(Samples, Values, fill = Species)) +

+ geom_bar(position="dodge", stat="identity")

> print(p)

[Figure: dodged bar plot of Values by Samples (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), filled by Species]

ggplot: Bar Plot Sideways

> p <- ggplot(df_mean, aes(Samples, Values, fill = Species)) +

+ geom_bar(position="dodge", stat="identity") + coord_flip() +

+ theme(axis.text.y=element_text(angle=0, hjust=1))

> print(p)

[Figure: horizontal dodged bar plot of Values by Samples (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), filled by Species]

ggplot: Bar Plot with Faceting

> p <- ggplot(df_mean, aes(Samples, Values)) + geom_bar(aes(fill = Species), stat="identity") +

+ facet_wrap(~Species, ncol=1)

> print(p)

[Figure: bar plots of Values by Samples in three stacked panels, one per Species (setosa, versicolor, virginica)]

ggplot: Bar Plot with Error Bars

> p <- ggplot(df_mean, aes(Samples, Values, fill = Species)) +

+ geom_bar(position="dodge", stat="identity") + geom_errorbar(limits, position="dodge")

> print(p)

[Figure: dodged bar plot of Values by Samples with standard deviation error bars, filled by Species]

ggplot: Changing Color Settings

> library(RColorBrewer)

> # display.brewer.all()

> p <- ggplot(df_mean, aes(Samples, Values, fill=Species, color=Species)) +

+ geom_bar(position="dodge", stat="identity") + geom_errorbar(limits, position="dodge") +

+ scale_fill_brewer(palette="Blues") + scale_color_brewer(palette = "Greys")

> print(p)

[Figure: dodged bar plot with error bars, using the "Blues" fill palette and the "Greys" outline palette for Species]

ggplot: Using Standard Colors

> p <- ggplot(df_mean, aes(Samples, Values, fill=Species, color=Species)) +

+ geom_bar(position="dodge", stat="identity") + geom_errorbar(limits, position="dodge") +

+ scale_fill_manual(values=c("red", "green3", "blue")) +

+ scale_color_manual(values=c("red", "green3", "blue"))

> print(p)

[Figure: dodged bar plot with error bars, Species colored red, green3 and blue]

ggplot: Mirrored Bar Plots

> df <- data.frame(group = rep(c("Above", "Below"), each=10), x = rep(1:10, 2), y = c(runif(10, 0, 1), runif(10, -1, 0)))

> p <- ggplot(df, aes(x=x, y=y, fill=group)) +

+ geom_bar(stat="identity", position="identity")

> print(p)

[Figure: mirrored bar plot of y vs. x, filled by group (Above, Below)]

Exercise 4: Bar Plots

Task 1 Calculate the mean values for the Species components of the first four columns in the iris data set. Use the melt function from the reshape2 package to bring the results into the expected format for ggplot.

Task 2 Generate two bar plots: one with stacked bars and one with horizontally arranged bars.

Structure of iris data set:

> class(iris)

[1] "data.frame"

> iris[1:4,]

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

> table(iris$Species)

setosa versicolor virginica

50 50 50


ggplot: Data Reformatting Example for Line Plot

> y <- matrix(rnorm(500), 100, 5, dimnames=list(paste("g", 1:100, sep=""), paste("Sample", 1:5, sep="")))

> y <- data.frame(Position=1:length(y[,1]), y)

> y[1:4, ] # First rows of input format expected by melt()

Position Sample1 Sample2 Sample3 Sample4 Sample5

g1 1 1.0002088 0.6850199 -0.21324932 1.27195056 1.0479301

g2 2 -1.2024596 -1.5004962 -0.01111579 0.07584497 -0.7100662

g3 3 0.1023678 -0.5153367 0.28564390 1.41522878 1.1084695

g4 4 1.3294248 -1.2084007 -0.19581898 -0.42361768 1.7139697

> df <- melt(y, id.vars=c("Position"), variable.name = "Samples", value.name="Values")

> p <- ggplot(df, aes(Position, Values)) + geom_line(aes(color=Samples)) + facet_wrap(~Samples, ncol=1)

> print(p)

> ## Represent same data in box plot

> ## ggplot(df, aes(Samples, Values, fill=Samples)) + geom_boxplot()

[Figure: line plots of Values vs. Position in five stacked panels, Sample1 to Sample5]

ggplot: Jitter Plots

> p <- ggplot(dsmall, aes(color, price/carat)) +

+ geom_jitter(alpha = I(1 / 2), aes(color=color))

> print(p)

[Figure: jitter plot of price/carat by color (D to J)]

ggplot: Box Plots

> p <- ggplot(dsmall, aes(color, price/carat, fill=color)) + geom_boxplot()

> print(p)

[Figure: box plots of price/carat by color (D to J)]

ggplot: Density Plot with Line Coloring

> p <- ggplot(dsmall, aes(carat)) + geom_density(aes(color = color))

> print(p)

[Figure: density plot of carat, one line per color level (D to J)]

ggplot: Density Plot with Area Coloring

> p <- ggplot(dsmall, aes(carat)) + geom_density(aes(fill = color))

> print(p)

[Figure: density plot of carat, one filled area per color level (D to J)]

ggplot: Histograms

> p <- ggplot(iris, aes(x=Sepal.Width)) + geom_histogram(aes(y = ..density..,

+ fill = ..count..), binwidth=0.2) + geom_density()

> print(p)

[Figure: histogram of Sepal.Width on the density scale, bars filled by count, with overlaid density curve]
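The ..density.. variable used in the histogram call rescales the bin counts so that the bar areas sum to one, which puts the histogram on the same scale as the density curve (newer ggplot2 releases write the same variable as after_stat(density)). The relationship can be checked with base R as sketched below. The bin edges are an assumption chosen to span the data with a bin width of 0.2 and need not match the edges ggplot2 picks.

```r
# Density scale of a histogram: density = count / (n * binwidth),
# so the bar areas sum to 1.
x      <- iris$Sepal.Width
breaks <- seq(1.8, 4.6, by = 0.2)        # bin width 0.2; edges are an assumption
h      <- hist(x, breaks = breaks, plot = FALSE)

dens_by_hand <- h$counts / (length(x) * 0.2)
total_area   <- sum(h$density * diff(h$breaks))

all.equal(dens_by_hand, h$density)       # manual rescaling agrees with hist()
total_area                               # total bar area
```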

ggplot: Pie Chart

> df <- data.frame(variable=rep(c("cat", "mouse", "dog", "bird", "fly")),

+ value=c(1,3,3,4,2))

> p <- ggplot(df, aes(x = "", y = value, fill = variable)) +

+ geom_bar(width = 1, stat="identity") +

+ coord_polar("y", start=pi / 3) + ggtitle("Pie Chart")

> print(p)

[Figure: pie chart of value by variable (bird, cat, dog, fly, mouse), titled "Pie Chart"]

ggplot: Wind Rose Pie Chart

> p <- ggplot(df, aes(x = variable, y = value, fill = variable)) +

+ geom_bar(width = 1, stat="identity") + coord_polar("y", start=pi / 3) +

+ ggtitle("Pie Chart")

> print(p)

[Figure: wind rose chart of value by variable (bird, cat, dog, fly, mouse), titled "Pie Chart"]

ggplot: Arranging Graphics on One Page

> library(grid)

> a <- ggplot(dsmall, aes(color, price/carat)) + geom_jitter(size=4, alpha = I(1 / 1.5), aes(color=color))

> b <- ggplot(dsmall, aes(color, price/carat, color=color)) + geom_boxplot()

> c <- ggplot(dsmall, aes(color, price/carat, fill=color)) + geom_boxplot() + theme(legend.position = "none")

> grid.newpage() # Open a new page on grid device

> pushViewport(viewport(layout = grid.layout(2, 2))) # Assign to device viewport with 2 by 2 grid layout

> print(a, vp = viewport(layout.pos.row = 1, layout.pos.col = 1:2))

> print(b, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))

> print(c, vp = viewport(layout.pos.row = 2, layout.pos.col = 2, width=0.3, height=0.3, x=0.8, y=0.8))

[Figure: plots a, b and c arranged on one page; the jitter plot of price/carat by color spans the top row, the two box plots fill the bottom row]

ggplot: Inserting Graphics into Plots

> # pdf("insert.pdf")

> print(a)

> print(b, vp=viewport(width=0.3, height=0.3, x=0.8, y=0.8))

> # dev.off()

[Figure: jitter plot of price/carat by color with the box plot inserted as a small panel in the upper right corner]

Specialty Graphics

Venn Diagrams (Code)

> source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R")

> setlist5 <- list(A=sample(letters, 18), B=sample(letters, 16), C=sample(letters, 20), D=sample(letters, 22), E=sample(letters, 18))

> OLlist5 <- overLapper(setlist=setlist5, sep="_", type="vennsets")

> counts <- sapply(OLlist5$Venn_List, length)

> # pdf("venn.pdf")

> vennPlot(counts=counts, ccol=c(rep(1,30),2), lcex=1.5, ccex=c(rep(1.5,5), rep(0.6,25),1.5))

> # dev.off()
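The counts passed to vennPlot are the numbers of items that fall into each exclusive combination of sets. The sketch below shows how such counts can be derived with base R alone. It is not the overLapper implementation; it uses three sets instead of five to keep the output short, and the fixed seed and the names items, member, combo and counts are assumptions added for illustration.

```r
set.seed(1)  # fixed seed (assumption) so the example is reproducible
setlist <- list(A = sample(letters, 18), B = sample(letters, 16), C = sample(letters, 20))

# Membership matrix: rows = items, columns = sets
items  <- sort(unique(unlist(setlist)))
member <- sapply(setlist, function(s) items %in% s)
rownames(member) <- items

# Label every item with the exact combination of sets that contain it, e.g. "A_C"
combo  <- apply(member, 1, function(m) paste(colnames(member)[m], collapse = "_"))
counts <- table(combo)
print(counts)
```

Every item is counted exactly once, so summing the counts of all combinations that include a given set recovers the size of that set.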


Venn Diagram (Plot)

[Figure: Venn diagram of the five sets A to E. Unique objects: All = 26; S1 = 18; S2 = 16; S3 = 20; S4 = 22; S5 = 18]

Compound Depictions with ChemmineR

> library(ChemmineR)

> data(sdfsample)

> plot(sdfsample[1], print=FALSE)

[Figure: structure depiction of compound CMP1]

Genome Graphics

ggbio: A Programmable Genome Browser

A genome browser is a visualization tool for plotting different types of genomic data in separate tracks along chromosomes.

The ggbio package (Yin et al., 2012) facilitates plotting of complex genome data objects, such as read alignments (SAM/BAM), genomic context/annotation information (gff/txdb), variant calls (VCF/BCF), and more. To easily compare these data sets, it extends the faceting facility of ggplot2 to genome browser-like tracks.

Most of the core object types for handling genomic data with R/Bioconductor are supported: GRanges, GAlignments, VCF, etc. For more details, see Table 1.1 of the ggbio vignette.

ggbio's convenience plotting function is autoplot. For more customizable plots, one can use the generic ggplot function.

Apart from the standard ggplot2 plotting components, ggbio defines several new components useful for genomic data visualization. A detailed list is given in Table 1.2 of the vignette.

Useful web sites:

ggbio manual

ggbio functions

autoplot demo

Tracks: Aligning Plots Along Chromosomes

> library(ggbio)

> df1 <- data.frame(time = 1:100, score = sin((1:100)/20)*10)

> p1 <- qplot(data = df1, x = time, y = score, geom = "line")

> df2 <- data.frame(time = 30:120, score = sin((30:120)/20)*10, value = rnorm(120-30 +1))

> p2 <- ggplot(data = df2, aes(x = time, y = score)) + geom_line() + geom_point(size = 2, aes(color = value))

> tracks(time1 = p1, time2 = p2) + xlim(1, 40) + theme_tracks_sunset()

[Figure: two aligned tracks, time1 (score as line) and time2 (score as line with points colored by value), sharing the x-axis range 1 to 40]

Plotting Genomic Ranges

GRanges objects are essential for storing alignment or annotation ranges in R/Bioconductor. The following creates a sample GRanges object and plots its content.

> library(GenomicRanges)

> set.seed(1); N <- 100; gr <- GRanges(seqnames = sample(c("chr1", "chr2", "chr3"), size = N, replace = TRUE), IRanges(start = sample(1:300, size = N, replace = TRUE), width = sample(70:75, size = N,replace = TRUE)), strand = sample(c("+", "-"), size = N, replace = TRUE), value = rnorm(N, 10, 3), score = rnorm(N, 100, 30), sample = sample(c("Normal", "Tumor"), size = N, replace = TRUE), pair = sample(letters, size = N, replace = TRUE))

> autoplot(gr, aes(color = strand, fill = strand), facets = strand ~ seqnames)

[Figure: ranges plotted in panels by strand (+, -) and chromosome (chr1, chr2, chr3); x-axis from 0 bp to 300 bp]

Plotting Coverage Instead of Ranges

> autoplot(gr, aes(color = strand, fill = strand), facets = strand ~ seqnames, stat = "coverage")

[Figure: Coverage plotted in panels by strand (+, -) and chromosome (chr1, chr2, chr3); x-axis from 0 bp to 300 bp]

Mirrored Coverage Plot

> pos <- sapply(coverage(gr[strand(gr)=="+"]), as.numeric)

> pos <- data.frame(Chr=rep(names(pos), sapply(pos, length)), Strand=rep("+", length(unlist(pos))), Position=unlist(sapply(pos, function(x) 1:length(x))), Coverage=as.numeric(unlist(pos)))

> neg <- sapply(coverage(gr[strand(gr)=="-"]), as.numeric)

> neg <- data.frame(Chr=rep(names(neg), sapply(neg, length)), Strand=rep("-", length(unlist(neg))), Position=unlist(sapply(neg, function(x) 1:length(x))), Coverage=-as.numeric(unlist(neg)))

> covdf <- rbind(pos, neg)

> p <- ggplot(covdf, aes(Position, Coverage, fill=Strand)) +

+ geom_bar(stat="identity", position="identity") + facet_wrap(~Chr)

> p

[Figure: mirrored bar plot of Coverage vs. Position for chr1, chr2 and chr3, filled by Strand; + strand above zero, - strand below]
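The mirrored layout only requires a per-position coverage vector for each strand, with the minus-strand values negated. As a rough, package-free illustration of what a coverage computation does, the sketch below counts overlapping ranges with a difference array. The helper cov_vec, the simulated ranges and the single-chromosome setup are assumptions for this example and are not how GenomicRanges implements coverage().

```r
set.seed(1)  # fixed seed (assumption) for reproducibility
n      <- 20
start  <- sample(1:300, n, replace = TRUE)
width  <- sample(70:75, n, replace = TRUE)
strand <- sample(c("+", "-"), n, replace = TRUE)

# Hypothetical helper: per-position coverage from start/width pairs
cov_vec <- function(start, width, len) {
  delta <- numeric(len + 1)
  for (i in seq_along(start)) {
    end <- start[i] + width[i] - 1
    delta[start[i]] <- delta[start[i]] + 1   # a range opens at 'start'
    delta[end + 1]  <- delta[end + 1] - 1    # and closes after 'end'
  }
  cumsum(delta)[seq_len(len)]
}

len <- max(start + width - 1)
covdf <- rbind(
  data.frame(Strand = "+", Position = seq_len(len),
             Coverage =  cov_vec(start[strand == "+"], width[strand == "+"], len)),
  data.frame(Strand = "-", Position = seq_len(len),
             Coverage = -cov_vec(start[strand == "-"], width[strand == "-"], len))
)
head(covdf)
```

A data frame of this shape can be passed to the same ggplot call as covdf above, minus the facet_wrap(~Chr) step because only one sequence is simulated.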

Circular Layout

> autoplot(gr, layout = "circle", aes(fill = seqnames))

[Figure: circular layout of the ranges, filled by seqnames (chr1, chr2, chr3)]

More Complex Circular Example

> seqlengths(gr) <- c(400, 500, 700)

> values(gr)$to.gr <- gr[sample(1:length(gr), size = length(gr))]

> idx <- sample(1:length(gr), size = 50)

> gr <- gr[idx]

> ggplot() + layout_circle(gr, geom = "ideo", fill = "gray70", radius = 7, trackWidth = 3) +

+ layout_circle(gr, geom = "bar", radius = 10, trackWidth = 4,

+ aes(fill = score, y = score)) +

+ layout_circle(gr, geom = "point", color = "red", radius = 14,

+ trackWidth = 3, grid = TRUE, aes(y = score)) +

+ layout_circle(gr, geom = "link", linked.to = "to.gr", radius = 6, trackWidth = 1)

[Figure: circular plot with ideogram, bar (filled by score), point and link tracks]

Viewing Alignments and Variants

To make the following example work, please download and unpack the accompanying data archive containing GFF, BAM and VCF sample files.

> library(rtracklayer); library(GenomicFeatures); library(Rsamtools); library(VariantAnnotation)

> ga <- readGAlignmentsFromBam("./data/SRR064167.fastq.bam", use.names=TRUE, param=ScanBamParam(which=GRanges("Chr5", IRanges(4000, 8000))))

> p1 <- autoplot(ga, geom = "rect")

> p2 <- autoplot(ga, geom = "line", stat = "coverage")

> vcf <- readVcf(file="data/varianttools_gnsap.vcf", genome="ATH1")

> p3 <- autoplot(vcf[seqnames(vcf)=="Chr5"], type = "fixed") + xlim(4000, 8000) + theme(legend.position = "none", axis.text.y = element_blank(), axis.ticks.y=element_blank())

> txdb <- makeTranscriptDbFromGFF(file="./data/TAIR10_GFF3_trunc.gff", format="gff3")

> p4 <- autoplot(txdb, which=GRanges("Chr5", IRanges(4000, 8000)), names.expr = "gene_id")

> tracks(Reads=p1, Coverage=p2, Variant=p3, Transcripts=p4, heights = c(0.3, 0.2, 0.1, 0.35)) + ylab("")

[Figure: four aligned tracks (Reads, Coverage, Variant, Transcripts) for Chr5 from 3 kb to 8 kb; transcripts shown for AT5G01010, AT5G01015 and AT5G01020]

Additional Sample Plots

autoplot demo

Additional Genome Graphics

Additional Packages for Visualizing Genome Data

Gviz

RCircos (Zhang et al., 2013)

GenomeGraphs

genoPlotR

Clustering

What is Clustering?

Clustering is the classification/partitioning of data objects into similarity groups (clusters) according to a defined distance measure.

It is used in many fields, such as machine learning, data mining, pattern recognition, image analysis, genomics, systems biology, etc.

Machine learning typically regards data clustering as a form of unsupervised learning.

Why Clustering and Data Mining in R?

Efficient data structures and functions for clustering.

Efficient environment for algorithm prototyping and benchmarking.

Comprehensive set of clustering and machine learning libraries.

Standard for data analysis in many areas.

Overview: Clustering Task View on CRAN


Data Transformations: Choice depends on data set!

Center & standardize
1 Center: subtract vector mean from each value
2 Standardize: divide by standard deviation

⇒ Mean = 0 and STDEV = 1

Center & scale with the scale() function
1 Center: subtract vector mean from each value
2 Scale: divide centered vector by its root mean square (rms)

$x_{rms} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} x_i^2}$

⇒ Mean = 0 and STDEV = 1

Log transformation

Rank transformation: replace measured values by ranks

No transformation
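As a minimal sketch of the centering and scaling steps above, the following R code standardizes the rows of a small random matrix with scale() and compares the result with a manual calculation. The matrix x and the names xs, xc and xm are made up for illustration.

```r
set.seed(1)
# Hypothetical example data: 4 items (rows) measured under 5 conditions (columns)
x <- matrix(rnorm(20, mean=5, sd=3), nrow=4)

# scale() works column-wise, so transpose to center and scale each row
xs <- t(scale(t(x)))

# Manual equivalent: subtract the row means, then divide by the root mean
# square of the centered values (with n - 1 in the denominator)
xc <- x - rowMeans(x)
xm <- xc / sqrt(rowSums(xc^2) / (ncol(x) - 1))

all.equal(xs, xm, check.attributes=FALSE)  # compares both results
round(rowMeans(xs), 10)                    # row means after scaling
apply(xs, 1, sd)                           # row standard deviations after scaling
```

The transposition is only needed because the items are stored in rows here; for column-wise scaling scale(x) can be called directly.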


Distance Methods: List of most common ones!

Euclidean distance for two profiles X and Y

$d(X,Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Disadvantages: not scale invariant, not for negative correlations

Maximum, Manhattan, Canberra, binary, Minkowski, ...

Correlation-based distance: 1 − r

Pearson correlation coefficient (PCC)

$r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left(n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right) \left(n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right)}}$

Disadvantage: outlier sensitive

Spearman correlation coefficient (SCC)

Same calculation as PCC but with ranked values!
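The distance measures above can be tried out with a short sketch like the following. It uses a made-up random matrix m (rows as items) and the base R functions dist() and cor(); the object names are placeholders chosen for this illustration.

```r
set.seed(1)
# Hypothetical example data: 6 items (rows) with 5 measurements each
m <- matrix(rnorm(30), nrow=6, dimnames=list(paste0("g", 1:6), paste0("t", 1:5)))

# Euclidean distance between all pairs of rows
d_euc <- dist(m, method="euclidean")

# Correlation-based distances (1 - r); cor() correlates columns, so transpose
d_pcc <- as.dist(1 - cor(t(m), method="pearson"))
d_scc <- as.dist(1 - cor(t(m), method="spearman"))

# Check the Euclidean formula by hand for the first two items
manual <- sqrt(sum((m["g1", ] - m["g2", ])^2))
all.equal(as.matrix(d_euc)["g1", "g2"], manual)

# Spearman is the Pearson correlation computed on ranked values
all.equal(cor(m["g1", ], m["g2", ], method="spearman"),
          cor(rank(m["g1", ]), rank(m["g2", ])))
```

Other methods listed above, such as "manhattan" or "canberra", can be selected through the method argument of dist().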


There Are Many More Distance Measures

If the distances among items are quantifiable, then clustering is possible.

Choose the most accurate and meaningful distance measure for a given field of application.

If uncertain then choose several distance measures and compare the results.


Cluster Linkage

Single Linkage

Complete Linkage

Average Linkage
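In hclust(), single, complete and average linkage define the distance between two clusters as the smallest, largest and mean pairwise distance between their members, respectively. The sketch below, which assumes a made-up random matrix m, runs all three on the same distance matrix so that the merge heights can be compared.

```r
set.seed(1)
# Hypothetical example data: 10 items (rows) with 5 measurements each
m <- matrix(rnorm(50), nrow=10, dimnames=list(paste0("g", 1:10), paste0("t", 1:5)))
d <- dist(m)

hs  <- hclust(d, method="single")    # smallest pairwise distance
hcp <- hclust(d, method="complete")  # largest pairwise distance
hav <- hclust(d, method="average")   # mean pairwise distance

# The first merge always joins the closest pair of items,
# so its height is identical for the three methods
c(single=hs$height[1], complete=hcp$height[1], average=hav$height[1], closest=min(d))

# Later merge heights depend on the linkage method
rbind(single=hs$height, complete=hcp$height, average=hav$height)
```

Plotting the three trees side by side, for instance with plot(hs), is a quick way to see how strongly the linkage choice affects the result.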


Hierarchical Clustering Steps

1 Identify clusters (items) with closest distance

2 Join them to new clusters

3 Compute distance between clusters (items)

4 Return to step 1
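The four steps above can be sketched as a naive loop. This is only an illustration under stated assumptions: it uses single linkage, a made-up matrix m and placeholder names (clusters, heights), and it is far slower than hclust(), which should be used in practice.

```r
set.seed(1)
# Hypothetical example data: 5 items (rows) with 5 measurements each
m <- matrix(rnorm(25), nrow=5, dimnames=list(paste0("g", 1:5), NULL))
d <- as.matrix(dist(m))

clusters <- as.list(seq_len(nrow(d)))  # every item starts as its own cluster
heights <- numeric(0)                  # distance at which each merge happens

while (length(clusters) > 1) {         # step 4: repeat until one cluster is left
  best <- c(1, 2)
  best_d <- Inf
  # Steps 3 and 1: compute the single-linkage distance between all pairs of
  # clusters and remember the closest pair
  for (i in seq_len(length(clusters) - 1)) {
    for (j in (i + 1):length(clusters)) {
      dij <- min(d[clusters[[i]], clusters[[j]]])
      if (dij < best_d) {
        best_d <- dij
        best <- c(i, j)
      }
    }
  }
  # Step 2: join the closest pair into a new cluster
  clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters[[best[2]]] <- NULL
  heights <- c(heights, best_d)
}

# The merge heights agree with those reported by hclust()
all.equal(heights, hclust(dist(m), method="single")$height)
```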


Hierarchical Clustering Agglomerative Approach

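[Figure: dendrogram for items g1 to g5 built up in three panels. (a) first merge at height 0.1; (b) merges at heights 0.1 and 0.4; (c) complete tree with merges at heights 0.1, 0.4, 0.5 and 0.6]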


Hierarchical Clustering Result with Heatmap

[Figure: heatmap of genes g1 to g10 (rows) and treatments t1 to t5 (columns) with row and column dendrograms; the color key shows row Z-scores from −1 to 1 together with a count histogram]

A heatmap is a color coded table. To visually identify patterns, the rows and columns of a heatmap are often sorted by hierarchical clustering trees.

In case of gene expression data, the row tree usually represents the genes, the column tree the treatments and the colors in the heat table represent the intensities or ratios of the underlying gene expression data set.


Hierarchical Clustering Approaches in R

1 Agglomerative approach (bottom-up)

hclust() and agnes()

2 Divisive approach (top-down)

diana()
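A minimal sketch comparing the three functions named above on the same distance matrix. agnes() and diana() come from the cluster package; the random matrix m and the object names are assumptions made for this illustration.

```r
library(cluster)  # provides agnes() and diana()

set.seed(1)
# Hypothetical example data: 10 items (rows) with 5 measurements each
m <- matrix(rnorm(50), nrow=10, dimnames=list(paste0("g", 1:10), paste0("t", 1:5)))
d <- dist(m)

hc <- hclust(d, method="complete")  # agglomerative (stats package)
ag <- agnes(d, method="complete")   # agglomerative (cluster package)
di <- diana(d)                      # divisive

# Convert to hclust objects so that all trees can be cut the same way
cl_hc <- cutree(hc, k=3)
cl_ag <- cutree(as.hclust(ag), k=3)
cl_di <- cutree(as.hclust(di), k=3)

# Cross-tabulate the agglomerative and divisive groupings
table(hclust=cl_hc, diana=cl_di)
```

Because the trees are built in opposite directions, the agglomerative and divisive groupings do not have to agree.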


Tree Cutting to Obtain Discrete Clusters

1 Node height in tree

2 Number of clusters

3 Search tree nodes by distance cutoff
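The first two options above can be sketched with cutree(), which accepts either a height cutoff (h) or a number of clusters (k). The data matrix m and the value k=4 are arbitrary choices for illustration.

```r
set.seed(1)
# Hypothetical example data: 20 items (rows) with 5 measurements each
m <- matrix(rnorm(100), nrow=20, dimnames=list(paste0("g", 1:20), paste0("t", 1:5)))
hr <- hclust(dist(m), method="complete")

# Option 1: cut at a given node height
cutoff <- max(hr$height) / 1.5
cl_h <- cutree(hr, h=cutoff)

# Option 2: request a fixed number of clusters
cl_k <- cutree(hr, k=4)

table(cl_h)  # cluster sizes for the height cut
table(cl_k)  # cluster sizes for k = 4

# A height cut yields one cluster more than there are merges above the cutoff
sum(hr$height > cutoff) + 1
```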


Example: hclust and heatmap.2

> library(gplots)

> y <- matrix(rnorm(500), 100, 5, dimnames=list(paste("g", 1:100, sep=""), paste("t", 1:5, sep="")))

> heatmap.2(y) # Shortcut to final result

[Figure: heatmap of the 100 x 5 matrix y with row dendrogram (g1 to g100) and column dendrogram (t1 to t5); the color key shows values from −2 to 2 together with a count histogram]


Example: Stepwise Approach with Tree Cutting

> ## Row- and column-wise clustering

> hr <- hclust(as.dist(1-cor(t(y), method="pearson")), method="complete")

> hc <- hclust(as.dist(1-cor(y, method="spearman")), method="complete")

> ## Tree cutting

> mycl <- cutree(hr, h=max(hr$height)/1.5)

> mycolhc <- rainbow(length(unique(mycl)), start=0.1, end=0.9)

> mycolhc <- mycolhc[as.vector(mycl)]

> ## Plot heatmap

> mycol <- colorpanel(40, "darkblue", "yellow", "white") # or try redgreen(75)

> heatmap.2(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=mycol, scale="row", density.info="none", trace="none", RowSideColors=mycolhc)

[Figure: row-scaled heatmap of y with row dendrogram (g1 to g100), column dendrogram (t1 to t5) and a side color bar marking the row clusters obtained by tree cutting; the color key shows row Z-scores from −1.5 to 1]


K-Means Clustering

1 Choose the number of clusters k

2 Randomly assign items to the k clusters

3 Calculate new centroid for each of the k clusters

4 Calculate the distance of all items to the k centroids

5 Assign items to closest centroid

6 Repeat steps 3 to 5 until cluster assignments are stable
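The six steps above can be sketched in a few lines of R. This is a naive illustration, not the algorithm used inside kmeans(): the two-group data set x, the choice k=2 and all object names are assumptions, and the sketch does not handle clusters that become empty.

```r
set.seed(1)
# Hypothetical example data: two groups of 20 points in two dimensions
x <- rbind(matrix(rnorm(40, mean=0), ncol=2),
           matrix(rnorm(40, mean=5), ncol=2))

k <- 2                                             # step 1: choose k
cl <- sample(rep(seq_len(k), length.out=nrow(x)))  # step 2: random assignment

repeat {
  # Step 3: centroid (column means) of each cluster, one row per cluster
  centroids <- t(sapply(seq_len(k), function(i) colMeans(x[cl == i, , drop=FALSE])))
  # Step 4: Euclidean distance of every item (rows) to every centroid (columns)
  d <- as.matrix(dist(rbind(centroids, x)))[-seq_len(k), seq_len(k)]
  # Step 5: assign each item to its closest centroid
  new_cl <- max.col(-d, ties.method="first")
  # Step 6: stop once the assignments no longer change
  if (all(new_cl == cl)) break
  cl <- new_cl
}

table(cl)  # cluster sizes
```

In practice kmeans() is preferable, since it offers several algorithms and multiple random starts through its nstart argument.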


K-Means

[Figure: k-means illustration in three panels (a), (b) and (c)]


Example: Clustering with kmeans Function

> km <- kmeans(t(scale(t(y))), 3)

> km$cluster

g1 g2 g3 g4 g5 g6 g7 g8 g9 g10 g11 g12 g13 g14 g15 g16 g17 g18 g19 g20 g21 g22 g23 g24 g25 g26 g27 g28 g29 g30 g31 g32 g33 g34 g35 g36 g37 g38 g39 g40 g41 g42 g43 g44

2 1 2 3 1 3 1 2 2 1 3 1 3 2 3 3 1 2 1 3 2 1 3 2 1 2 2 1 2 3 1 3 1 2 2 3 1 1 3 2 1 1 3 3

g45 g46 g47 g48 g49 g50 g51 g52 g53 g54 g55 g56 g57 g58 g59 g60 g61 g62 g63 g64 g65 g66 g67 g68 g69 g70 g71 g72 g73 g74 g75 g76 g77 g78 g79 g80 g81 g82 g83 g84 g85 g86 g87 g88

3 3 3 2 2 2 1 2 1 3 2 3 1 1 2 3 3 1 2 3 2 1 3 1 3 2 3 3 3 3 3 1 1 2 3 2 3 3 3 3 2 1 2 2

g89 g90 g91 g92 g93 g94 g95 g96 g97 g98 g99 g100

3 1 1 3 2 1 2 1 1 3 3 2


Fuzzy C-Means Clustering

1 In contrast to strict (hard) clustering approaches, fuzzy (soft) clustering methods allow multiple cluster memberships of the clustered items.

2 This is commonly achieved by assigning to each item a weight of belonging to each cluster.

3 Thus, items on the edge of a cluster may be in the cluster to a lesser degree than items in the center of a cluster.

4 Typically, each item has as many coefficients (weights) as there are clusters, and these sum to one for each item.
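The properties listed above can be checked with a short sketch based on fanny() from the cluster package. The random matrix y mirrors the one used elsewhere in these slides but is regenerated here with an arbitrary seed; the object name f is a placeholder.

```r
library(cluster)  # provides fanny()

set.seed(1)
# Hypothetical example data: 100 items (rows) with 5 measurements each
y <- matrix(rnorm(500), 100, 5,
            dimnames=list(paste0("g", 1:100), paste0("t", 1:5)))

f <- fanny(y, k=4, metric="euclidean", memb.exp=1.2)

dim(f$membership)             # one weight per item and cluster
range(rowSums(f$membership))  # weights of each item sum to one

# Hard cluster assignment reported by fanny() next to the largest weight per item
head(cbind(clustering=f$clustering, max_weight=round(apply(f$membership, 1, max), 2)))
```

Larger values of memb.exp give fuzzier memberships, while values close to 1 approach a hard clustering.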


Example: Fuzzy Clustering with fanny

> library(cluster) # Loads the cluster library.

> fannyy <- fanny(y, k=4, metric = "euclidean", memb.exp = 1.2)

> round(fannyy$membership, 2)[1:4,]

[,1] [,2] [,3] [,4]

g1 0.82 0.04 0.10 0.05

g2 0.82 0.05 0.12 0.01

g3 0.98 0.01 0.01 0.01

g4 0.03 0.82 0.03 0.12

> fannyy$clustering

g1 g2 g3 g4 g5 g6 g7 g8 g9 g10 g11 g12 g13 g14 g15 g16 g17 g18 g19 g20 g21 g22 g23 g24 g25 g26 g27 g28 g29 g30 g31 g32 g33 g34 g35 g36 g37 g38 g39 g40 g41 g42 g43 g44

1 1 1 2 3 4 1 1 1 1 1 3 4 4 2 4 3 1 2 2 4 3 4 4 3 4 4 3 3 1 3 2 3 1 4 4 4 3 2 1 3 3 4 4

g45 g46 g47 g48 g49 g50 g51 g52 g53 g54 g55 g56 g57 g58 g59 g60 g61 g62 g63 g64 g65 g66 g67 g68 g69 g70 g71 g72 g73 g74 g75 g76 g77 g78 g79 g80 g81 g82 g83 g84 g85 g86 g87 g88

2 4 2 4 1 1 3 1 3 4 1 2 3 3 3 2 2 3 3 4 4 2 2 3 4 1 2 1 2 4 4 1 3 1 4 1 2 4 4 4 3 3 1 3

g89 g90 g91 g92 g93 g94 g95 g96 g97 g98 g99 g100

2 3 3 2 4 3 1 2 1 4 4 4


Principal Component Analysis (PCA)

Principal component analysis (PCA) is a data reduction technique that simplifies multidimensional data sets to 2 or 3 dimensions for plotting purposes and visual variance analysis.


Basic PCA Steps

Center (and standardize) data

First principal component axis

Passes through the centroid of the data cloud
Distance of each point to that line is minimized, so that it crosses the maximum variation of the data cloud

Second principal component axis

Orthogonal to first principal component
Along the maximum remaining variation in the data

1st PCA axis becomes x-axis and 2nd PCA axis y-axis

Continue process until the necessary number of principal components is obtained
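One way to make these steps concrete is to compute the principal components by hand and compare them with prcomp(). The sketch below assumes a made-up random matrix x and uses an eigen decomposition of the covariance matrix; prcomp() itself relies on a singular value decomposition, so this is an equivalent route rather than its actual implementation.

```r
set.seed(1)
# Hypothetical example data: 50 items (rows) with 4 variables (columns)
x <- matrix(rnorm(200), 50, 4)

# Center and standardize the columns
xs <- scale(x)

# Eigenvectors of the covariance matrix are the principal component axes,
# ordered by the variance (eigenvalue) they capture
e <- eigen(cov(xs))
scores <- xs %*% e$vectors  # coordinates of the items on the new axes

pca <- prcomp(x, scale.=TRUE)

all.equal(sqrt(e$values), pca$sdev)                         # same standard deviations
all.equal(abs(scores), abs(pca$x), check.attributes=FALSE)  # same scores up to sign
round(crossprod(pca$rotation), 10)                          # axes are orthogonal
```

The comparison uses absolute values because the sign of each principal component axis is arbitrary.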


PCA on Two-Dimensional Data Set

[Figure: two panels of a two-dimensional data set, each with the 1st and 2nd principal component axes labeled]


Identifies the Amount of Variability between Components

Example

Principal Component      1st   2nd   3rd   Other
Proportion of Variance   62%   34%   3%    rest

1st and 2nd principal components explain 96% of variance.
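In the example above the first two components together explain 62% + 34% = 96% of the variance. As a hedged sketch of how such proportions are obtained in R, the following code derives them from the standard deviations returned by prcomp(). It uses a made-up random matrix y, so its numbers will differ from the table above.

```r
set.seed(1)
# Hypothetical example data: 100 items (rows) with 5 variables (columns)
y <- matrix(rnorm(500), 100, 5,
            dimnames=list(paste0("g", 1:100), paste0("t", 1:5)))

pca <- prcomp(y, scale.=TRUE)

# Variance of each component is its squared standard deviation
prop <- pca$sdev^2 / sum(pca$sdev^2)

round(prop, 4)          # proportion of variance per component
round(cumsum(prop), 4)  # cumulative proportion
sum(prop[1:2])          # share explained by the first two components
```

The same values are reported in the "Proportion of Variance" and "Cumulative Proportion" rows of summary(pca).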


Example: PCA

> pca <- prcomp(y, scale.=TRUE)

> summary(pca) # Prints variance summary for all principal components

Importance of components:

PC1 PC2 PC3 PC4 PC5

Standard deviation 1.0996 1.0505 1.0247 0.9219 0.8873

Proportion of Variance 0.2418 0.2207 0.2100 0.1700 0.1575

Cumulative Proportion 0.2418 0.4626 0.6725 0.8425 1.0000

> plot(pca$x, pch=20, col="blue", type="n") # To plot dots, drop type="n"

> text(pca$x, rownames(pca$x), cex=0.8)

[Figure: scatter plot of PC1 (x-axis) versus PC2 (y-axis) with the items labeled g1 to g100]


Multidimensional Scaling (MDS)

Alternative dimensionality reduction approach

Represents distances in 2D or 3D space

Starts from distance matrix (PCA uses data points)
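The relationship between the two methods can be illustrated with a small sketch: when classical MDS, as implemented in cmdscale(), is given Euclidean distances, it recovers the same coordinates as PCA on the centered data, up to the sign of each axis. The random matrix x is an assumption made for this illustration.

```r
set.seed(1)
# Hypothetical example data: 20 items (rows) with 5 variables (columns)
x <- matrix(rnorm(100), 20, 5)

# MDS starts from the distance matrix
mds <- cmdscale(dist(x), k=2)

# PCA starts from the data points
pca <- prcomp(x)

# Both give the same two-dimensional coordinates, up to the sign of each axis
all.equal(abs(mds), abs(pca$x[, 1:2]), check.attributes=FALSE)
```

With other distance measures, such as correlation-based distances, the two methods generally give different layouts.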


Example: MDS with cmdscale

The following example performs MDS analysis on the geographic distances among European cities.

> loc <- cmdscale(eurodist)

> plot(loc[,1], -loc[,2], type="n", xlab="", ylab="", main="cmdscale(eurodist)")

> text(loc[,1], -loc[,2], rownames(loc), cex=0.8)

[Figure: cmdscale(eurodist), two-dimensional MDS map with the 21 European cities of the eurodist data set (Athens to Vienna) labeled by name]


Biclustering

Finds subgroups of rows and columns in a matrix that are as similar as possible to each other and as different as possible from the remaining data points.

[Figure: unclustered matrix ⇒ clustered matrix]


