ggplot2 for Epi Studies - University of North Carolina at

ggplot2 for Epi Studies

Leah McGrath, PhD

November 13, 2017

Introduction

Know your data: data exploration is an important part of research

Data visualization is an excellent way to explore data

ggplot2 is an elegant R library that makes it easy to create compelling graphs

plots can be iteratively built up and easily modified

Learning objectives

To create graphs used in manuscripts for epidemiology studies

To review and incorporate previously learned aspects of formatting graphs

To demonstrate novel data visualizations using Shiny

ggplot architecture review

Aesthetics: specify the variables to display

what are x and y?

can also link variables to color, shape, size, and transparency

“geoms”: specify the type of plot

do you want a scatter plot, line, bars, densities, or another type of plot?

Scales: for transforming variables (e.g., log, square root)

also used to set the legend: title, breaks, labels

Facets: create separate panels for different factors

Themes: adjust appearance (background, fonts, etc.)
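As a minimal sketch of how these five components combine, the following builds one plot layer by layer. It uses R's built-in mtcars data rather than the anemia data; that substitution, and the choice of variables, are assumptions made only for illustration.

```r
library(ggplot2)

# mtcars ships with R; it stands in for study data here (an assumption)
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +  # aesthetics
  geom_point() +                                                  # geom: scatter plot
  scale_y_log10(name = "Miles per gallon (log scale)") +          # scale: transform y, set axis title
  scale_color_discrete(name = "Cylinders") +                      # scale: set legend title
  facet_wrap(~ am) +                                              # facets: one panel per level of am
  theme_bw()                                                      # theme: overall appearance

print(p)
```

Because each component is added with +, any one of them can be swapped or dropped without touching the others.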

Hemoglobin data

Data from the National Health and Nutrition Examination Survey (NHANES) dataset, 1999-2000

containing data about n=3,990 patients

The file was created by merging the demographic data with the complete blood count file and the nutritional biochemistry lab file.

Contains measures of hemoglobin, iron status, and other anemia-related parameters

Anemia data codebook

age = age of participant (years)

sex = sex of participant (Male vs Female)

tsat = transferrin saturation (%)

iron = total serum iron (ug/dL)

hgb = hemoglobin concentration (g/dL)

ferr = serum ferritin (ng/mL)

folate = serum folate (ng/mL)

race = participant race (Hispanic, White, Black, Other)

rdw = red cell distribution width (%)

wbc = white blood cell count (SI)

anemia = indicator variable for anemia (according to WHO definition)

Scatter plot review: hemoglobin by age, stratified by ethnicity and sex

ggplot(data=anemia, aes(x=age, y=hgb, color=sex)) +
  geom_smooth() +
  geom_jitter(aes(size=1/iron), alpha=0.1) +
  xlab("Age") + ylab("Hemoglobin (g/dl)") +
  scale_size(name = "Iron Deficiency") +
  scale_color_discrete(name = "Sex") +
  facet_wrap(~race) +
  theme_bw()

Box plots

ggplot(data=anemia, aes(x=race,y=hgb)) + geom_boxplot()

Box plots with points

ggplot(data=anemia, aes(x=race,y=hgb,color=sex)) + geom_boxplot()+ geom_jitter(alpha=0.1)

Box plots with coordinates flipped

ggplot(data=anemia, aes(x=race,y=hgb,color=sex)) + geom_boxplot()+ geom_jitter(alpha=0.1) + coord_flip()

Violin plots

Kernel density estimates are placed on each side and mirrored to form a symmetrical shape

Easy to compare several distributions
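To illustrate the mirroring idea, here is a rough base-R sketch that computes a kernel density estimate and reflects it around a central axis. The simulated values are an assumption standing in for one group's hemoglobin measurements, and geom_violin() applies its own trimming and scaling defaults, so this is a conceptual picture rather than a reproduction of its output.

```r
# Simulated values stand in for one group's measurements (an assumption)
set.seed(1)
x <- rnorm(200, mean = 14, sd = 1.5)

d <- density(x)   # kernel density estimate from the stats package

# Mirror the density around a vertical axis at 0 to trace the violin outline
outline <- data.frame(
  width = c(d$y, -rev(d$y)),   # right-hand side, then left-hand side walked back
  value = c(d$x, rev(d$x))
)

plot(outline$width, outline$value, type = "n",
     xlab = "Density (mirrored)", ylab = "Value")
polygon(outline$width, outline$value, col = "grey80")
```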

Violin plots

ggplot(data=anemia, aes(x=race,y=hgb,color=race)) + geom_violin()

Violin plots with underlying data points

ggplot(data=anemia, aes(x=race,y=hgb,color=race)) + geom_violin()+ geom_jitter(alpha=0.1)

Violin plots stratified by 2 variables

ggplot(data=anemia, aes(x=sex,y=hgb,color=race)) + geom_violin()

Violin plots & boxplot with no outliers

ggplot(data=anemia, aes(x=race, y=hgb, color=race)) +
  geom_violin() +
  geom_boxplot(width=.1, fill="black", outlier.color=NA) +
  # fun.y was renamed to fun in ggplot2 3.3.0; use fun.y on older versions
  stat_summary(fun=median, geom="point", fill="white", shape=21, size=2.5)

Practice

Use the anemia dataset to practice making scatterplots, boxplots, and violin plots

Try faceting, flipping orientation, changing colors and labels

str(anemia)

## Classes 'tbl_df', 'tbl' and 'data.frame': 3990 obs. of 13 variables:
## $ age   : num 77 49 59 43 37 70 81 38 85 23 ...
## $ sex   : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 2 2 2 ...
## $ tsat  : num 16.3 41.5 27.6 28 19.7 18.5 16.9 27.1 13.4 35.8 ...
## $ iron  : num 65 141 96 83 64 75 65 97 38 136 ...
## $ hgb   : num 14.1 14.5 13.4 15.4 16 16.8 16.6 13.3 10.9 14.5 ...
## $ ferr  : num 55 198 155 32 68 87 333 33 166 48 ...
## $ folate: num 24.6 17.1 12.2 13.5 23 46.9 14.6 6.1 30.3 19.9 ...
## $ vite  : num 1488 1897 1311 528 3092 ...
## $ vita  : num 74.9 84.6 54 41.9 72.5 ...
## $ race  : Factor w/ 4 levels "Hispanic","White",..: 2 2 3 3 2 1 2 2 3 1 ...
## $ rdw   : num 13.7 13.1 14.3 13.7 13.6 14.4 12.4 11.9 14.1 11.4 ...
## $ wbc   : num 7.6 5.9 4.9 4.6 10.2 11.6 9.1 7.6 7.4 5.6 ...
## $ anemia: num 0 0 0 0 0 0 0 0 1 0 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:805] 26 28 32 33 36 37 38 39 45 54 ...
## .. ..- attr(*, "names")= chr [1:805] "26" "28" "32" "33" ...

Forest plots

First gather the data into the proper format, including the following variables:

Estimate

Lower CI

Upper CI

Grouping variable

Forest plots

For this example, we take the mean and calculate the upper and lower confidence interval for hemoglobin.

We will stack the row observations into one variable called "Type".

# funs() is deprecated in current dplyr; summarise() with named
# expressions produces the same columns (mean, n, lower, upper)
anemia1 <- anemia %>%
  group_by(sex) %>%
  summarise(mean = mean(hgb),
            n = n(),
            lower = mean - 1.96 * sd(hgb) / sqrt(n),
            upper = mean + 1.96 * sd(hgb) / sqrt(n))
colnames(anemia1)[1] <- "Type"

anemia2 <- anemia %>%
  group_by(race) %>%
  summarise(mean = mean(hgb),
            n = n(),
            lower = mean - 1.96 * sd(hgb) / sqrt(n),
            upper = mean + 1.96 * sd(hgb) / sqrt(n))
colnames(anemia2)[1] <- "Type"

# Stack the sex and race summaries into one data frame
anemia3 <- rbind(anemia1, anemia2)

Forest plots

ggplot(data=anemia3, aes(x=Type, y=mean, ymin=lower, ymax=upper)) + geom_pointrange()

Forest plots: flip the axes, add labels

ggplot(data=anemia3, aes(x=Type, y=mean, ymin=lower, ymax=upper)) +
  geom_pointrange(shape=20) +
  coord_flip() +
  xlab("Demographics") +
  ylab("Mean Hemoglobin (95% CI)") +
  theme_bw()

Forest plots: calculating mean and CI within ggplot

ggplot can calculate the mean and CI using stat_summary

Further data manipulation would be needed to stack multiple variables

Calculating mean and CI within ggplot

# mean_cl_normal requires the Hmisc package to be installed
ggplot(anemia, aes(x=race, y=hgb)) +
  stat_summary(fun.data=mean_cl_normal) +
  coord_flip() +
  theme_bw() +
  xlab("Demographics") +
  ylab("Mean Hemoglobin (95% CI)")

Forest plots: adding faceting

# any.fit3 is a data frame of model estimates that is not defined in these slides
ggplot(any.fit3, aes(x=V3, y=A1, ymin=lower, ymax=upper)) +
  geom_pointrange(shape=20) +
  coord_flip() +
  xlab("Predictor Variable") +
  ylab("Adjusted Risk Difference per 100 (95% CI)") +
  scale_y_continuous(breaks=c(-20,-15,-10,-5,0,5,10,15,20,25), limits = c(-21,26)) +
  theme_bw() +
  geom_hline(yintercept=0, lty=2) +
  facet_grid(setting~., scales='free', space='free')

Practice

Use the anemia dataset to practice making forest plots using other continuous variables

Use dplyr to create a new, categorized age variable (hint: factor this before graphing). Create a forest plot of mean hemoglobin by age category.
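One possible approach to the second exercise is sketched below. The simulated data frame toy, the cut points, and the category labels are all hypothetical choices made for illustration; with the real anemia data you would choose age categories that suit the study question.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the anemia data (an assumption): only age and hgb are needed
set.seed(42)
toy <- data.frame(age = runif(300, 20, 85),
                  hgb = rnorm(300, mean = 14, sd = 1.5))

# Hypothetical cut points; cut() returns a factor, so categories keep their order
age_summary <- toy %>%
  mutate(age_cat = cut(age, breaks = c(20, 40, 60, 85),
                       labels = c("20-39", "40-59", "60-85"),
                       include.lowest = TRUE, right = FALSE)) %>%
  group_by(age_cat) %>%
  summarise(mean = mean(hgb),
            n = n(),
            lower = mean - 1.96 * sd(hgb) / sqrt(n),
            upper = mean + 1.96 * sd(hgb) / sqrt(n))

# Forest plot of mean hemoglobin with a normal-approximation 95% CI per age category
ggplot(age_summary, aes(x = age_cat, y = mean, ymin = lower, ymax = upper)) +
  geom_pointrange(shape = 20) +
  coord_flip() +
  xlab("Age category (years)") +
  ylab("Mean Hemoglobin (95% CI)") +
  theme_bw()
```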

Kaplan-Meier plots - WIHS data

Women’s Interagency HIV Study (WIHS) is an ongoing observational cohort study with semiannual visits at 10 sites in the US

Data on 1,164 patients who were HIV-positive, free of clinical AIDS, and not on antiretroviral therapy (ART) at study baseline (Dec. 6, 1995)

Contains information on age, race, CD4 count, drug use, ARV treatment, and time to AIDS/death

Kaplan-Meier plots

MANY package options to plot survival functions

All use the survival package to calculate survival over time

Allows for multiple treatments and subgroups

Does not take into account competing risks

survfit (survival) + survplot (rms)

ggkm (sachsmc/ggkm) & ggplot2

ggkm (michaelway/ggkm)
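Since all of these options rest on the survival package, a minimal sketch of the underlying calculation may help. It uses the lung dataset bundled with survival as a stand-in for the WIHS data (an assumption), and base graphics rather than any of the plotting packages listed above.

```r
library(survival)

# lung is an example dataset bundled with the survival package;
# it stands in for the WIHS data here (an assumption)
fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Kaplan-Meier estimates at roughly 6 and 12 months, by group
summary(fit, times = c(180, 365))

# Basic step-function plot of the two survival curves
plot(fit, lty = 1:2, xlab = "Days", ylab = "Survival probability")
```

The resulting survfit object is what the plotting functions build on; the choice among them mostly affects appearance and extras such as numbers-at-risk tables.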

Kaplan-Meier example 1

Calculate KM within ggplot

https://github.com/sachsmc/ggkm

Prep data

wihs$outcome <- ifelse(is.na(wihs$art), 0, 1)
wihs$time <- ifelse(is.na(wihs$aids_death_art), wihs$dropout, wihs$aids_death_art)
wihs <- wihs %>% mutate(time = ifelse(is.na(time), study_end, time))

KM plot within ggplot2

devtools::install_github("sachsmc/ggkm")
library(ggkm)
ggplot(wihs, aes(time = time, status = outcome)) + geom_km()

KM by treatment group

ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) + geom_km()

Add confidence bands

ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) + geom_km() + geom_kmband()

KM example #2

Calculated using survival package

Plots KM curve with numbers at risk

Same package name as previous example!

https://github.com/michaelway/ggkm

remove.packages("ggkm")
devtools::install_github("michaelway/ggkm")
library(ggkm)

KM example 2

library(survival)  # provides Surv() and survfit()
fit <- survfit(Surv(time, outcome) ~ idu, data=wihs)
ggkm(fit)

KM with numbers at risk

ggkm(fit, table=TRUE, marks = FALSE, ystratalabs = c("No IDU", "History of IDU"))

Cumulative incidence plots

1 - survival probability

ipwrisk package - coming soon!

calculates adjusted cumulative incidence curves using IPTW

addresses censoring (IPCW) and competing risks

produces tables and graphics
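The "1 - survival probability" definition can be sketched directly from a Kaplan-Meier fit. This example uses the lung dataset bundled with the survival package (an assumption, standing in for study data) and is a crude, unadjusted curve: it does not involve ipwrisk and does not handle confounding or competing risks.

```r
library(survival)

# lung is bundled with the survival package and stands in for study data (an assumption)
fit <- survfit(Surv(time, status) ~ 1, data = lung)

# Crude cumulative incidence is the complement of the Kaplan-Meier survival estimate
cum_inc <- 1 - fit$surv

plot(fit$time, cum_inc, type = "s",
     xlab = "Days", ylab = "Cumulative incidence")
```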

Sankey diagram

Visualization that shows the flow of patients between states (over time)

States, or nodes, can be treatments, comorbidities, hospitalizations, etc.

Paths connecting states are called links; the thickness of each line corresponds to the proportion of patients

Example: https://vizhub.healthdata.org/dex/

Basic sankey diagrams in R

library(networkD3)
library(reshape2)
library(magrittr)

# Node names; links refer to them by zero-based position
nodes <- data.frame(name=c("Renal Failure", "Hemodialysis at 6m", "Transplant at 6m",
                           "Death by 6m", "Hemodialysis at 12m", "Transplant at 12m",
                           "Death by 12m"))

# Each link goes from a source node to a target node with a given value
links <- data.frame(source=c(0,0,0,1,1,1,2,2,2,3),
                    target=c(1,2,3,4,5,6,4,5,6,6),
                    value=c(70,20,10,40,20,10,15,4,1,10))

sankeyNetwork(Links = links, Nodes = nodes, Source = "source", Target = "target",
              Value = "value", NodeID = "name", fontSize = 22,
              nodeWidth = 30, nodePadding = 5)

Final Tips

Spend time planning your graph

Make sure to have the data in the correct structure before you start graphing

Start with a simple graph, gradually build in complexity
