ggplot2 for Epi Studies
Leah McGrath, PhD
November 13, 2017
Introduction
Know your data: data exploration is an important part of research
Data visualization is an excellent way to explore data
ggplot2 is an elegant R library that makes it easy to create compelling graphs
plots can be iteratively built up and easily modified
Learning objectives
To create graphs used in manuscripts for epidemiology studies
To review and incorporate previously learned aspects of formatting graphs
To demonstrate novel data visualizations using Shiny
ggplot architecture review
Aesthetics: specify the variables to display
  - what are x and y?
  - can also link variables to color, shape, size and transparency
"geoms": specify type of plot
  - do you want a scatter plot, line, bars, densities, or other type of plot?
Scales: for transforming variables (e.g., log, sq. root)
  - also used to set legend: title, breaks, labels
Facets: creating separate panels for different factors
Themes: adjust appearance: background, fonts, etc.
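The five components above can be seen together in one small plot. This is only a sketch: it uses the built-in mtcars data as a stand-in for the anemia data, and the variable choices and labels are arbitrary.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the anemia data used in these slides
p <- ggplot(data = mtcars,
            aes(x = wt, y = mpg, color = factor(cyl))) +  # aesthetics
  geom_point() +                                           # geom: scatter plot
  scale_y_log10(name = "Miles per gallon (log scale)") +   # scale: transform y
  scale_color_discrete(name = "Cylinders") +               # scale: legend title
  facet_wrap(~am) +                                        # facets: one panel per level
  theme_bw()                                               # theme: appearance

print(p)
```

Each component is added with +, so any one of them can be dropped or swapped without touching the rest.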
Hemoglobin data
Data from the National Health and Nutrition Examination Survey (NHANES), 1999-2000
Contains data on n=3,990 participants
The file was created by merging the demographic data file with the complete blood count file and the nutritional biochemistry lab file
Contains measures of hemoglobin, iron status, and other anemia-related parameters
Anemia data codebook
age = age of participant (years)
sex = sex of participant (Male vs Female)
tsat = transferrin saturation (%)
iron = total serum iron (ug/dL)
hgb = hemoglobin concentration (g/dL)
ferr = serum ferritin (ng/mL)
folate = serum folate (ng/mL)
race = participant race (Hispanic, White, Black, Other)
rdw = red cell distribution width (%)
wbc = white blood cell count (SI)
anemia = indicator variable for anemia (according to WHO definition)
Scatter plot review: hemoglobin by age, stratified by ethnicity and sex
ggplot(data=anemia, aes(x=age, y=hgb, color=sex)) +
  geom_smooth() +
  geom_jitter(aes(size=1/iron), alpha=0.1) +
  xlab("Age") + ylab("Hemoglobin (g/dl)") +
  scale_size(name = "Iron Deficiency") +
  scale_color_discrete(name = "Sex") +
  facet_wrap(~race) +
  theme_bw()
Box plots
ggplot(data=anemia, aes(x=race,y=hgb)) + geom_boxplot()
Box plots with points
ggplot(data=anemia, aes(x=race, y=hgb, color=sex)) +
  geom_boxplot() +
  geom_jitter(alpha=0.1)
Box plots with coordinates flipped
ggplot(data=anemia, aes(x=race, y=hgb, color=sex)) +
  geom_boxplot() +
  geom_jitter(alpha=0.1) +
  coord_flip()
Violin plots
Kernel density estimates are placed on each side and mirrored to form a symmetrical shape
Easy to compare several distributions
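To make the mirroring idea concrete, the outline of a single violin can be built by hand from a kernel density estimate. This is a base R illustration on simulated values, not the exact computation geom_violin() performs; the rescaling to a half-width of 0.5 is an arbitrary choice.

```r
# Simulated values stand in for one group's hemoglobin measurements
set.seed(1)
x <- rnorm(200, mean = 14, sd = 1.5)

# Kernel density estimate on a grid of values
d <- density(x)

# Mirror the density around zero to get the two sides of the violin
half_width <- d$y / max(d$y) / 2   # rescale so the widest point spans -0.5..0.5
violin <- data.frame(
  value = c(d$x, rev(d$x)),
  width = c(-half_width, rev(half_width))
)

# Draw the outline with base graphics
plot(violin$width, violin$value, type = "n",
     xlab = "", ylab = "Value", xaxt = "n")
polygon(violin$width, violin$value, col = "grey80")
```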
Violin plots
ggplot(data=anemia, aes(x=race,y=hgb,color=race)) + geom_violin()
Violin plots with underlying data points
ggplot(data=anemia, aes(x=race, y=hgb, color=race)) +
  geom_violin() +
  geom_jitter(alpha=0.1)
Violin plots stratified by 2 variables
ggplot(data=anemia, aes(x=sex,y=hgb,color=race)) + geom_violin()
Violin plots & boxplot with no outliers
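ggplot(data=anemia, aes(x=race, y=hgb, color=race)) +
  geom_violin() +
  # outlier.color=NA hides the boxplot's outlier points
  geom_boxplot(width=.1, fill="black", outlier.color=NA) +
  # mark the median; the argument was named fun.y before ggplot2 3.3.0
  stat_summary(fun=median, geom="point", fill="white", shape=21, size=2.5)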
Practice
Use the anemia dataset to practice making scatterplots, boxplots, and violin plots
Try faceting, flipping orientation, changing colors and labels
str(anemia)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3990 obs. of 13 variables:
##  $ age   : num 77 49 59 43 37 70 81 38 85 23 ...
##  $ sex   : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 2 2 2 ...
##  $ tsat  : num 16.3 41.5 27.6 28 19.7 18.5 16.9 27.1 13.4 35.8 ...
##  $ iron  : num 65 141 96 83 64 75 65 97 38 136 ...
##  $ hgb   : num 14.1 14.5 13.4 15.4 16 16.8 16.6 13.3 10.9 14.5 ...
##  $ ferr  : num 55 198 155 32 68 87 333 33 166 48 ...
##  $ folate: num 24.6 17.1 12.2 13.5 23 46.9 14.6 6.1 30.3 19.9 ...
##  $ vite  : num 1488 1897 1311 528 3092 ...
##  $ vita  : num 74.9 84.6 54 41.9 72.5 ...
##  $ race  : Factor w/ 4 levels "Hispanic","White",..: 2 2 3 3 2 1 2 2 3 1 ...
##  $ rdw   : num 13.7 13.1 14.3 13.7 13.6 14.4 12.4 11.9 14.1 11.4 ...
##  $ wbc   : num 7.6 5.9 4.9 4.6 10.2 11.6 9.1 7.6 7.4 5.6 ...
##  $ anemia: num 0 0 0 0 0 0 0 0 1 0 ...
##  - attr(*, "na.action")=Class 'omit' Named int [1:805] 26 28 32 33 36 37 38 39 45 54 ...
##  .. ..- attr(*, "names")= chr [1:805] "26" "28" "32" "33" ...
Forest plots
First gather the data into the proper format, including the following variables:
  - Estimate
  - Lower CI
  - Upper CI
  - Grouping variable
Forest plots
For this example, we take the mean and calculate the upper and lower confidence interval for hemoglobin.
We will stack the row observations into one variable called "Type".
library(dplyr)

# Mean hemoglobin with a normal-approximation 95% CI, by sex
anemia1 <- anemia %>%
  group_by(Type = sex) %>%
  summarise(mean = mean(hgb), n = n(),
            lower = mean - 1.96 * sd(hgb) / sqrt(n),
            upper = mean + 1.96 * sd(hgb) / sqrt(n))

# Same summary by race
anemia2 <- anemia %>%
  group_by(Type = race) %>%
  summarise(mean = mean(hgb), n = n(),
            lower = mean - 1.96 * sd(hgb) / sqrt(n),
            upper = mean + 1.96 * sd(hgb) / sqrt(n))

# Stack the two summaries into one data frame
anemia3 <- rbind(anemia1, anemia2)
Forest plots
ggplot(data=anemia3, aes(x=Type, y=mean, ymin=lower, ymax=upper)) + geom_pointrange()
Forest plots: flip the axes, add labels
ggplot(data=anemia3, aes(x=Type, y=mean, ymin=lower, ymax=upper)) +
  geom_pointrange(shape=20) +
  coord_flip() +
  xlab("Demographics") +
  ylab("Mean Hemoglobin (95% CI)") +
  theme_bw()
Forest plots: calculating mean and CI within ggplot
ggplot can calculate the mean and CI using stat_summary
Further data manipulation would be needed to stack multiple variables
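One detail worth checking when switching between hand-calculated intervals and stat_summary: as far as I know, mean_cl_normal wraps Hmisc::smean.cl.normal, which uses a t quantile rather than the fixed 1.96 multiplier. The base R sketch below, on simulated values, shows how the two intervals compare for a small group; the sample size and parameters are arbitrary.

```r
# Simulated values stand in for hemoglobin in one small group
set.seed(42)
hgb <- rnorm(25, mean = 14, sd = 1.5)

n  <- length(hgb)
m  <- mean(hgb)
se <- sd(hgb) / sqrt(n)

# Normal approximation with the fixed 1.96 multiplier
ci_normal <- c(lower = m - 1.96 * se, upper = m + 1.96 * se)

# t-based interval with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)
ci_t <- c(lower = m - t_crit * se, upper = m + t_crit * se)

rbind(normal = ci_normal, t = ci_t)
```

With several thousand observations per group the two intervals are practically identical; the difference only matters for small strata.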
Calculating mean and CI within ggplot
# mean_cl_normal requires the Hmisc package to be installed
ggplot(anemia, aes(x=race, y=hgb)) +
  stat_summary(fun.data=mean_cl_normal) +
  coord_flip() +
  theme_bw() +
  xlab("Demographics") +
  ylab("Mean Hemoglobin (95% CI)")
Forest plots: adding faceting
# any.fit3 is not defined in these slides; it is assumed to hold one row per
# estimate with columns V3, A1, lower, upper, and setting
ggplot(any.fit3, aes(x=V3, y=A1, ymin=lower, ymax=upper)) +
  geom_pointrange(shape=20) +
  coord_flip() +
  xlab("Predictor Variable") +
  ylab("Adjusted Risk Difference per 100 (95% CI)") +
  scale_y_continuous(breaks=c(-20,-15,-10,-5,0,5,10,15,20,25), limits = c(-21,26)) +
  theme_bw() +
  geom_hline(yintercept=0, lty=2) +
  facet_grid(setting~., scales='free', space='free')
Practice
Use the anemia dataset to practice making forest plots using other continuous variables
Use dplyr to create a new, categorized age variable (hint: factor this before graphing). Create a forest plot of mean hemoglobin by age category.
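As a starting point for the age exercise, one possible way to categorize a numeric variable is base R's cut(), which returns a factor directly. The ages below are toy values and the cut points and labels are arbitrary choices for illustration; inside a dplyr pipeline the same call would go in mutate().

```r
# Toy ages; in the exercise this would be the age column of the anemia data
age <- c(23, 37, 38, 43, 49, 59, 70, 77, 81, 85)

# Cut points and labels are arbitrary choices for illustration
age_cat <- cut(age,
               breaks = c(0, 39, 59, 79, Inf),
               labels = c("<40", "40-59", "60-79", "80+"),
               right = TRUE)

# Count how many ages fall in each category
table(age_cat)
```

Because cut() keeps the levels in the order of the breaks, the categories appear in age order on the plot axis rather than alphabetically.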
Kaplan-Meier plots - WIHS data
The Women’s Interagency HIV Study (WIHS) is an ongoing observational cohort study with semiannual visits at 10 sites in the US
Data on 1,164 patients who were HIV-positive, free of clinical AIDS, and not on antiretroviral therapy (ART) at study baseline (Dec. 6, 1995)
Contains information on age, race, CD4 count, drug use, ARV treatment, and time to AIDS/death
Kaplan-Meier plots
MANY package options to plot survival functions
All use the survival package to calculate survival over time
  - survfit (survival) + survplot (rms)
  - ggkm (sachsmc/ggkm) & ggplot2
  - ggkm (michaelway/ggkm)
Allows for multiple treatments and subgroups
Does not take into account competing risks
Kaplan-Meier example 1
Calculate KM within ggplot
https://github.com/sachsmc/ggkm
Prep data
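library(dplyr)

# Event indicator: 1 if art is recorded, 0 otherwise
wihs$outcome <- ifelse(is.na(wihs$art), 0, 1)
# Follow-up time: event time if available, otherwise dropout time
wihs$time <- ifelse(is.na(wihs$aids_death_art), wihs$dropout, wihs$aids_death_art)
# Anyone still without a time is followed to the end of the study
wihs <- wihs %>% mutate(time = ifelse(is.na(time), study_end, time))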
KM plot within ggplot2
devtools::install_github("sachsmc/ggkm")
library(ggkm)
ggplot(wihs, aes(time = time, status = outcome)) +
  geom_km()
KM by treatment group
ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) + geom_km()
Add confidence bands
ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) +
  geom_km() +
  geom_kmband()
KM example #2
Calculated using survival package
Plots KM curve with numbers at risk
Same package name as previous example!
https://github.com/michaelway/ggkm
# Remove the sachsmc version first, since both packages are named ggkm
remove.packages("ggkm")
devtools::install_github("michaelway/ggkm")
library(ggkm)
KM example 2
library(survival)
fit <- survfit(Surv(time, outcome) ~ idu, data=wihs)
ggkm(fit)
KM with numbers at risk
ggkm(fit, table=TRUE, marks = FALSE,
     ystratalabs = c("No IDU", "History of IDU"))
Cumulative incidence plots
1 - survival probability
ipwrisk package - coming soon!
  - calculates adjusted cumulative incidence curves using IPTW
  - addresses censoring (IPCW) and competing risks
  - produces tables and graphics
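The "1 - survival probability" relationship can be sketched with the survival package alone. The example below uses the lung data that ships with survival as a stand-in for the WIHS data, and it is an unadjusted curve that ignores competing risks, so it is not a substitute for the weighted approach described above.

```r
library(survival)

# lung ships with the survival package; status == 2 marks a death
fit <- survfit(Surv(time, status == 2) ~ 1, data = lung)

# Cumulative incidence as the complement of the Kaplan-Meier estimate
cum_inc <- data.frame(time = fit$time,
                      cum_inc = 1 - fit$surv)

# Step plot, since the estimate only changes at event times
plot(cum_inc$time, cum_inc$cum_inc, type = "s",
     xlab = "Days", ylab = "Cumulative incidence")
```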
Sankey diagram
Visualization that shows the flow of patients between states (over time)
States, or nodes, can be treatments, comorbidities, hospitalizations, etc.
Paths connecting states are called links; the thickness of each link is proportional to the flow it carries
Example: https://vizhub.healthdata.org/dex/
Basic sankey diagrams in R
library(networkD3)
library(reshape2)
library(magrittr)

# One row per state
nodes <- data.frame(name=c("Renal Failure", "Hemodialysis at 6m", "Transplant at 6m",
                           "Death by 6m", "Hemodialysis at 12m", "Transplant at 12m",
                           "Death by 12m"))

# One row per flow; source and target are zero-based row indices into nodes
links <- data.frame(source=c(0,0,0,1,1,1,2,2,2,3),
                    target=c(1,2,3,4,5,6,4,5,6,6),
                    value=c(70,20,10,40,20,10,15,4,1,10))

sankeyNetwork(Links = links, Nodes = nodes, Source = "source", Target = "target",
              Value = "value", NodeID = "name",
              fontSize = 22, nodeWidth = 30, nodePadding = 5)
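Typing the index vectors by hand gets error-prone as the number of states grows, because sankeyNetwork expects zero-based positions in the nodes table. One possible approach is to write the transitions with state names and derive the indices with match(). The transitions data frame and its counts below are made up for illustration.

```r
# Transitions written with state names; counts are made up for illustration
transitions <- data.frame(
  from  = c("Renal Failure", "Renal Failure", "Hemodialysis at 6m"),
  to    = c("Hemodialysis at 6m", "Transplant at 6m", "Death by 12m"),
  value = c(70, 20, 10),
  stringsAsFactors = FALSE
)

# One row per distinct state, in order of first appearance
nodes <- data.frame(name = unique(c(transitions$from, transitions$to)),
                    stringsAsFactors = FALSE)

# match() is one-based, so subtract 1 for the zero-based index
links <- data.frame(
  source = match(transitions$from, nodes$name) - 1,
  target = match(transitions$to, nodes$name) - 1,
  value  = transitions$value
)
links
```

The resulting nodes and links can be passed to sankeyNetwork with the same Source, Target, Value, and NodeID arguments as in the slide's example.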
Final Tips
Spend time planning your graph
Make sure to have the data in the correct structure before you start graphing
Start with a simple graph, gradually build in complexity
Further reading
ggplot2: http://docs.ggplot2.org/current/
Cookbook for R: http://www.cookbook-r.com/Graphs/
Quick-R: http://www.statmethods.net/index.html
Wrap-up
Questions?
Acknowledgements: Alan Brookhart, Sara Levintow
Contact info: [email protected]