ggplot2 for Epi Studies - University of North Carolina at

ggplot2 for Epi Studies

Leah McGrath, PhD

November 13, 2017

Introduction

· Know your data: data exploration is an important part of research
· Data visualization is an excellent way to explore data
· ggplot2 is an elegant R library that makes it easy to create compelling graphs
· Plots can be built up iteratively and easily modified

Learning objectives

· To create graphs used in manuscripts for epidemiology studies
· To review and incorporate previously learned aspects of formatting graphs
· To demonstrate novel data visualizations using Shiny

ggplot architecture review

· Aesthetics: specify the variables to display
  - what are x and y?
  - can also link variables to color, shape, size and transparency
· "geoms": specify the type of plot
  - do you want a scatter plot, line, bars, densities, or another type of plot?
· Scales: for transforming variables (e.g., log, sq. root)
  - also used to set the legend: title, breaks, labels
· Facets: create separate panels for different factors
· Themes: adjust appearance (background, fonts, etc.)
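As a rough illustration of how the five components above combine in one call, here is a minimal sketch. It uses the built-in mtcars data as a stand-in (an assumption for illustration; it is not one of the datasets in these slides), and the axis and legend titles are made up.

```r
library(ggplot2)

# Illustrative only: mtcars ships with R and stands in for a study dataset.
p <- ggplot(data = mtcars,
            aes(x = wt, y = mpg, color = factor(cyl))) +  # aesthetics: x, y, color
  geom_point() +                                          # geom: scatter plot
  scale_y_log10(name = "Miles per gallon (log scale)") +  # scale: transform the y axis
  scale_color_discrete(name = "Cylinders") +              # scale: set the legend title
  facet_wrap(~ am) +                                      # facets: one panel per level of am
  theme_bw()                                              # theme: overall appearance

print(p)
```

Each `+` adds one layer or setting, so any line can be dropped or swapped without rewriting the rest of the plot.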

Hemoglobin data

· Data from the National Health and Nutrition Examination Survey (NHANES) dataset, 1999-2000
· Contains data on n=3,990 patients
· The file was created by merging the demographic data with the complete blood count file and the nutritional biochemistry lab file
· Contains measures of hemoglobin, iron status, and other anemia-related parameters

Anemia data codebook

· age = age of participant (years)
· sex = sex of participant (Male vs Female)
· tsat = transferrin saturation (%)
· iron = total serum iron (ug/dL)
· hgb = hemoglobin concentration (g/dL)
· ferr = serum ferritin (ng/mL)
· folate = serum folate (ng/mL)
· race = participant race (Hispanic, White, Black, Other)
· rdw = red cell distribution width (%)
· wbc = white blood cell count (SI)
· anemia = indicator variable for anemia (according to WHO definition)

Scatter plot review: hemoglobin by age, stratified by ethnicity and sex

ggplot(data=anemia, aes(x=age, y=hgb, color=sex)) +
  geom_smooth() +
  geom_jitter(aes(size=1/iron), alpha=0.1) +
  xlab("Age") +
  ylab("Hemoglobin (g/dl)") +
  scale_size(name = "Iron Deficiency") +
  scale_color_discrete(name = "Sex") +
  facet_wrap(~race) +
  theme_bw()

Box plots

ggplot(data=anemia, aes(x=race, y=hgb)) +
  geom_boxplot()

Box plots with points

ggplot(data=anemia, aes(x=race, y=hgb, color=sex)) +
  geom_boxplot() +
  geom_jitter(alpha=0.1)

Box plots with coordinates flipped

ggplot(data=anemia, aes(x=race, y=hgb, color=sex)) +
  geom_boxplot() +
  geom_jitter(alpha=0.1) +
  coord_flip()

Violin plots

· Kernel density estimates are placed on each side and mirrored to form a symmetrical shape
· Easy to compare several distributions

Violin plots

ggplot(data=anemia, aes(x=race, y=hgb, color=race)) +
  geom_violin()

Violin plots with underlying data points

ggplot(data=anemia, aes(x=race, y=hgb, color=race)) +
  geom_violin() +
  geom_jitter(alpha=0.1)

Violin plots stratified by 2 variables

ggplot(data=anemia, aes(x=sex, y=hgb, color=race)) +
  geom_violin()

Violin plots & boxplot with no outliers

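ggplot(data=anemia, aes(x=race, y=hgb, color=race)) +
  geom_violin() +
  # outlier.color=NA hides the boxplot outliers
  geom_boxplot(width=.1, fill="black", outlier.color=NA) +
  # fun= replaces fun.y=, which is deprecated as of ggplot2 3.3.0
  stat_summary(fun=median, geom="point", fill="white", shape=21, size=2.5)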

Practice

· Use the anemia dataset to practice making scatterplots, boxplots, and violin plots
· Try faceting, flipping orientation, changing colors and labels

str(anemia)

## Classes 'tbl_df', 'tbl' and 'data.frame': 3990 obs. of 13 variables:
##  $ age   : num 77 49 59 43 37 70 81 38 85 23 ...
##  $ sex   : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 2 2 2 ...
##  $ tsat  : num 16.3 41.5 27.6 28 19.7 18.5 16.9 27.1 13.4 35.8 ...
##  $ iron  : num 65 141 96 83 64 75 65 97 38 136 ...
##  $ hgb   : num 14.1 14.5 13.4 15.4 16 16.8 16.6 13.3 10.9 14.5 ...
##  $ ferr  : num 55 198 155 32 68 87 333 33 166 48 ...
##  $ folate: num 24.6 17.1 12.2 13.5 23 46.9 14.6 6.1 30.3 19.9 ...
##  $ vite  : num 1488 1897 1311 528 3092 ...
##  $ vita  : num 74.9 84.6 54 41.9 72.5 ...
##  $ race  : Factor w/ 4 levels "Hispanic","White",..: 2 2 3 3 2 1 2 2 3 1 ...
##  $ rdw   : num 13.7 13.1 14.3 13.7 13.6 14.4 12.4 11.9 14.1 11.4 ...
##  $ wbc   : num 7.6 5.9 4.9 4.6 10.2 11.6 9.1 7.6 7.4 5.6 ...
##  $ anemia: num 0 0 0 0 0 0 0 0 1 0 ...
##  - attr(*, "na.action")=Class 'omit' Named int [1:805] 26 28 32 33 36 37 38 39 45 54 ...
##   .. ..- attr(*, "names")= chr [1:805] "26" "28" "32" "33" ...

Forest plots

· First gather the data into the proper format, including the following variables:
  - Estimate
  - Lower CI
  - Upper CI
  - Grouping variable

Forest plots

· For this example, we take the mean and calculate the upper and lower confidence limits for hemoglobin.
· We will stack the row observations into one variable called "Type".

library(dplyr)

# Mean hemoglobin with a normal-approximation 95% CI, by sex
anemia1 <- anemia %>%
  group_by(sex) %>%
  summarise(mean  = mean(hgb),
            n     = n(),
            lower = mean - 1.96 * sd(hgb) / sqrt(n),
            upper = mean + 1.96 * sd(hgb) / sqrt(n))
colnames(anemia1)[1] <- "Type"

# Same summary, by race
anemia2 <- anemia %>%
  group_by(race) %>%
  summarise(mean  = mean(hgb),
            n     = n(),
            lower = mean - 1.96 * sd(hgb) / sqrt(n),
            upper = mean + 1.96 * sd(hgb) / sqrt(n))
colnames(anemia2)[1] <- "Type"

# Stack the two summaries into one data frame
anemia3 <- rbind(anemia1, anemia2)

Forest plots

ggplot(data=anemia3, aes(x=Type, y=mean, ymin=lower, ymax=upper)) +
  geom_pointrange()

Forest plots: flip the axes, add labels

ggplot(data=anemia3, aes(x=Type, y=mean, ymin=lower, ymax=upper)) +
  geom_pointrange(shape=20) +
  coord_flip() +
  xlab("Demographics") +
  ylab("Mean Hemoglobin (95% CI)") +
  theme_bw()

Forest plots: calculating mean and CI within ggplot

· ggplot can calculate the mean and CI using stat_summary
· Further data manipulation would be needed to stack multiple variables

Calculating mean and CI within ggplot

# mean_cl_normal requires the Hmisc package to be installed
ggplot(anemia, aes(x=race, y=hgb)) +
  stat_summary(fun.data=mean_cl_normal) +
  coord_flip() +
  theme_bw() +
  xlab("Demographics") +
  ylab("Mean Hemoglobin (95% CI)")
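One possible way to do the extra manipulation needed to stack multiple variables is to reshape the data to long format before calling stat_summary. The sketch below assumes tidyr is available and uses simulated data in place of the anemia file; the object names toy and stacked are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Simulated stand-in for the anemia data (values are made up for illustration).
set.seed(1)
toy <- data.frame(
  sex  = factor(sample(c("Male", "Female"), 200, replace = TRUE)),
  race = factor(sample(c("Hispanic", "White", "Black", "Other"), 200, replace = TRUE)),
  hgb  = rnorm(200, mean = 14, sd = 1.5)
)

# Stack sex and race into a single "Type" column: each participant now
# contributes one row per stacked variable.
stacked <- toy %>%
  mutate(across(c(sex, race), as.character)) %>%
  pivot_longer(c(sex, race), names_to = "variable", values_to = "Type")

# stat_summary then computes the mean and CI for every level of Type.
p <- ggplot(stacked, aes(x = Type, y = hgb)) +
  stat_summary(fun.data = mean_cl_normal) +  # requires the Hmisc package
  coord_flip() +
  theme_bw() +
  xlab("Demographics") +
  ylab("Mean Hemoglobin (95% CI)")
```

Call print(p) to draw the plot. The "variable" column is kept so the panels could be separated later, for example with facet_grid(variable ~ ., scales = "free", space = "free").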

Forest plots: adding faceting

# any.fit3: data frame of adjusted estimates (not defined in these slides)
ggplot(any.fit3, aes(x=V3, y=A1, ymin=lower, ymax=upper)) +
  geom_pointrange(shape=20) +
  coord_flip() +
  xlab("Predictor Variable") +
  ylab("Adjusted Risk Difference per 100 (95% CI)") +
  scale_y_continuous(breaks=c(-20,-15,-10,-5,0,5,10,15,20,25), limits = c(-21,26)) +
  theme_bw() +
  geom_hline(yintercept=0, lty=2) +
  facet_grid(setting~., scales= 'free', space='free')

Practice

· Use the anemia dataset to practice making forest plots using other continuous variables
· Use dplyr to create a new, categorized age variable (hint: factor this before graphing). Create a forest plot of mean hemoglobin by age category.
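As a starting point for the second exercise, here is one possible sketch. It runs on simulated data rather than the anemia file, and the cut points and labels are arbitrary choices for illustration, not values from the slides.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the anemia data (values are made up for illustration).
set.seed(2)
toy <- data.frame(age = sample(18:90, 300, replace = TRUE),
                  hgb = rnorm(300, mean = 14, sd = 1.5))

# cut() returns a factor, so the categories keep their order on the axis.
by_age <- toy %>%
  mutate(age_cat = cut(age,
                       breaks = c(18, 40, 60, 80, Inf),
                       right  = FALSE,
                       labels = c("18-39", "40-59", "60-79", "80+"))) %>%
  group_by(age_cat) %>%
  summarise(mean_hgb = mean(hgb),
            n        = n(),
            lower    = mean_hgb - 1.96 * sd(hgb) / sqrt(n),
            upper    = mean_hgb + 1.96 * sd(hgb) / sqrt(n))

# Forest plot of mean hemoglobin by age category.
p <- ggplot(by_age, aes(x = age_cat, y = mean_hgb, ymin = lower, ymax = upper)) +
  geom_pointrange(shape = 20) +
  coord_flip() +
  xlab("Age category (years)") +
  ylab("Mean Hemoglobin (95% CI)") +
  theme_bw()
```

With the real data, replacing toy with anemia should be the only change needed, since both have age and hgb columns.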

Kaplan-Meier plots - WIHS data

· Women’s Interagency HIV Study (WIHS) is an ongoing observational cohort study with semiannual visits at 10 sites in the US
· Data on 1,164 patients who were HIV-positive, free of clinical AIDS, and not on antiretroviral therapy (ART) at study baseline (Dec. 6, 1995)
· Contains information on age, race, CD4 count, drug use, ARV treatment, and time to AIDS/death

Kaplan-Meier plots

· MANY package options to plot survival functions
· All use the survival package to calculate survival over time
  - survfit (survival) + survplot (rms)
  - ggkm (sachsmc/ggkm) & ggplot2
  - ggkm (michaelway/ggkm)
· Allows for multiple treatments and subgroups
· Does not take into account competing risks

Kaplan-Meier example 1

· Calculate KM within ggplot
· https://github.com/sachsmc/ggkm
· Prep data

wihs$outcome <- ifelse(is.na(wihs$art), 0, 1)
wihs$time <- ifelse(is.na(wihs$aids_death_art), wihs$dropout, wihs$aids_death_art)
wihs <- wihs %>%
  mutate(time = ifelse(is.na(time), study_end, time))

KM plot within ggplot2

devtools::install_github("sachsmc/ggkm")
library(ggkm)

ggplot(wihs, aes(time = time, status = outcome)) +
  geom_km()

KM by treatment group

ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) +
  geom_km()

Add confidence bands

ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) +
  geom_km() +
  geom_kmband()

KM example #2

· Calculated using the survival package
· Plots KM curve with numbers at risk
· Same package name as previous example!
· https://github.com/michaelway/ggkm

# Remove the sachsmc version first, since both packages are named ggkm
remove.packages("ggkm")
devtools::install_github("michaelway/ggkm")
library(ggkm)

KM example 2

library(survival)  # provides Surv() and survfit()

fit <- survfit(Surv(time, outcome) ~ idu, data=wihs)
ggkm(fit)

KM with numbers at risk

ggkm(fit, table=TRUE, marks = FALSE,
     ystratalabs = c("No IDU", "History of IDU"))

Cumulative incidence plots

· 1 - survival probability
· ipwrisk package - coming soon!
  - calculates adjusted cumulative incidence curves using IPTW
  - addresses censoring (IPCW) and competing risks
  - produces tables and graphics
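The first point, cumulative incidence as 1 minus the survival probability, can be sketched with the survival package alone. This is an unadjusted illustration that uses the lung dataset shipped with survival as a stand-in (an assumption; it is not the WIHS data), and like the Kaplan-Meier plots above it does not account for competing risks.

```r
library(survival)
library(ggplot2)

# Illustrative only: in survival::lung, status == 2 marks a death and
# time is follow-up in days.
fit <- survfit(Surv(time, status == 2) ~ 1, data = survival::lung)

# Flip the Kaplan-Meier estimate to get the cumulative incidence.
ci <- data.frame(time   = fit$time,
                 cuminc = 1 - fit$surv)

p <- ggplot(ci, aes(x = time, y = cuminc)) +
  geom_step() +
  xlab("Days") +
  ylab("Cumulative incidence") +
  theme_bw()
```

geom_step() is used so the curve stays flat between event times, matching the step shape of the Kaplan-Meier estimate it is derived from.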

Sankey diagram

· Visualization that shows the flow of patients between states (over time)
· States, or nodes, can be treatments, comorbidities, hospitalizations, etc.
· Paths connecting states are called links; the thickness of each line corresponds to the proportion of patients
· Example: https://vizhub.healthdata.org/dex/

Basic sankey diagrams in R

library(networkD3)
library(reshape2)
library(magrittr)

nodes <- data.frame(name=c("Renal Failure", "Hemodialysis at 6m", "Transplant at 6m",
                           "Death by 6m", "Hemodialysis at 12m", "Transplant at 12m",
                           "Death by 12m"))

links <- data.frame(source=c(0,0,0,1,1,1,2,2,2,3),
                    target=c(1,2,3,4,5,6,4,5,6,6),
                    value=c(70,20,10,40,20,10,15,4,1,10))

sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
              Target = "target", Value = "value", NodeID = "name",
              fontSize = 22, nodeWidth = 30, nodePadding = 5)
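In the example above the links are typed in by hand. When the starting point is patient-level data, the same two tables could be derived roughly as follows. This is a sketch under assumptions: the patients data frame and its column names are hypothetical, and dplyr is used for the counting.

```r
library(dplyr)

# Hypothetical patient-level data: one row per patient, with the state
# observed at each of two time points.
patients <- data.frame(
  state_6m  = c("Hemodialysis", "Hemodialysis", "Transplant", "Death", "Hemodialysis"),
  state_12m = c("Hemodialysis", "Transplant",   "Transplant", "Death", "Death")
)

# Count patients for each observed transition; each count becomes a link value.
links_named <- patients %>%
  mutate(source = paste(state_6m, "at 6m"),
         target = paste(state_12m, "at 12m")) %>%
  count(source, target, name = "value")

# One node per distinct state-by-time label.
nodes <- data.frame(name = unique(c(links_named$source, links_named$target)))

# sankeyNetwork() expects zero-based row indices into nodes.
links <- links_named %>%
  mutate(source = match(source, nodes$name) - 1,
         target = match(target, nodes$name) - 1)
```

The resulting links and nodes can then be passed to sankeyNetwork() with the same Source, Target, Value and NodeID arguments as in the example above.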

Final Tips

· Spend time planning your graph
· Make sure to have the data in the correct structure before you start graphing
· Start with a simple graph, then gradually build in complexity

Further reading

· ggplot2: http://docs.ggplot2.org/current/
· Cookbook for R: http://www.cookbook-r.com/Graphs/
· Quick-R: http://www.statmethods.net/index.html

Wrap-up

· Questions?
· Acknowledgements: Alan Brookhart, Sara Levintow
· Contact info: lmcgrath@novisci.com