ggplot2 for Epi Studies Leah McGrath, PhD November 13, 2017

ggplot2 for Epi Studies - University of North Carolina at


Introduction

· Know your data: data exploration is an important part of research
· Data visualization is an excellent way to explore data
· ggplot2 is an elegant R library that makes it easy to create compelling graphs
· Plots can be iteratively built up and easily modified

Learning objectives

· To create graphs used in manuscripts for epidemiology studies
· To review and incorporate previously learned aspects of formatting graphs
· To demonstrate novel data visualizations using Shiny

ggplot architecture review

· Aesthetics: specify the variables to display
    - what are x and y?
    - can also link variables to color, shape, size and transparency
· "geoms": specify type of plot
    - do you want a scatter plot, line, bars, densities, or other type of plot?
· Scales: for transforming variables (e.g., log, sq. root)
    - also used to set legend: title, breaks, labels
· Facets: creating separate panels for different factors
· Themes: adjust appearance: background, fonts, etc.
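As a minimal sketch of how aesthetics, geoms, scales, facets and themes combine, the block below builds one plot layer by layer. It uses simulated data: `toy` and its columns `age`, `hgb` and `sex` are stand-ins invented for illustration, not the NHANES file used in these slides.

```r
library(ggplot2)

# Simulated stand-in data; values are illustrative only
set.seed(1)
toy <- data.frame(
  age = runif(200, 20, 80),
  sex = factor(sample(c("Male", "Female"), 200, replace = TRUE))
)
toy$hgb <- 13 + 1.5 * (toy$sex == "Male") + rnorm(200, sd = 1)

p <- ggplot(toy, aes(x = age, y = hgb, color = sex)) +  # aesthetics
  geom_point(alpha = 0.5) +                             # geom
  scale_x_log10(name = "Age (years, log scale)") +      # scale: transform x
  scale_color_discrete(name = "Sex") +                  # scale: legend title
  facet_wrap(~sex) +                                    # facets
  theme_bw()                                            # theme
p
```

Each `+` adds one component, so any line can be dropped or swapped without touching the others.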

Hemoglobin data

· Data from the National Health and Nutrition Examination Survey (NHANES) dataset, 1999-2000
· Contains data on n = 3,990 participants
· The file was created by merging the demographic data with the complete blood count file and the nutritional biochemistry lab file
· Contains measures of hemoglobin, iron status, and other anemia-related parameters

Anemia data codebook

· age = age of participant (years)
· sex = sex of participant (Male vs Female)
· tsat = transferrin saturation (%)
· iron = total serum iron (ug/dL)
· hgb = hemoglobin concentration (g/dL)
· ferr = serum ferritin (ng/mL)
· folate = serum folate (ng/mL)
· race = participant race (Hispanic, White, Black, Other)
· rdw = red cell distribution width (%)
· wbc = white blood cell count (SI)
· anemia = indicator variable for anemia (according to WHO definition)
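The slides do not show how the `anemia` indicator was derived. The sketch below is one possible derivation, assuming the commonly cited WHO adult hemoglobin cut-offs (below 13 g/dL for men, below 12 g/dL for non-pregnant women). The records in `toy` are hypothetical, and the actual file may have used additional criteria.

```r
# Hypothetical records using the codebook's variable names
toy <- data.frame(
  sex = factor(c("Male", "Male", "Female", "Female"),
               levels = c("Male", "Female")),
  hgb = c(14.1, 12.5, 12.4, 10.9)
)

# Assumed adult cut-offs in g/dL: 13 for men, 12 for non-pregnant women
cutoff <- ifelse(toy$sex == "Male", 13, 12)

# 1 = anemic, 0 = not anemic
toy$anemia <- as.numeric(toy$hgb < cutoff)
toy
```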

Scatter plot review: hemoglobin by age, stratified by ethnicity and sex

ggplot(data = anemia, aes(x = age, y = hgb, color = sex)) +
  geom_smooth() +
  geom_jitter(aes(size = 1/iron), alpha = 0.1) +
  xlab("Age") + ylab("Hemoglobin (g/dL)") +
  scale_size(name = "Iron Deficiency") +
  scale_color_discrete(name = "Sex") +
  facet_wrap(~race) +
  theme_bw()

Box plots

ggplot(data=anemia, aes(x=race,y=hgb)) + geom_boxplot()


Box plots with points

ggplot(data = anemia, aes(x = race, y = hgb, color = sex)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.1)

Box plots with coordinates flipped

ggplot(data = anemia, aes(x = race, y = hgb, color = sex)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.1) +
  coord_flip()

Violin plots

· Kernel density estimates that are placed on each side and mirrored so that they form a symmetrical shape
· Easy to compare several distributions
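To make the "mirrored kernel density" idea concrete, here is a rough sketch in base R using simulated values. It only illustrates the shape; `geom_violin()` does its own density computation and scaling, so the outline will not match it exactly.

```r
# Simulated hemoglobin-like values (illustrative only)
set.seed(42)
x <- rnorm(300, mean = 14, sd = 1.5)

# Kernel density estimate with base R defaults
d <- density(x)

# Mirror the density around zero to get a violin-like outline
outline <- data.frame(
  value = c(d$x, rev(d$x)),
  width = c(d$y, -rev(d$y))
)

plot(outline$width, outline$value, type = "l",
     xlab = "Density (mirrored)", ylab = "Value")
```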

Violin plots

ggplot(data=anemia, aes(x=race,y=hgb,color=race)) + geom_violin()


Violin plots with underlying data points

ggplot(data = anemia, aes(x = race, y = hgb, color = race)) +
  geom_violin() +
  geom_jitter(alpha = 0.1)

Violin plots stratified by 2 variables

ggplot(data=anemia, aes(x=sex,y=hgb,color=race)) + geom_violin()


Violin plots & boxplot with no outliers

ggplot(data = anemia, aes(x = race, y = hgb, color = race)) +
  geom_violin() +
  # narrow boxplot inside each violin, outliers hidden
  geom_boxplot(width = .1, fill = "black", outlier.color = NA) +
  # white point marking the median (fun replaces the deprecated fun.y)
  stat_summary(fun = median, geom = "point", fill = "white", shape = 21, size = 2.5)

Practice

· Use the anemia dataset to practice making scatterplots, boxplots, and violin plots
· Try faceting, flipping orientation, changing colors and labels

str(anemia)

## Classes 'tbl_df', 'tbl' and 'data.frame': 3990 obs. of 13 variables:
##  $ age   : num 77 49 59 43 37 70 81 38 85 23 ...
##  $ sex   : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 2 2 2 ...
##  $ tsat  : num 16.3 41.5 27.6 28 19.7 18.5 16.9 27.1 13.4 35.8 ...
##  $ iron  : num 65 141 96 83 64 75 65 97 38 136 ...
##  $ hgb   : num 14.1 14.5 13.4 15.4 16 16.8 16.6 13.3 10.9 14.5 ...
##  $ ferr  : num 55 198 155 32 68 87 333 33 166 48 ...
##  $ folate: num 24.6 17.1 12.2 13.5 23 46.9 14.6 6.1 30.3 19.9 ...
##  $ vite  : num 1488 1897 1311 528 3092 ...
##  $ vita  : num 74.9 84.6 54 41.9 72.5 ...
##  $ race  : Factor w/ 4 levels "Hispanic","White",..: 2 2 3 3 2 1 2 2 3 1 ...
##  $ rdw   : num 13.7 13.1 14.3 13.7 13.6 14.4 12.4 11.9 14.1 11.4 ...
##  $ wbc   : num 7.6 5.9 4.9 4.6 10.2 11.6 9.1 7.6 7.4 5.6 ...
##  $ anemia: num 0 0 0 0 0 0 0 0 1 0 ...
##  - attr(*, "na.action")=Class 'omit' Named int [1:805] 26 28 32 33 36 37 38 39 45 54 ...
##   .. ..- attr(*, "names")= chr [1:805] "26" "28" "32" "33" ...

Forest plots

· First gather the data into the proper format, including the following variables:
    - Estimate
    - Lower CI
    - Upper CI
    - Grouping variable
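The four variables listed above might look like this in practice: one row per group, with the estimate and its interval in separate columns. The data frame `forest_toy` and all of its numbers are made up for illustration only.

```r
library(ggplot2)

# Hypothetical estimates; one row per group to be plotted
forest_toy <- data.frame(
  group    = c("Male", "Female", "Age < 50", "Age >= 50"),
  estimate = c(15.0, 13.4, 14.3, 14.0),
  lower    = c(14.8, 13.2, 14.1, 13.8),
  upper    = c(15.2, 13.6, 14.5, 14.2)
)

# Point = estimate, line = confidence interval, axes flipped so groups read as rows
p <- ggplot(forest_toy,
            aes(x = group, y = estimate, ymin = lower, ymax = upper)) +
  geom_pointrange() +
  coord_flip()
p
```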

Forest plots

· For this example, we take the mean and calculate the upper and lower confidence interval for hemoglobin.
· We will stack the row observations into one variable called "Type".

library(dplyr)

# Mean hemoglobin with a normal-approximation 95% CI, by sex
anemia1 <- anemia %>%
  group_by(Type = sex) %>%
  summarise(mean = mean(hgb), n = n(),
            lower = mean - 1.96 * sd(hgb) / sqrt(n),
            upper = mean + 1.96 * sd(hgb) / sqrt(n))

# Same summary, by race
anemia2 <- anemia %>%
  group_by(Type = race) %>%
  summarise(mean = mean(hgb), n = n(),
            lower = mean - 1.96 * sd(hgb) / sqrt(n),
            upper = mean + 1.96 * sd(hgb) / sqrt(n))

# Stack both summaries; Type holds the sex and race labels
anemia3 <- bind_rows(anemia1, anemia2)

Forest plots

ggplot(data=anemia3, aes(x=Type, y=mean, ymin=lower, ymax=upper)) + geom_pointrange()


Forest plots: flip the axes, add labels

ggplot(data = anemia3, aes(x = Type, y = mean, ymin = lower, ymax = upper)) +
  geom_pointrange(shape = 20) +
  coord_flip() +
  xlab("Demographics") +
  ylab("Mean Hemoglobin (95% CI)") +
  theme_bw()

Forest plots: calculating mean and CI within ggplot

· ggplot can calculate the mean and CI using stat_summary
· Further data manipulation would be needed to stack multiple variables
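`stat_summary()` accepts any function that returns `y`, `ymin` and `ymax` through its `fun.data` argument. The sketch below uses a hand-written helper, `mean_ci()` (a hypothetical name, not part of ggplot2), with the same 1.96 normal approximation used elsewhere in these slides, applied to simulated data.

```r
library(ggplot2)

# Hypothetical helper: mean with a normal-approximation 95% CI
mean_ci <- function(x) {
  m  <- mean(x)
  se <- sd(x) / sqrt(length(x))
  data.frame(y = m, ymin = m - 1.96 * se, ymax = m + 1.96 * se)
}

# Simulated data (illustrative only)
set.seed(3)
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 50),
  value = rnorm(150, mean = 14, sd = 1.5)
)

p <- ggplot(toy, aes(x = group, y = value)) +
  stat_summary(fun.data = mean_ci, geom = "pointrange") +
  coord_flip()
p
```

Writing the helper by hand keeps the interval definition visible and avoids extra package dependencies.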

Calculating mean and CI within ggplot

# mean_cl_normal() needs the Hmisc package to be installed
ggplot(anemia, aes(x = race, y = hgb)) +
  stat_summary(fun.data = mean_cl_normal) +
  coord_flip() +
  theme_bw() +
  xlab("Demographics") +
  ylab("Mean Hemoglobin (95% CI)")

Forest plots: adding faceting

# any.fit3 (columns V3, A1, lower, upper, setting) is not defined in these slides
ggplot(any.fit3, aes(x = V3, y = A1, ymin = lower, ymax = upper)) +
  geom_pointrange(shape = 20) +
  coord_flip() +
  xlab("Predictor Variable") +
  ylab("Adjusted Risk Difference per 100 (95% CI)") +
  scale_y_continuous(breaks = c(-20, -15, -10, -5, 0, 5, 10, 15, 20, 25),
                     limits = c(-21, 26)) +
  theme_bw() +
  geom_hline(yintercept = 0, lty = 2) +   # dashed reference line at no difference
  facet_grid(setting ~ ., scales = 'free', space = 'free')

Practice

· Use the anemia dataset to practice making forest plots using other continuous variables
· Use dplyr to create a new, categorized age variable (hint: factor this before graphing). Create a forest plot of mean hemoglobin by age category.
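One possible approach to the second exercise is sketched below on simulated data. The cut points (20, 40, 60) and the names `toy`, `age_cat` and `by_age` are arbitrary choices for illustration; `cut()` returns a factor, which takes care of the hint about factoring before graphing.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the age and hgb columns (illustrative only)
set.seed(7)
toy <- data.frame(age = sample(20:85, 500, replace = TRUE))
toy$hgb <- rnorm(500, mean = 14, sd = 1.5)

by_age <- toy %>%
  # categorize age into ordered bands: [20,40), [40,60), [60,Inf)
  mutate(age_cat = cut(age, breaks = c(20, 40, 60, Inf),
                       labels = c("20-39", "40-59", "60+"),
                       right = FALSE)) %>%
  group_by(age_cat) %>%
  summarise(n = n(),
            mean_hgb = mean(hgb),
            se = sd(hgb) / sqrt(n),
            lower = mean_hgb - 1.96 * se,
            upper = mean_hgb + 1.96 * se)

p <- ggplot(by_age, aes(x = age_cat, y = mean_hgb, ymin = lower, ymax = upper)) +
  geom_pointrange() +
  coord_flip() +
  xlab("Age category") +
  ylab("Mean Hemoglobin (95% CI)")
p
```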

Kaplan-Meier plots - WIHS data

· Women’s Interagency HIV Study (WIHS) is an ongoing observational cohort study with semiannual visits at 10 sites in the US
· Data on 1,164 patients who were HIV-positive, free of clinical AIDS, and not on antiretroviral therapy (ART) at study baseline (Dec. 6, 1995)
· Contains information on age, race, CD4 count, drug use, ARV treatment, and time to AIDS/death

Kaplan-Meier plots

· MANY package options to plot survival functions
· All use the survival package to calculate survival over time
    - survfit (survival) + survplot (rms)
    - ggkm (sachsmc/ggkm) & ggplot2
    - ggkm (michaelway/ggkm)
· Allows for multiple treatments and subgroups
· Does not take into account competing risks
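Since every option above relies on the survival package for the estimate itself, it may help to see that step in isolation. This sketch fits a Kaplan-Meier curve with `survfit()` on a small hypothetical data set and draws it with plain ggplot2; it does not use either ggkm package.

```r
library(survival)
library(ggplot2)

# Hypothetical follow-up data: time in years, status 1 = event, 0 = censored
toy <- data.frame(
  time   = c(1, 2, 2, 3, 4, 5, 5, 6, 7, 8),
  status = c(1, 0, 1, 1, 0, 1, 0, 1, 0, 0)
)

# Kaplan-Meier estimate for the whole sample
fit <- survfit(Surv(time, status) ~ 1, data = toy)

# Add the starting point (time 0, survival 1) and plot as a step function
km <- data.frame(time = c(0, fit$time), surv = c(1, fit$surv))

p <- ggplot(km, aes(x = time, y = surv)) +
  geom_step() +
  ylim(0, 1) +
  xlab("Time (years)") +
  ylab("Survival probability")
p
```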

Kaplan-Meier example 1

· Calculate KM within ggplot
· https://github.com/sachsmc/ggkm
· Prep data

# Event indicator: 1 if art is recorded, 0 if missing
wihs$outcome <- ifelse(is.na(wihs$art), 0, 1)

# Follow-up time: event time if available, otherwise dropout time
wihs$time <- ifelse(is.na(wihs$aids_death_art), wihs$dropout, wihs$aids_death_art)

# If both are missing, follow-up ends at study end
wihs <- wihs %>% mutate(time = ifelse(is.na(time), study_end, time))

KM plot within ggplot2

devtools::install_github("sachsmc/ggkm")
library(ggkm)

ggplot(wihs, aes(time = time, status = outcome)) +
  geom_km()

KM by treatment group

ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) +
  geom_km()

Add confidence bands

ggplot(wihs, aes(time = time, status = outcome, color = factor(idu))) +
  geom_km() +
  geom_kmband()

KM example #2

· Calculated using survival package
· Plots KM curve with numbers at risk
· Same package name as previous example!
· https://github.com/michaelway/ggkm

# Remove the sachsmc version first, since both packages are called ggkm
remove.packages("ggkm")
devtools::install_github("michaelway/ggkm")
library(ggkm)

KM example 2

library(survival)

fit <- survfit(Surv(time, outcome) ~ idu, data = wihs)
ggkm(fit)

KM with numbers at risk

ggkm(fit, table = TRUE, marks = FALSE,
     ystratalabs = c("No IDU", "History of IDU"))

Cumulative incidence plots

· 1 - survival probability
· ipwrisk package - coming soon!
    - calculates adjusted cumulative incidence curves using IPTW
    - addresses censoring (IPCW) and competing risks
    - produces tables and graphics
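The "1 - survival probability" relationship can be sketched directly from a `survfit()` object. The data here are hypothetical, and the result is a crude, unadjusted curve: it does not handle competing risks or weighting and is not the ipwrisk method.

```r
library(survival)

# Hypothetical follow-up data: status 1 = event, 0 = censored
toy <- data.frame(
  time   = c(1, 2, 2, 3, 4, 5, 5, 6, 7, 8),
  status = c(1, 0, 1, 1, 0, 1, 0, 1, 0, 0)
)

fit <- survfit(Surv(time, status) ~ 1, data = toy)

# Cumulative incidence as the complement of Kaplan-Meier survival
cuminc <- data.frame(time = c(0, fit$time), risk = c(0, 1 - fit$surv))

plot(cuminc$time, cuminc$risk, type = "s", ylim = c(0, 1),
     xlab = "Time", ylab = "Cumulative incidence")
```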

Sankey diagram

· Visualization that shows the flow of patients between states (over time)
· States, or nodes, can be treatments, comorbidities, hospitalizations, etc.
· Paths connecting states are called links - proportion corresponds to thickness of line
· Example: https://vizhub.healthdata.org/dex/
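In practice, the link values usually have to be counted from patient-level records. The following is a rough sketch in base R under assumed inputs: `patients`, `state_6m` and `state_12m` are hypothetical names, and the five records are invented. It produces zero-based `source` and `target` indices, the convention the networkD3 `sankeyNetwork()` function expects.

```r
# Hypothetical patient-level states at two time points
patients <- data.frame(
  state_6m  = c("Hemodialysis", "Hemodialysis", "Transplant",
                "Hemodialysis", "Transplant"),
  state_12m = c("Hemodialysis", "Transplant", "Transplant",
                "Death", "Transplant")
)

# Count patients on each path between a 6-month and a 12-month state
links <- as.data.frame(
  table(source = paste(patients$state_6m, "at 6m"),
        target = paste(patients$state_12m, "at 12m")),
  responseName = "value", stringsAsFactors = FALSE
)
links <- links[links$value > 0, ]   # drop paths no patient took

# Node table, then replace labels with zero-based node indices
nodes <- data.frame(name = unique(c(links$source, links$target)))
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1

links
nodes
```

The resulting `links` and `nodes` data frames have the same column layout as the hand-typed ones used with `sankeyNetwork()`.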

Basic sankey diagrams in R

library(networkD3)
library(reshape2)
library(magrittr)

# Nodes are referenced by zero-based position in this table
nodes <- data.frame(name = c("Renal Failure", "Hemodialysis at 6m", "Transplant at 6m",
                             "Death by 6m", "Hemodialysis at 12m", "Transplant at 12m",
                             "Death by 12m"))

# Each link connects a source node to a target node; value sets its thickness
links <- data.frame(source = c(0, 0, 0, 1, 1, 1, 2, 2, 2, 3),
                    target = c(1, 2, 3, 4, 5, 6, 4, 5, 6, 6),
                    value  = c(70, 20, 10, 40, 20, 10, 15, 4, 1, 10))

sankeyNetwork(Links = links, Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name",
              fontSize = 22, nodeWidth = 30, nodePadding = 5)

Final Tips

· Spend time planning your graph
· Make sure to have the data in the correct structure before you start graphing
· Start with a simple graph, gradually build in complexity

Further reading

· ggplot2: http://docs.ggplot2.org/current/
· Cookbook for R: http://www.cookbook-r.com/Graphs/
· Quick-R: http://www.statmethods.net/index.html

Wrap-up

· Questions?
· Acknowledgements: Alan Brookhart, Sara Levintow
· Contact info: [email protected]