Why a waterfall plot?

case studies

example code

waterfall plot

defaults

Questioning the defaults.

Authors

Marc Vandemeulebroecke

Mark Baillie

Alison Margolskee

Baldur Magnusson

Published

December 19, 2023

This case study illustrates the importance of aligning a graph with the scientific question it should address, the option of filtering signals through a model, and finally the display of a scientific answer in a condensed messaging graph.

Consider a small early development trial, randomized and placebo-controlled (2:1 randomization), with a continuous primary endpoint measured at baseline and longitudinally over a period of 4 weeks. Lower outcome values are better, and there are no dropouts and no missing data. Suppose that the team is interested in the effect of the drug at the last measurement time point, as it is often the case. A common approach in early development trials is to simply plot the observed change scores in a so-called “waterfall plot” such as Figure 1.

ggplot(t4dat, aes(x = sort_id, y = cfb, fill = factor(ztext))) +
  geom_col(alpha = 0.8) +
  scale_fill_brewer(palette = "Dark2", name = "Treatment") +
  paper_theme() +
  theme(
    legend.position = c(0.8, 0.2),
    legend.background = element_blank(),
    legend.title = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.x = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.border = element_rect(color = "grey", fill = NA)
  ) +
  labs(x = "Subject", y = "Change from baseline",
       title = "Week 4 outcome by treatment")

To probe Law 1, what is the question addressed by this plot? It asks about the treatment effect after 4 weeks of treatment. Is this the right question? Let us assume for a moment that it is. Then a waterfall plot is not ideal for addressing it. Small treatment effects are difficult to discern, especially with an unbalanced randomization ratio. The audience must observe the distribution of color across the entire plot just to determine which treatment is more beneficial; this can become even more difficult with a larger sample size or more than two treatment groups. In Figure 1, one might see a treatment benefit, but how large is it and how certain of it are we? The popularity of waterfall plots is a mystery.

If we insist on week 4 as the only time point of interest, we could present overlaid density plots or side-by-side boxplots for a better appreciation of the difference in distribution between the two treatment arms. Figure 2 shows an example with the raw data points included, which is a much better alternative to Figure 1. The side-by-side placement facilitates the treatment comparison, and the plot is simple, familiar and uses minimal ink for what it shows. Graphical attributes (colors, font size, etc.) are easily readable.

ggplot(data = t4dat, aes(x = ztext, y = cfb, colour = ztext)) +
  geom_boxplot(width = 0.25) +
  geom_jitter(alpha = 0.25, width = 0.1) +
  scale_colour_brewer(palette = "Dark2") +
  labs(x = "", y = "Change from baseline") +
  paper_theme() +
  theme(legend.position = "none") +
  labs(x = "", y = "Change from baseline",
       title = "Week 4 outcome") +
  theme(
    panel.border = element_rect(color = "grey", fill = NA, size = 0.25),
    plot.title = element_text(size = 12),
    axis.text.y = element_text(size = 8),
    axis.text.x = element_text(size = 10),
    axis.title.y = element_text(size = 10),
    axis.ticks.x = element_blank(),
    panel.grid.major.x = element_blank()
  )

However, with such rich longitudinal data, it may be more informative to ask the question about the treatment effect during – not just after – the first 4 weeks of treatment. This is especially relevant in the early, more exploratory development phase (and it would be even more relevant if there were dropouts). As a rule, the recommended first step is to visualize the totality of the data. Figure 3 does this and includes means by treatment and time point. We see large inter-individual variability and overlap between the treatment groups. We also start to get an appreciation for the time-course of a mean effect. We see linear trajectories of the means over time, with the active arm appearing to improve and the placebo arm staying fairly constant. We cannot exclude that the apparent gap might continue to increase beyond 4 weeks of treatment. This plot, while doing little more than displaying the raw data, is already worth sharing with the project team. It facilitates a much richer understanding of the data than the previous two plots. It shows the data clearly i.e. Law 2.

md <- dres %>%
  group_by(time, ztext) %>%
  summarise(
    m = mean(y),
    s = sd(y),
    n = n(),
    se = s / sqrt(n)
  )

ggplot() +
  theme(legend.position = c(0.8, 0.65),
        legend.background = element_blank()) +
  geom_line(data = dres,
            aes(
              x = time,
              y = y,
              group = id,
              colour = factor(ztext)
            ),
            alpha = 0.35) +
  geom_line(data = md,
            aes(
              x = time,
              y = m,
              colour = factor(ztext)
            ),
            size = 1) +
  geom_point(data = md, aes(
    x = time,
    y = m,
    ymin = m - s,
    ymax = m + s,
    colour = ztext
  )) +
  scale_y_continuous(limits = c(min(dres$y) - 2.5, max(dres$y))) +
  scale_colour_brewer(palette = "Dark2", name = "") +
  labs(x = "Week", y = "Outcome",
       title = "Longitudinal individual outcomes \nwith group means") +
  paper_theme() +
  theme(
    panel.border = element_rect(color = "grey", fill = NA, size = 0.25),
    plot.title = element_text(size = 10),
    axis.text.y = element_text(size = 8),
    axis.text.x = element_text(size = 8),
    axis.title.y = element_text(size = 10),
    axis.title.x = element_text(size = 10),
    legend.position = c(0.15, 0.225),
    legend.background = element_blank()
  )

Depending on the goal of the analysis, we could stop here. But if we want to quantify the treatment difference while adjusting for important covariates, we should proceed with a statistical model. Based on Figure 3 a linear model appears appropriate. We fit a linear model with treatment, patient-specific intercept and slope, and we now also adjust for the baseline value of the primary endpoint and for any other important covariates. We can then visualize the data filtered through this model, omitting the raw data but displaying longitudinal point estimates and some form of uncertainty intervals for both treatment groups (Figure 4). This gets closer to the nature of a messaging graph, focusing directly on the results of our model. Optionally – and depending on the audience! – we could even go one final step further and display the treatment difference directly, as in Figure 5. This plot addresses the question about the treatment effect over time without requiring any mental arithmetic. We can read off approximate estimates for the treatment effect, and the level of confidence is easily appreciable from the added confidence band (which does include zero!). Appropriate and parsimonious annotations make the message even more obvious, Law 3, also through “effectively redundant” information (stating what can be seen).

## Model fit to longtiduinal data
adat <- dres %>%
  filter(time > 0)

mod <-
  stan_lmer(
    y ~ time * factor(z) + baseline + x1 + x2 + (time | id),
    data = adat,
    chains = 4,
    iter = 500,
    cores = 4
  )

nd1 <- dres %>%
  select(id, time, baseline, x1, x2, z)

nd2 <- dres %>%
  select(id, time, baseline, x1, x2, z) %>%
  mutate(z = 1 - z)

nd <- bind_rows(nd1, nd2) %>%
  arrange(id, time, z)

prs <- c(0.05, 0.5, 0.95)

ppnd <- nd %>% select(id, time, z) %>%
  bind_cols(as_tibble(t(
    posterior_linpred(mod, newdata = nd, re.form =  ~ 0)
  ))) %>%
  gather(4:1003, key = "ppid", value = "ypred") %>%
  spread(z, ypred) %>%
  mutate(contr = `1` - `0`) %>%
  group_by(ppid, time) %>%
  do({
    m0 <- mean(.$`0`)
    m1 <- mean(.$`1`)
    mc <- mean(.$contr)
    as_tibble(cbind(m0, m1, mc))
  }) %>%
  ungroup() %>%
  group_by(time) %>%
  do({
    q0 <- t(quantile(.$m0, probs = prs))
    q1 <- t(quantile(.$m1, probs = prs))
    contr <- t(quantile(.$mc, probs = prs))
    as_tibble(rbind(q0, q1, contr)) %>%
      mutate(var = c("Placebo", "Active", "Contrast"))
  })

## Treatment difference, median & 90% CI errorbars
ppnd %>%
  filter(var != "Contrast") %>%
  ggplot(aes(
    x = time,
    y = `50%`,
    ymin = `5%`,
    ymax = `95%`,
    colour = factor(var, levels = c("Placebo", "Active"))
  )) +
  geom_pointrange(position = position_dodge(width = 0.3)) +
  geom_line(position = position_dodge(width = 0.3)) +
  scale_color_brewer(palette = "Dark2", name = "") +
  paper_theme() +
  theme(legend.position = c(0.15, 0.245),
        legend.background = element_blank()) +
  theme(
    panel.border = element_rect(color = "grey", fill = NA, size = 0.25),
    plot.title = element_text(size = 10, margin = rep(unit(
      0.01 * 234, "mm"
    ), 3)),
    plot.subtitle = element_text(
      size = 8,
      color = rgb(0.3, 0.3, 0.3),
      margin = rep(unit(0.001 * 234, "mm"), 3)
    ),
    axis.text.y = element_text(size = 8),
    axis.text.x = element_text(size = 8),
    axis.title.y = element_text(size = 10),
    axis.title.x = element_text(size = 10)
  ) +
  labs(
    x = "Week",
    y = "Outcome",
    title = "Active group improves over time",
    subtitle = "Posterior median (90% credible interval)"
  )

Treatment difference, median & 90% CI errorbars

It is worth emphasizing that this last plot should not be the only one generated, and probably not the only one shown either. Strongly reduced messaging graphs require a robust understanding of the underlying data, which can only be built through a workflow such as the one described above. Further, depending on the situation and the audience, they might be challenged as loaded or unscientific. (E.g., the apparently perfect linear trend in Figure 5 appears “unrealistic”.) It is therefore important to ensure and emphasize that this last plot derives from a model which (as every model) is intended to separate the signal from the noise, and that the choice of this model is justified by a thorough inspection of the data.

ppnd %>%
  filter(var == "Contrast" & time > 0) %>%
  ggplot(aes(x = time, y = `50%`)) +
  geom_ribbon(aes(ymin = `5%`, ymax = `95%`), fill = "black", alpha = 0.15) +
  geom_point(size = 1.5) + geom_line(size = 1) +
  geom_hline(yintercept = 0, linetype = 2)  +
  labs(
    x = "Week",
    y = "Treatment difference",
    title = "Treatment effect increases over time",
    subtitle = "Posterior median (90% credible interval)"
  ) +
  geom_segment(
    aes(
      x = 0.98,
      y = -0.01,
      xend = 0.98,
      yend = -3.5
    ),
    arrow = arrow(length = unit(0.02 * 234, "mm")),
    alpha = 0.25
  ) +
  annotate(
    "text",
    label = "Greater benefit",
    x = 1.4,
    y = -3.4,
    size = 5,
    alpha = 0.75
  ) +
  paper_theme() +
  theme(
    panel.border = element_rect(color = "grey", fill = NA, size = 0.25),
    plot.subtitle = element_text(size = 10, color = rgb(0.3, 0.3, 0.3))
  )

Treatment difference, median & 90% CI ribbon

sessionInfo()

## R version 4.3.0 (2023-04-21)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.6.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Zurich
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils    
## [6] datasets  methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-3 data.table_1.14.8 
##  [3] gridExtra_2.3      rstanarm_2.26.1   
##  [5] Rcpp_1.0.10        caTools_1.18.2    
##  [7] lubridate_1.9.2    forcats_1.0.0     
##  [9] stringr_1.5.0      dplyr_1.1.2       
## [11] purrr_1.0.1        readr_2.1.4       
## [13] tidyr_1.3.0        tibble_3.2.1      
## [15] ggplot2_3.4.2      tidyverse_2.0.0   
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7         inline_0.3.19       
##  [3] rlang_1.1.1          magrittr_2.0.3      
##  [5] matrixStats_1.2.0    compiler_4.3.0      
##  [7] loo_2.6.0            callr_3.7.3         
##  [9] vctrs_0.6.2          reshape2_1.4.4      
## [11] pkgconfig_2.0.3      crayon_1.5.2        
## [13] fastmap_1.1.1        backports_1.4.1     
## [15] ellipsis_0.3.2       labeling_0.4.2      
## [17] utf8_1.2.3           threejs_0.3.3       
## [19] promises_1.2.0.1     rmarkdown_2.22      
## [21] markdown_1.7         tzdb_0.4.0          
## [23] ps_1.7.5             nloptr_2.0.3        
## [25] xfun_0.39            jsonlite_1.8.4      
## [27] later_1.3.1          parallel_4.3.0      
## [29] prettyunits_1.1.1    R6_2.5.1            
## [31] dygraphs_1.1.1.6     stringi_1.7.12      
## [33] StanHeaders_2.26.28  boot_1.3-28.1       
## [35] rstan_2.32.3         knitr_1.43          
## [37] zoo_1.8-12           base64enc_0.1-3     
## [39] bayesplot_1.10.0     httpuv_1.6.11       
## [41] Matrix_1.5-4         splines_4.3.0       
## [43] igraph_1.6.0         timechange_0.2.0    
## [45] tidyselect_1.2.0     abind_1.4-5         
## [47] rstudioapi_0.14      yaml_2.3.7          
## [49] codetools_0.2-19     miniUI_0.1.1.1      
## [51] curl_5.0.0           processx_3.8.1      
## [53] pkgbuild_1.4.2       lattice_0.21-8      
## [55] plyr_1.8.8           shiny_1.7.5         
## [57] withr_2.5.0          posterior_1.5.0     
## [59] evaluate_0.21        survival_3.5-5      
## [61] RcppParallel_5.1.7   xts_0.13.1          
## [63] pillar_1.9.0         tensorA_0.36.2.1    
## [65] checkmate_2.2.0      DT_0.28             
## [67] stats4_4.3.0         shinyjs_2.1.0       
## [69] distributional_0.3.2 generics_0.1.3      
## [71] hms_1.1.3            rstantools_2.3.1.1  
## [73] munsell_0.5.0        scales_1.2.1        
## [75] minqa_1.2.5          gtools_3.9.4        
## [77] xtable_1.8-4         glue_1.6.2          
## [79] tools_4.3.0          shinystan_2.6.0     
## [81] lme4_1.1-33          colourpicker_1.3.0  
## [83] QuickJSR_1.0.9       crosstalk_1.2.0     
## [85] colorspace_2.1-0     nlme_3.1-162        
## [87] cli_3.6.1            fansi_1.0.4         
## [89] V8_4.3.0             gtable_0.3.3        
## [91] digest_0.6.31        farver_2.1.1        
## [93] htmlwidgets_1.6.2    htmltools_0.5.5     
## [95] lifecycle_1.0.3      mime_0.12           
## [97] shinythemes_1.2.0    MASS_7.3-58.4