Exploratory exposure-response analysis

case studies

example code

EDA

workflow

The importance of understanding the scientific context.

Authors

Marc Vandemeulebroecke

Mark Baillie

Alison Margolskee

Baldur Magnusson

Published

December 19, 2023

This example illustrates the importance of understanding the scientific context when exploring data graphically. An exploratory data analysis is more than just “plotting data”; it can lead to a deeper understanding and inform next steps (Gabry et al. 2019; Gelman 2004). However, like an analysis that is poorly thought through, a poorly implemented graph can also deceive.

Consider an inhaled drug intended to improve lung function, with the target site of action in the lung. The drug is also absorbed systemically from the lung. Suppose that the team wants to fine-tune the choice of a recommended dose. A typical starting point for this question is often a plot of the response variable of interest against a summary measure of plasma concentration (e.g. the area under the concentration time curve, AUC). Figure 1 shows such a plot, generated using the default settings of the R package ggplot2.

## Read in data
my_data <- 
  read_csv("../../data/401_case1_PKPDdataset_ard.csv") %>%
  filter(CYCLE == 1)

## Plot response vs exposure
my_data %>%
  ggplot(aes(x = AUC, y = sCHG)) + 
  geom_point() + 
  scale_y_continuous(breaks = seq(-800, 800, 200)) +  
  theme_gray(base_size = 10) +
  labs(x = "RESN", y = "LIDV", title = "")

A scatterplot of response vs. exposure with ‘default’ ggplot theme. It is common during an exploratory data analysis to display variable names directly from source data rather than an informative description. For this example RESN = AUC0-24h (h*ug/mL) and LIDV = FEV1 change from baseline (mL).

In terms of good graphical principles, this plot leaves a fair bit to be desired. Several improvements are warranted, including proper axis scaling, gridlines, annotation, font size, etc. One particularly egregious issue is the lack of care in selecting axis labels, leaving programming labels for the plotted variables (presumably only then to make the effort of explaining them in a caption). An improved version is shown in Figure 2, addressing many of these formatting issues. With an added LOESS smoother (Cleveland 1979), we see a positive non-linear trend, suggesting a shallow sigmoidal exposure-response relationship.

my_data %>%
  ggplot(aes(x = AUC, y = sCHG)) + 
  geom_point(alpha = 0.7) + 
  geom_smooth(method = "loess", colour = "red") +
  scale_x_log10(breaks = lbr, labels = llb) + 
  scale_y_continuous(breaks = seq(-800, 800, 200)) +
  annotation_logticks(sides = "b") +
  labs(
    x = expression(paste("AUC0-24h (h*",mu,"g/mL)", sep = "")),
    y = "FEV1 change from baseline (mL)", 
    title = "Exposure is positively associated with response",
    subtitle = "Loess smoother (95% CI)"
    ) +
  paper_theme()

An improved scatterplot of exposure vs. response, including a LOESS smoothing curve to help visualizing the trend.

It is tempting, especially when presented with a suboptimal graph, to immediately set about fixing the various graphical imperfections and produce a more appropriate and visually appealing version of the same graph. This is an example of selective attention, focusing on the detail but overlooking the higher purpose of the task (i.e. the “why”). Instead, let us now take a step back and revisit this example in the context of the first law of visual communication: have a clear purpose.

Why are we conducting an exposure-response analysis? Recall that the scientific interest is to fine-tune the dose, and that the drug is inhaled and acting locally in the lung. The implicit assumption of an exposure-response analysis is one of causality. Here, however, plasma concentration is unlikely to be on the causal path from dose to response. What would be a better way to address the scientific question of interest?

my_data %>%
  ggplot(aes(x = AUC, y = sCHG, colour = factor(DOSE))) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", se = FALSE) + 
  scale_colour_brewer(palette = "Set2" , name = "Dose (mg)") + 
  scale_x_log10(breaks = lbr, labels = llb) + 
  scale_y_continuous(breaks = seq(-800, 800, 200)) + 
  annotation_logticks(sides = "b") +
  labs(
    x = expression(paste("AUC0-24h (h*", mu, "g/mL)", sep = "")), 
    y = "FEV1 change from baseline (mL)", 
    title = "Exposure is not a better predictor of response than dose") + 
  paper_theme() + 
  theme(
    legend.position = c("right"),
    legend.title = element_text(size = 10)
    )

Visualization of exposure and response within levels of dose. The scatterplot is fundamentally changed by revisiting the question of interest and then applying good graphical principles.

Consider Figure 3, where instead of estimating an overall trend we now look at the trends within dose. Clearly, any apparent trends within dose do not follow a consistent pattern across doses. The only reason why exposure and response appeared associated in the previous two plots is that they share a common cause, namely dose. In other words, dose is a confounder in those plots, and indeed dose is a better predictor of response than systemic concentration. We should build a dose-response model, rather than an exposure-response model, and choose a recommended dose based on this (and any information on safety and tolerability).

sessionInfo()

## R version 4.3.0 (2023-04-21)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.6.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Zurich
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets 
## [6] methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-3 lubridate_1.9.2   
##  [3] forcats_1.0.0      stringr_1.5.0     
##  [5] dplyr_1.1.2        purrr_1.0.1       
##  [7] readr_2.1.4        tidyr_1.3.0       
##  [9] tibble_3.2.1       ggplot2_3.4.2     
## [11] tidyverse_2.0.0   
## 
## loaded via a namespace (and not attached):
##  [1] utf8_1.2.3        generics_0.1.3    lattice_0.21-8   
##  [4] stringi_1.7.12    hms_1.1.3         digest_0.6.31    
##  [7] magrittr_2.0.3    evaluate_0.21     grid_4.3.0       
## [10] timechange_0.2.0  fastmap_1.1.1     Matrix_1.5-4     
## [13] jsonlite_1.8.4    mgcv_1.8-42       fansi_1.0.4      
## [16] scales_1.2.1      cli_3.6.1         rlang_1.1.1      
## [19] crayon_1.5.2      splines_4.3.0     bit64_4.0.5      
## [22] munsell_0.5.0     withr_2.5.0       yaml_2.3.7       
## [25] tools_4.3.0       parallel_4.3.0    tzdb_0.4.0       
## [28] colorspace_2.1-0  vctrs_0.6.2       R6_2.5.1         
## [31] lifecycle_1.0.3   htmlwidgets_1.6.2 bit_4.0.5        
## [34] vroom_1.6.3       pkgconfig_2.0.3   pillar_1.9.0     
## [37] gtable_0.3.3      glue_1.6.2        xfun_0.39        
## [40] tidyselect_1.2.0  rstudioapi_0.14   knitr_1.43       
## [43] farver_2.1.1      nlme_3.1-162      htmltools_0.5.5  
## [46] rmarkdown_2.22    labeling_0.4.2    compiler_4.3.0

References

Cleveland, William S. 1979. “Robust Locally Weighted Regression and Smoothing Scatterplots.” Journal of the American Statistical Association 74 (368): 829–36. https://doi.org/10.1080/01621459.1979.10481038.

Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. 2019. “Visualization in Bayesian Workflow.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 182 (2): 389–402. https://doi.org/10.1111/rssa.12378.

Gelman, Andrew. 2004. “Exploratory Data Analysis for Complex Models.” Journal of Computational and Graphical Statistics 13 (4): 755–79. https://doi.org/10.1198/106186004X11435.