{"title":"Automating Reproducible, Collaborative Clinical Trial Document Generation with the listdown Package","authors":"M. Kane, Xun Jiang, Simon Urbanek","doi":"10.32614/rj-2021-051","DOIUrl":null,"url":null,"abstract":"The conveyance of clinical trial explorations and analysis results from a statistician to a clinical investigator is a critical component to the drug development and clinical research cycle. Automating the process of generating documents for data descriptions, summaries, exploration, and analysis allows statistician to provide a more comprehensive view of the information captured by a clinical trial and efficient generation of these documents allows the statistican to focus more on the conceptual development of a trial or trial analysis and less on the implementation of the summaries and results on which decisions are made. This paper explores the use of the listdown package for automating reproducible documents in clinical trials that facilitate the collaboration between statisticians and clinicians as well as defining an analysis pipeline for document generation. Background and Introduction The conveyance of clinical trial explorations and analysis results from a statistician to a clinical investigator is an often overlooked but critical component to the drug development and clinical research cycle. Graphs, tables, and other analysis artifacts are at the nexus of these collaborations. They facilitate identifying problems and bugs in the data preparation and processing stage, they help to build an intuitive understanding of mechanisms of disease and their treatment, they elucidate prognostic and predictive relationships, they provide insight that results in new hypotheses, and they convince researchers of analyses testing hypotheses. Despite their importance, the process of generating these artifacts is usually done in an ad-hoc manner. This is partially because of the nuance and diversity of the hypotheses and scientific questions being interrogated and, to a lesser degree, the variation in clinical data formatting. The usual process usually has a statistician providing a standard set of artifacts, receiving feedback, and providing an updates based on feedback. Work performed for one trial is rarely leveraged on others and as a result, a large amount of work needs to be reproduced for each trial. There are two glaring problems with this approach. First, each analysis of a trial requires a substantial amount of error-prone work. While the variation between trials means some work needs to be done for preparation, exploration, and analysis, there are many aspects of these trials that could be better automated resulting in greater efficiency and accuracy. Second, because this work is challenging, it often occupies the majority of the statisticians effort. Less time is spent on trial design and analysis and the this portion is taken up by a clinician who often has less expertise with the statistical aspects of the trial. As a result, the extra effort spent on processing data undermines statisticians role as a collaborator and relegates them to service provider. Need tools leveraging existing work to more efficiently provide holistic views on trials will result in less effort and more accurate and comprehensive trial design and analysis. The richness of R Core Team (2012)’s package ecosystem, particularly with its emphasis on analysis, visualization, reproducibility, and dissemination makes the goal of creating these tools for clinical trials feasible. Generation of tables is supported by packages including tableone (Yoshida and Bartel, 2020), gt (Iannone et al., 2020), gtsummary (Sjoberg et al., 2020). Visualization is achieved using package including ggplot2 (Wickham, 2016) and survminer (Kassambara et al., 2020). We can even provide interactive presentations of data with DT (Xie et al., 2020), plotly (Sievert, 2020), and trelliscopejs (Hafen and Schloerke, 2020). It should also be realized that work building on these tools for clinical trial data is already in process. The greport (Harrell Jr, 2020) package provides graphical summaries for clinical trials and has been used in conjunction with rmarkdown (Allaire et al., 2020) to produce specific trial report types with a specified format. The R Journal Vol. XX/YY, AAAA ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLE 2 Using listdown for programmatic, collaborative clinical trial document generation The listdown package (Kane et al., 2020) was recently released to automate the process of generating reproducible (RMarkdown) documents. Objects derived from a summary, exploration, or analysis are stored hierarchically in an R list, which defines the structure of the document. These objects are referred to as computational components since they are derived from computation, as opposed to prose, which makes up the narrative components of a document. The computational components capture and structure the objects to be presented. Describing how the objects will be presented and how the document will rendered is handled through the creation of a listdown object. The separation between how computational components are created and how they are shown to a user provides two advantages. First, it decouples the data processing and analysis from its exploration and visualization. For compute-intensive analyses this separation is critical for avoiding redundant computations for small changes in the presentation. It also discourages putting compute-intensive code into RMarkdown documents. Second, it provides the flexibility to quickly change how a computational component is visualized or summarized or even how a document is rendered. This makes transitioning from an interactive .html document to a static .pdf document significantly easier than substituting functions and parameters in an R Mardown document. The package has been found to be particularly useful in the reporting and research of clinical trial data. In particular, the package has been used for server collaborations focusing on either the analysis past trial data to formulate a new trial or in trial monitoring where trial telemetry (enrollment, responses, etc.) is reported and initial analyses are conveyed to a clinician. The associated presentations require very little context since clinicians often have as good an understanding of the data collected as that of the statistician’s meaning narrative components are not needed. At the same time, a large number of hierarchical, heterogeneous artifacts (tables and multiple types of plots) can be automated where manual creation of RMarkdown documents would be inconvenient and inefficient. The rest of this document describes concepts implemented in the listdown package for automated, reproducible document generation and shows its use with a simplified, synthetic clinical trial data set whose variables are typical of a non-small cell lung cancer trial. The data set comes from the forceps (Kane, 2020) package. As of the time this document was written, the package is under development and is not available on CRAN. However, it can be installed as follows. devtools::install_github(\"kaneplusplus/forceps\") The following section uses the trial data to construct a pipeline for document generation. We note that both the data and the pipeline is simple when compared to most analyses of this type. However, it is sufficient to illustrate accompanying concepts and both the analyses and concepts translate readily to real-world applications. A final section discusses the use of the package and its current direction. Constructing a pipeline for document generation The process of analyzing data can be described using the classic water fall model of Benington (1983) where the output (the analysis presentation or service) is dependent on a sequence of tasks that come before it. This dependency structure means that if a problem is detected in a given stage of the production of the analysis, all down-stream parts must be rerun to reflect the change. A graphical depiction of the waterfall model, specific to data analyses (clinical or otherwise) is shown in Figure 1. Note that data exploration and visualization are an integral part of all stages of the production and are often the means for identifying issues and refining analyses. As explained in the previous section, we are going to implement a simple analysis pipeline. The data acquisition and preprocessing steps are handled by importing data sets from the forceps package and using some of the functions implemented in the package to create a single trial data set thereby de-emphasizing these components in the pipeline. While these steps are critical, the emphasis of this paper is the incorporation of the listdown package into the later stages. Data acquisision and preprocessing Data acquisition refers to the portion of the analysis pipeline where the data is retrieved from some managed data store for integration into the pipeline. These data sets may be retrieved as tables from a database, case reports, Analysis Data Model (ADaM) data formatted according to the Clinical Data Interchange Standards Consortium (CDISC) (CDI, 2020), Electronic Health Records, or other clinical Real World Data (RWD) formats. These data are then transformed to a format appropriate for analysis. The R Journal Vol. XX/YY, AAAA ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLE 3 Figure 1: The data analysis waterfall. In our simple example, this is accomplished by loading data corresponding to trial outcomes, patient adverse events, patient biomarkers, and patient demography and transforming them to a single data set with one row per patient and one variable per column using the forceps and dplyr (Wickham et al., 2020) packages. The data also includes longitudinal adverse event information, which will is stored as a nested data frame in the ae_long column of the resulting data set. library(forceps) library(dplyr) data(lc_adsl, lc_adverse_events, lc_biomarkers, lc_demography) lc_trial <consolidate( list(adsl = lc_adsl, adverse_events = lc_adverse_events %>% cohort(on = \"usubjid\", name = \"ae_long\"), biomarkers = lc_biomarkers, demography = lc_demography %>% select(-chemo_stop) ), on =","PeriodicalId":20974,"journal":{"name":"R J.","volume":"1 1","pages":"556"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"R J.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.32614/rj-2021-051","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
The conveyance of clinical trial explorations and analysis results from a statistician to a clinical investigator is a critical component to the drug development and clinical research cycle. Automating the process of generating documents for data descriptions, summaries, exploration, and analysis allows statistician to provide a more comprehensive view of the information captured by a clinical trial and efficient generation of these documents allows the statistican to focus more on the conceptual development of a trial or trial analysis and less on the implementation of the summaries and results on which decisions are made. This paper explores the use of the listdown package for automating reproducible documents in clinical trials that facilitate the collaboration between statisticians and clinicians as well as defining an analysis pipeline for document generation. Background and Introduction The conveyance of clinical trial explorations and analysis results from a statistician to a clinical investigator is an often overlooked but critical component to the drug development and clinical research cycle. Graphs, tables, and other analysis artifacts are at the nexus of these collaborations. They facilitate identifying problems and bugs in the data preparation and processing stage, they help to build an intuitive understanding of mechanisms of disease and their treatment, they elucidate prognostic and predictive relationships, they provide insight that results in new hypotheses, and they convince researchers of analyses testing hypotheses. Despite their importance, the process of generating these artifacts is usually done in an ad-hoc manner. This is partially because of the nuance and diversity of the hypotheses and scientific questions being interrogated and, to a lesser degree, the variation in clinical data formatting. The usual process usually has a statistician providing a standard set of artifacts, receiving feedback, and providing an updates based on feedback. Work performed for one trial is rarely leveraged on others and as a result, a large amount of work needs to be reproduced for each trial. There are two glaring problems with this approach. First, each analysis of a trial requires a substantial amount of error-prone work. While the variation between trials means some work needs to be done for preparation, exploration, and analysis, there are many aspects of these trials that could be better automated resulting in greater efficiency and accuracy. Second, because this work is challenging, it often occupies the majority of the statisticians effort. Less time is spent on trial design and analysis and the this portion is taken up by a clinician who often has less expertise with the statistical aspects of the trial. As a result, the extra effort spent on processing data undermines statisticians role as a collaborator and relegates them to service provider. Need tools leveraging existing work to more efficiently provide holistic views on trials will result in less effort and more accurate and comprehensive trial design and analysis. The richness of R Core Team (2012)’s package ecosystem, particularly with its emphasis on analysis, visualization, reproducibility, and dissemination makes the goal of creating these tools for clinical trials feasible. Generation of tables is supported by packages including tableone (Yoshida and Bartel, 2020), gt (Iannone et al., 2020), gtsummary (Sjoberg et al., 2020). Visualization is achieved using package including ggplot2 (Wickham, 2016) and survminer (Kassambara et al., 2020). We can even provide interactive presentations of data with DT (Xie et al., 2020), plotly (Sievert, 2020), and trelliscopejs (Hafen and Schloerke, 2020). It should also be realized that work building on these tools for clinical trial data is already in process. The greport (Harrell Jr, 2020) package provides graphical summaries for clinical trials and has been used in conjunction with rmarkdown (Allaire et al., 2020) to produce specific trial report types with a specified format. The R Journal Vol. XX/YY, AAAA ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLE 2 Using listdown for programmatic, collaborative clinical trial document generation The listdown package (Kane et al., 2020) was recently released to automate the process of generating reproducible (RMarkdown) documents. Objects derived from a summary, exploration, or analysis are stored hierarchically in an R list, which defines the structure of the document. These objects are referred to as computational components since they are derived from computation, as opposed to prose, which makes up the narrative components of a document. The computational components capture and structure the objects to be presented. Describing how the objects will be presented and how the document will rendered is handled through the creation of a listdown object. The separation between how computational components are created and how they are shown to a user provides two advantages. First, it decouples the data processing and analysis from its exploration and visualization. For compute-intensive analyses this separation is critical for avoiding redundant computations for small changes in the presentation. It also discourages putting compute-intensive code into RMarkdown documents. Second, it provides the flexibility to quickly change how a computational component is visualized or summarized or even how a document is rendered. This makes transitioning from an interactive .html document to a static .pdf document significantly easier than substituting functions and parameters in an R Mardown document. The package has been found to be particularly useful in the reporting and research of clinical trial data. In particular, the package has been used for server collaborations focusing on either the analysis past trial data to formulate a new trial or in trial monitoring where trial telemetry (enrollment, responses, etc.) is reported and initial analyses are conveyed to a clinician. The associated presentations require very little context since clinicians often have as good an understanding of the data collected as that of the statistician’s meaning narrative components are not needed. At the same time, a large number of hierarchical, heterogeneous artifacts (tables and multiple types of plots) can be automated where manual creation of RMarkdown documents would be inconvenient and inefficient. The rest of this document describes concepts implemented in the listdown package for automated, reproducible document generation and shows its use with a simplified, synthetic clinical trial data set whose variables are typical of a non-small cell lung cancer trial. The data set comes from the forceps (Kane, 2020) package. As of the time this document was written, the package is under development and is not available on CRAN. However, it can be installed as follows. devtools::install_github("kaneplusplus/forceps") The following section uses the trial data to construct a pipeline for document generation. We note that both the data and the pipeline is simple when compared to most analyses of this type. However, it is sufficient to illustrate accompanying concepts and both the analyses and concepts translate readily to real-world applications. A final section discusses the use of the package and its current direction. Constructing a pipeline for document generation The process of analyzing data can be described using the classic water fall model of Benington (1983) where the output (the analysis presentation or service) is dependent on a sequence of tasks that come before it. This dependency structure means that if a problem is detected in a given stage of the production of the analysis, all down-stream parts must be rerun to reflect the change. A graphical depiction of the waterfall model, specific to data analyses (clinical or otherwise) is shown in Figure 1. Note that data exploration and visualization are an integral part of all stages of the production and are often the means for identifying issues and refining analyses. As explained in the previous section, we are going to implement a simple analysis pipeline. The data acquisition and preprocessing steps are handled by importing data sets from the forceps package and using some of the functions implemented in the package to create a single trial data set thereby de-emphasizing these components in the pipeline. While these steps are critical, the emphasis of this paper is the incorporation of the listdown package into the later stages. Data acquisision and preprocessing Data acquisition refers to the portion of the analysis pipeline where the data is retrieved from some managed data store for integration into the pipeline. These data sets may be retrieved as tables from a database, case reports, Analysis Data Model (ADaM) data formatted according to the Clinical Data Interchange Standards Consortium (CDISC) (CDI, 2020), Electronic Health Records, or other clinical Real World Data (RWD) formats. These data are then transformed to a format appropriate for analysis. The R Journal Vol. XX/YY, AAAA ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLE 3 Figure 1: The data analysis waterfall. In our simple example, this is accomplished by loading data corresponding to trial outcomes, patient adverse events, patient biomarkers, and patient demography and transforming them to a single data set with one row per patient and one variable per column using the forceps and dplyr (Wickham et al., 2020) packages. The data also includes longitudinal adverse event information, which will is stored as a nested data frame in the ae_long column of the resulting data set. library(forceps) library(dplyr) data(lc_adsl, lc_adverse_events, lc_biomarkers, lc_demography) lc_trial % cohort(on = "usubjid", name = "ae_long"), biomarkers = lc_biomarkers, demography = lc_demography %>% select(-chemo_stop) ), on =