Automating Reproducible, Collaborative Clinical Trial Document Generation with the listdown Package

R J. Pub Date : 2021-01-01 DOI:10.32614/rj-2021-051

M. Kane, Xun Jiang, Simon Urbanek

Abstract

The conveyance of clinical trial explorations and analysis results from a statistician to a clinical investigator is a critical component of the drug development and clinical research cycle. Automating the generation of documents for data descriptions, summaries, exploration, and analysis allows statisticians to provide a more comprehensive view of the information captured by a clinical trial. Generating these documents efficiently also allows the statistician to focus more on the conceptual development of a trial or trial analysis and less on implementing the summaries and results on which decisions are made. This paper explores the use of the listdown package for automating reproducible documents in clinical trials that facilitate collaboration between statisticians and clinicians, and for defining an analysis pipeline for document generation.

Background and Introduction

The conveyance of clinical trial explorations and analysis results from a statistician to a clinical investigator is an often overlooked but critical component of the drug development and clinical research cycle. Graphs, tables, and other analysis artifacts are at the nexus of these collaborations. They facilitate identifying problems and bugs in the data preparation and processing stage, they help to build an intuitive understanding of mechanisms of disease and their treatment, they elucidate prognostic and predictive relationships, they provide insight that results in new hypotheses, and they convince researchers of analyses testing hypotheses. Despite their importance, these artifacts are usually generated in an ad hoc manner. This is partially because of the nuance and diversity of the hypotheses and scientific questions being interrogated and, to a lesser degree, the variation in clinical data formatting. The usual process has a statistician providing a standard set of artifacts, receiving feedback, and providing updates based on that feedback.
Work performed for one trial is rarely leveraged on others and, as a result, a large amount of work needs to be reproduced for each trial. There are two glaring problems with this approach. First, each analysis of a trial requires a substantial amount of error-prone work. While the variation between trials means some work needs to be done for preparation, exploration, and analysis, many aspects of these trials could be better automated, resulting in greater efficiency and accuracy. Second, because this work is challenging, it often occupies the majority of the statistician's effort. Less time is spent on trial design and analysis, and this portion is taken up by a clinician who often has less expertise with the statistical aspects of the trial. As a result, the extra effort spent on processing data undermines the statistician's role as a collaborator and relegates them to service provider. New tools that leverage existing work to provide holistic views of trials more efficiently will result in less effort and in more accurate and comprehensive trial design and analysis.

The richness of the R (R Core Team, 2012) package ecosystem, particularly its emphasis on analysis, visualization, reproducibility, and dissemination, makes the goal of creating these tools for clinical trials feasible. Generation of tables is supported by packages including tableone (Yoshida and Bartel, 2020), gt (Iannone et al., 2020), and gtsummary (Sjoberg et al., 2020). Visualization is achieved using packages including ggplot2 (Wickham, 2016) and survminer (Kassambara et al., 2020). We can even provide interactive presentations of data with DT (Xie et al., 2020), plotly (Sievert, 2020), and trelliscopejs (Hafen and Schloerke, 2020). Work building on these tools for clinical trial data is already in progress.
The greport (Harrell Jr, 2020) package provides graphical summaries for clinical trials and has been used in conjunction with rmarkdown (Allaire et al., 2020) to produce specific trial report types with a specified format.

The R Journal Vol. XX/YY, AAAA ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLE 2

Using listdown for programmatic, collaborative clinical trial document generation

The listdown package (Kane et al., 2020) was recently released to automate the process of generating reproducible (RMarkdown) documents. Objects derived from a summary, exploration, or analysis are stored hierarchically in an R list, which defines the structure of the document. These objects are referred to as computational components since they are derived from computation, as opposed to prose, which makes up the narrative components of a document. The computational components capture and structure the objects to be presented. How the objects will be presented and how the document will be rendered are described by creating a listdown object.

The separation between how computational components are created and how they are shown to a user provides two advantages. First, it decouples the data processing and analysis from its exploration and visualization. For compute-intensive analyses, this separation is critical for avoiding redundant computations when the presentation changes slightly. It also discourages putting compute-intensive code into RMarkdown documents. Second, it provides the flexibility to quickly change how a computational component is visualized or summarized, or even how a document is rendered. This makes transitioning from an interactive .html document to a static .pdf document significantly easier than substituting functions and parameters in an RMarkdown document.

The package has been found to be particularly useful in the reporting and research of clinical trial data.
In particular, the package has been used for several collaborations focusing either on the analysis of past trial data to formulate a new trial or on trial monitoring, where trial telemetry (enrollment, responses, etc.) is reported and initial analyses are conveyed to a clinician. The associated presentations require very little context since clinicians often understand the collected data as well as the statistician does, meaning narrative components are not needed. At the same time, a large number of hierarchical, heterogeneous artifacts (tables and multiple types of plots) can be automated where manual creation of RMarkdown documents would be inconvenient and inefficient.

The rest of this document describes concepts implemented in the listdown package for automated, reproducible document generation and shows its use with a simplified, synthetic clinical trial data set whose variables are typical of a non-small cell lung cancer trial. The data set comes from the forceps (Kane, 2020) package. At the time this document was written, the package was under development and not available on CRAN. However, it can be installed as follows.

    devtools::install_github("kaneplusplus/forceps")

The following section uses the trial data to construct a pipeline for document generation. We note that both the data and the pipeline are simple when compared to most analyses of this type. However, they are sufficient to illustrate the accompanying concepts, and both the analyses and the concepts translate readily to real-world applications. A final section discusses the use of the package and its current direction.

Constructing a pipeline for document generation

The process of analyzing data can be described using the classic waterfall model of Benington (1983), where the output (the analysis presentation or service) is dependent on a sequence of tasks that come before it.
This dependency structure means that if a problem is detected in a given stage of the production of the analysis, all downstream parts must be rerun to reflect the change. A graphical depiction of the waterfall model, specific to data analyses (clinical or otherwise), is shown in Figure 1. Note that data exploration and visualization are an integral part of all stages of the production and are often the means for identifying issues and refining analyses.

As explained in the previous section, we are going to implement a simple analysis pipeline. The data acquisition and preprocessing steps are handled by importing data sets from the forceps package and using some of the functions implemented in the package to create a single trial data set, thereby de-emphasizing these components in the pipeline. While these steps are critical, the emphasis of this paper is the incorporation of the listdown package into the later stages.

Data acquisition and preprocessing

Data acquisition refers to the portion of the analysis pipeline where the data is retrieved from some managed data store for integration into the pipeline. These data sets may be retrieved as tables from a database, case reports, Analysis Data Model (ADaM) data formatted according to the Clinical Data Interchange Standards Consortium (CDISC) (CDI, 2020), Electronic Health Records, or other clinical Real World Data (RWD) formats. These data are then transformed to a format appropriate for analysis.

Figure 1: The data analysis waterfall.

In our simple example, this is accomplished by loading data corresponding to trial outcomes, patient adverse events, patient biomarkers, and patient demography and transforming them to a single data set with one row per patient and one variable per column using the forceps and dplyr (Wickham et al., 2020) packages.
The data also includes longitudinal adverse event information, which is stored as a nested data frame in the ae_long column of the resulting data set.

    library(forceps)
    library(dplyr)

    data(lc_adsl, lc_adverse_events, lc_biomarkers, lc_demography)

    lc_trial <- consolidate(
      list(adsl = lc_adsl,
           adverse_events = lc_adverse_events %>%
             cohort(on = "usubjid", name = "ae_long"),
           biomarkers = lc_biomarkers,
           demography = lc_demography %>%
             select(-chemo_stop)
      ),
      on =
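
The separation described earlier, between creating computational components and deciding how they are shown, can be illustrated with a minimal base R sketch. It deliberately does not use the listdown API. The data frame `trial`, the list `comp_comp`, the `decorators` list, and the helper `render_outline()` are all hypothetical names introduced here for illustration, and the data values are made up. The sketch only shows the idea that a named, nested list can define a document outline while presentation rules live elsewhere.

```r
# Hypothetical synthetic data standing in for a trial data set.
trial <- data.frame(
  arm = c("treatment", "control", "treatment", "control"),
  age = c(61, 58, 70, 66),
  best_response = c("PR", "SD", "CR", "PD"),
  stringsAsFactors = FALSE
)

# Computational components: a named, nested list whose shape is the
# outline of the document (section -> subsection -> object).
comp_comp <- list(
  Demography = list(
    `Age by arm` = aggregate(age ~ arm, data = trial, FUN = mean)
  ),
  Response = list(
    `Best response by arm` = table(trial$arm, trial$best_response)
  )
)

# Presentation rules, kept apart from the computation: one function per
# class of component. Swapping these changes the output, not the analysis.
decorators <- list(
  data.frame = function(x) capture.output(print(x, row.names = FALSE)),
  table = function(x) capture.output(print(x))
)

# Walk the list, emitting a heading for each name and the decorated
# output for each leaf object.
render_outline <- function(x, depth = 1) {
  out <- character()
  for (nm in names(x)) {
    out <- c(out, paste(strrep("#", depth), nm))
    item <- x[[nm]]
    if (is.list(item) && !is.data.frame(item)) {
      out <- c(out, render_outline(item, depth + 1))
    } else {
      decorate <- decorators[[class(item)[1]]]
      # Fall back to plain printing when no rule matches the class.
      if (is.null(decorate)) decorate <- function(x) capture.output(print(x))
      out <- c(out, decorate(item))
    }
  }
  out
}

doc <- render_outline(comp_comp)
cat(doc, sep = "\n")
```

Because `comp_comp` is computed once and `decorators` is a separate object, changing how a table is displayed means editing only `decorators` and calling `render_outline()` again, without rerunning the analysis that produced the components.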
