Beatriz Serrano-Solano, A. Fouilloux, Ignacio Eguinoa, Matúš Kalaš, B. Grüning, Frederik Coppens
{"title":"Galaxy: A Decade of Realising CWFR Concepts","authors":"Beatriz Serrano-Solano, A. Fouilloux, Ignacio Eguinoa, Matúš Kalaš, B. Grüning, Frederik Coppens","doi":"10.1162/dint_a_00136","DOIUrl":"https://doi.org/10.1162/dint_a_00136","url":null,"abstract":"Abstract Despite recent encouragement to follow the FAIR principles, the day-to-day research practices have not changed substantially. Due to new developments and the increasing pressure to apply best practices, initiatives to improve the efficiency and reproducibility of scientific workflows are becoming more prevalent. In this article, we discuss the importance of well-annotated tools and the specific requirements to ensure reproducible research with FAIR outputs. We detail how Galaxy, an open-source workflow management system with a web-based interface, has implemented the concepts that are put forward by the Canonical Workflow Framework for Research (CWFR), whilst minimising changes to the practices of scientific communities. Although we showcase concrete applications from two different domains, this approach is generalisable to any domain and particularly useful in interdisciplinary research and science-based applications.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2022-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49187666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Wittenburg, A. Hardisty, Amirpasha Mozzafari, Limor Peer, N. Skvortsov, A. Spinuso, Zhiming Zhao
{"title":"Editors’ Note: Special Issue on Canonical Workflow Frameworks for Research","authors":"P. Wittenburg, A. Hardisty, Amirpasha Mozzafari, Limor Peer, N. Skvortsov, A. Spinuso, Zhiming Zhao","doi":"10.1162/dint_e_00122","DOIUrl":"https://doi.org/10.1162/dint_e_00122","url":null,"abstract":"1 Gemeindweg 55, 47533 Kleve, Germany; 2 Cardiff University, Cardiff, South Glamorgan, CF14 3UX, Wales, UK; 3 Forschungszentrum Jülich GmbH, 52425 Jülich, Germany; 4 Institution for Social and Policy Studies, Yale University, New Haven, CT 06520, USA; 5 Vavilov 44/2, 121351 Moscow, Russia; 6 Utrechtseweg 297, 3731 GA De Bilt, the Netherlands; 7 University of Amsterdam, PO-Box 94323, 1090 GH Amsterdam, the Netherlands","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2022-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45697513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dirk Betz, Claudia Biniossek, Christophe Blanchi, Felix Henninger, T. Lauer, P. Wieder, P. Wittenburg, M. Zünkeler
{"title":"Canonical Workflow for Experimental Research","authors":"Dirk Betz, Claudia Biniossek, Christophe Blanchi, Felix Henninger, T. Lauer, P. Wieder, P. Wittenburg, M. Zünkeler","doi":"10.1162/dint_a_00123","DOIUrl":"https://doi.org/10.1162/dint_a_00123","url":null,"abstract":"Abstract The overall expectation of introducing Canonical Workflow for Experimental Research and FAIR digital objects (FDOs) can be summarised as reducing the gap between workflow technology and research practices to make experimental work more efficient and improve FAIRness without adding administrative load on the researchers. In this document, we will describe, with the help of an example, how CWFR could work in detail and improve research procedures. We have chosen the example of “experiments with human subjects” which stretches from planning an experiment to storing the collected data in a repository. While we focus on experiments with human subjects, we are convinced that CWFR can be applied to many other data generation processes based on experiments. The main challenge is to identify repeating patterns in existing research practices that can be abstracted to create CWFR. In this document, we will include detailed examples from different disciplines to demonstrate that CWFR can be implemented without violating specific disciplinary or methodological requirements. 
We do not claim to be comprehensive in all aspects, since these examples are meant to prove the concept of CWFR.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2022-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42683678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Canonical Workflow for Machine Learning Tasks","authors":"Christophe Blanchi, B. Gebre, P. Wittenburg","doi":"10.1162/dint_a_00124","DOIUrl":"https://doi.org/10.1162/dint_a_00124","url":null,"abstract":"Abstract There is a huge gap between (1) the state of workflow technology on the one hand and the practices in the many labs working with data driven methods on the other and (2) the awareness of the FAIR principles and the lack of changes in practices during the last 5 years. The CWFR concept has been defined which is meant to combine these two intentions, increasing the use of workflow technology and improving FAIR compliance. In the study described in this paper we indicate how this could be applied to machine learning which is now used by almost all research disciplines with the well-known effects of a huge lack of repeatability and reproducibility. Researchers will only change practices if they can work efficiently and are not loaded with additional tasks. A comprehensive CWFR framework would be an umbrella for all steps that need to be carried out to do machine learning on selected data collections and immediately create a comprehensive and FAIR compliant documentation. The researcher is guided by such a framework and information once entered can easily be shared and reused. The many iterations normally required in machine learning can be dealt with efficiently using CWFR methods. Libraries of components that can be easily orchestrated using FAIR Digital Objects as a common entity to document all actions and to exchange information between steps without the researcher needing to understand anything about PIDs and FDO details is probably the way to increase efficiency in repeating research workflows. As the Galaxy project indicates, the availability of supporting tools will be important to let researchers use these methods. 
In contrast to what the Galaxy framework suggests, however, it would be necessary to include all steps required for a machine learning task, including those that require human interaction, and to document all phases with the help of structured FDOs.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2022-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41320073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Schröder, Eleonora Epp, A. Mozaffari, M. Romberg, Niklas Selke, M. Schultz
{"title":"Enabling Canonical Analysis Workflows Documented Data Harmonization on Global Air Quality Data","authors":"S. Schröder, Eleonora Epp, A. Mozaffari, M. Romberg, Niklas Selke, M. Schultz","doi":"10.1162/dint_a_00130","DOIUrl":"https://doi.org/10.1162/dint_a_00130","url":null,"abstract":"Abstract Data harmonization and documentation of the data processing are essential prerequisites for enabling Canonical Analysis Workflows. The recently revised Terabyte-scale air quality database system, which the Tropospheric Ozone Assessment Report (TOAR) created, contains one of the world's largest collections of near-surface air quality measurements and considers FAIR data principles as an integral part. A special feature of our data service is the on-demand processing and product generation of several air quality metrics directly from the underlying database. In this paper, we show that the necessary data harmonization for establishing such online analysis services goes much deeper than the obvious issues of common data formats, variable names, and measurement units, and we explore how the generation of FAIR Digital Objects (FDO) in combination with automatically generated documentation may support Canonical Analysis Workflows for air quality and related data.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2022-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64531481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuandou Wang, Spiros Koulouzis, Riccardo Bianchi, N. Li, Yifang Shi, J. Timmermans, W. Kissling, Zhiming Zhao
{"title":"Scaling Notebooks as Re-configurable Cloud Workflows","authors":"Yuandou Wang, Spiros Koulouzis, Riccardo Bianchi, N. Li, Yifang Shi, J. Timmermans, W. Kissling, Zhiming Zhao","doi":"10.1162/dint_a_00140","DOIUrl":"https://doi.org/10.1162/dint_a_00140","url":null,"abstract":"Abstract Literate computing environments, such as Jupyter (i.e., Jupyter Notebooks, JupyterLab, and JupyterHub), have been widely used in scientific studies; they allow users to interactively develop scientific code, test algorithms, and describe the scientific narratives of the experiments in an integrated document. To scale up scientific analyses, many implemented Jupyter environment architectures encapsulate whole Jupyter notebooks as reproducible units and autoscale them on dedicated remote infrastructures (e.g., high-performance computing and cloud computing environments). The existing solutions are still limited in many ways, e.g., 1) the workflow (or pipeline) is implicit in a notebook, and some steps can be generically used by different code and executed in parallel, but because of the tight cell structure, all steps in the Jupyter notebook have to be executed sequentially and lack the flexibility to reuse the core code fragments, and 2) there are performance bottlenecks, so parallelism and scalability need to improve when handling extensive input data and complex computation. In this work, we focus on how to manage the workflow in a notebook seamlessly. We 1) encapsulate the reusable cells as RESTful services and containerize them as portal components, 2) provide a composition tool for describing workflow logic of those reusable components, and 3) automate the execution on remote cloud infrastructure. Empirically, we validate the solution's usability via a use case from the Ecology and Earth Science domain, illustrating the processing of massive Light Detection and Ranging (LiDAR) data. 
The demonstration and analysis show that our method is feasible, but that it needs further improvement, especially in integrating distributed workflow scheduling, automatic deployment, and execution, to develop into a mature approach.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2022-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46210347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Amara, M. Conte, Allen J. Flynn, Jodyn E. Platt, Grace Trinidad
{"title":"Analysis of Pioneering Computable Biomedical Knowledge Repositories and their Emerging Governance Structures","authors":"P. Amara, M. Conte, Allen J. Flynn, Jodyn E. Platt, Grace Trinidad","doi":"10.1162/dint_a_00148","DOIUrl":"https://doi.org/10.1162/dint_a_00148","url":null,"abstract":"Abstract A growing interest in producing and sharing computable biomedical knowledge artifacts (CBKs) is increasing the demand for repositories that validate, catalog, and provide shared access to CBKs. However, there is a lack of evidence on how best to manage and sustain CBK repositories. In this paper, we present the results of interviews with several pioneering CBK repository owners. These interviews were informed by the Trusted Repositories Audit and Certification (TRAC) framework. Insights gained from these interviews suggest that the organizations operating CBK repositories are somewhat new, that their initial approaches to repository governance are informal, and that achieving economic sustainability for their CBK repositories is a major challenge. To enable a learning health system to make better use of its data intelligence, future approaches to CBK repository management will require enhanced governance and closer adherence to best practice frameworks to meet the needs of myriad biomedical science and health communities. 
More effort is needed to find sustainable funding models for accessible CBK artifact collections.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2022-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47280853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Canonical Workflows in Simulation-based Climate Sciences","authors":"I. Anders, Karsten Peters-von Gehlen, H. Thiemann","doi":"10.1162/dint_a_00127","DOIUrl":"https://doi.org/10.1162/dint_a_00127","url":null,"abstract":"Abstract In this paper we present the derivation of Canonical Workflow Modules from current workflows in simulation-based climate science in support of the elaboration of a corresponding framework for simulation-based research. We first identified the different users and user groups in simulation-based climate science based on their reasons for using the resources provided at the German Climate Computing Center (DKRZ). What is special about this is that the DKRZ provides the climate science community with resources like high performance computing (HPC), data storage and specialised services, and hosts the World Data Center for Climate (WDCC). Therefore, users can perform their entire research workflows up to the publication of the data on the same infrastructure. Our analysis shows that the resources are used by two primary user types: those who require the HPC system to perform resource-intensive simulations and subsequently analyse them, and those who reuse, build on and analyse existing data. We then further subdivided these top-level user categories based on their specific goals and analysed their typical, idealised workflows applied to achieve the respective project goals. We find that due to the subdivision and further granulation of the user groups, the workflows show apparent differences. Nevertheless, similar “Canonical Workflow Modules” can be clearly made out. These modules are “Data and Software (Re)use”, “Compute”, “Data and Software Storing”, “Data and Software Publication”, “Generating Knowledge” and in their entirety form the basis for a Canonical Workflow Framework for Research (CWFR). It is desirable that parts of the workflows in a CWFR act as FDOs, but we view this aspect critically. 
Also, we reflect on the question whether the derivation of Canonical Workflow modules from the analysis of current user behaviour still holds for future systems and work processes.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44864013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Limor Peer, Claudia Biniossek, Dirk Betz, Thu-Mai Christian
{"title":"Reproducible Research Publication Workflow: A Canonical Workflow Framework and FAIR Digital Object Approach to Quality Research Output","authors":"Limor Peer, Claudia Biniossek, Dirk Betz, Thu-Mai Christian","doi":"10.1162/dint_a_00133","DOIUrl":"https://doi.org/10.1162/dint_a_00133","url":null,"abstract":"Abstract In this paper we present the Reproducible Research Publication Workflow (RRPW) as an example of how generic canonical workflows can be applied to a specific context. The RRPW includes essential steps between submission and final publication of the manuscript and the research artefacts (i.e., data, code, etc.) that underlie the scholarly claims in the manuscript. A key aspect of the RRPW is the inclusion of artefact review and metadata creation as part of the publication workflow. The paper discusses a formalized technical structure around a set of canonical steps which helps codify and standardize the process for researchers, curators, and publishers. The proposed application of canonical workflows can help achieve the goals of improved transparency and reproducibility, increase FAIR compliance of all research artefacts at all steps, and facilitate better exchange of annotated and machine-readable metadata.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46094283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using a Workflow Management Platform in Textual Data Management","authors":"T. Doan, S. Bingert, R. Yahyapour","doi":"10.1162/dint_a_00139","DOIUrl":"https://doi.org/10.1162/dint_a_00139","url":null,"abstract":"Abstract The paper gives a brief introduction to the workflow management platform Flowable and how it is used for textual-data management. Flowable is relatively new, with its first release on 13 October 2016. Despite its short time on the market, it has quickly gained attention, with 4.6 thousand stars on GitHub at the moment. The focus of our project is to build a platform for text analysis on a large scale by including many different text resources. Currently, we have successfully connected to four different text resources and obtained more than one million works. Some resources are dynamic, which means that they might add more data or modify their current data. Therefore, it is necessary to keep the data on our side, both the metadata and the raw data, up to date with the resources. In addition, to comply with FAIR principles, each work is assigned a persistent identifier (PID) and indexed for searching purposes. In the last step, we perform some standard analyses on the data to enhance our search engine and to generate a knowledge graph. End-users can utilize our platform to search on our data or get access to the knowledge graph. Furthermore, they can submit their code for their analyses to the system. The code will be executed on a High-Performance Cluster (HPC) and users can receive the results later on. In this case, Flowable can take advantage of PIDs for digital objects identification and management to facilitate the communication with the HPC system. As one may already notice, the whole process can be expressed as a workflow. A workflow, including error handling and notification, has been created and deployed. Workflow execution can be triggered manually or after predefined time intervals. 
According to our evaluation, the Flowable platform proves to be powerful and flexible. Further usage of the platform is already planned or implemented for many of our projects.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44504533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}