"Scaling Notebooks as Re-configurable Cloud Workflows"
Yuandou Wang, Spiros Koulouzis, Riccardo Bianchi, N. Li, Yifang Shi, J. Timmermans, W. Kissling, Zhiming Zhao
Data Intelligence 4(1): 409–425 (published 2022-04-01). DOI: https://doi.org/10.1162/dint_a_00140

Abstract: Literate computing environments such as Jupyter (i.e., Jupyter Notebook, JupyterLab, and JupyterHub) are widely used in scientific studies; they allow users to interactively develop scientific code, test algorithms, and describe the scientific narrative of an experiment in one integrated document. To scale up scientific analyses, many Jupyter environment architectures encapsulate whole notebooks as reproducible units and autoscale them on dedicated remote infrastructures (e.g., high-performance computing and cloud computing environments). Existing solutions are still limited in several ways: 1) the workflow (or pipeline) is implicit in a notebook, and although some steps are generic enough to be reused by different code and executed in parallel, the rigid cell structure forces all steps in a Jupyter notebook to run sequentially and prevents flexible reuse of the core code fragments; and 2) performance bottlenecks limit parallelism and scalability when handling extensive input data and complex computation. In this work, we focus on managing the workflow in a notebook seamlessly. We 1) encapsulate the reusable cells as RESTful services and containerize them as portable components, 2) provide a composition tool for describing the workflow logic of those reusable components, and 3) automate execution on remote cloud infrastructure. Empirically, we validate the solution's usability via a use case from the ecology and Earth science domain, illustrating the processing of massive Light Detection and Ranging (LiDAR) data. The demonstration and analysis show that our method is feasible, but it needs further improvement, especially in integrating distributed workflow scheduling, automatic deployment, and execution, before it matures into a complete approach.
"Analysis of Pioneering Computable Biomedical Knowledge Repositories and their Emerging Governance Structures"
P. Amara, M. Conte, Allen J. Flynn, Jodyn E. Platt, Grace Trinidad
Data Intelligence 4(1): 653–670 (published 2022-03-14). DOI: https://doi.org/10.1162/dint_a_00148

Abstract: A growing interest in producing and sharing computable biomedical knowledge artifacts (CBKs) is increasing the demand for repositories that validate, catalog, and provide shared access to CBKs. However, there is a lack of evidence on how best to manage and sustain CBK repositories. In this paper, we present the results of interviews with several pioneering CBK repository owners. These interviews were informed by the Trusted Repositories Audit and Certification (TRAC) framework. Insights gained from these interviews suggest that the organizations operating CBK repositories are somewhat new, that their initial approaches to repository governance are informal, and that achieving economic sustainability for their CBK repositories is a major challenge. To enable a learning health system to make better use of its data intelligence, future approaches to CBK repository management will require enhanced governance and closer adherence to best-practice frameworks to meet the needs of myriad biomedical science and health communities. More effort is needed to find sustainable funding models for accessible CBK artifact collections.
{"title":"Canonical Workflows in Simulation-based Climate Sciences","authors":"I. Anders, Karsten Peters-von Gehlen, H. Thiemann","doi":"10.1162/dint_a_00127","DOIUrl":"https://doi.org/10.1162/dint_a_00127","url":null,"abstract":"Abstract In this paper we present the derivation of Canonical Workflow Modules from current workflows in simulation-based climate science in support of the elaboration of a corresponding framework for simulation-based research. We first identified the different users and user groups in simulation-based climate science based on their reasons for using the resources provided at the German Climate Computing Center (DKRZ). What is special about this is that the DKRZ provides the climate science community with resources like high performance computing (HPC), data storage and specialised services, and hosts the World Data Center for Climate (WDCC). Therefore, users can perform their entire research workflows up to the publication of the data on the same infrastructure. Our analysis shows, that the resources are used by two primary user types: those who require the HPC-system to perform resource intensive simulations to subsequently analyse them and those who reuse, build-on and analyse existing data. We then further subdivided these top-level user categories based on their specific goals and analysed their typical, idealised workflows applied to achieve the respective project goals. We find that due to the subdivision and further granulation of the user groups, the workflows show apparent differences. Nevertheless, similar “Canonical Workflow Modules” can be clearly made out. These modules are “Data and Software (Re)use”, “Compute”, “Data and Software Storing”, “Data and Software Publication”, “Generating Knowledge” and in their entirety form the basis for a Canonical Workflow Framework for Research (CWFR). It is desirable that parts of the workflows in a CWFR act as FDOs, but we view this aspect critically. Also, we reflect on the question whether the derivation of Canonical Workflow modules from the analysis of current user behaviour still holds for future systems and work processes.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"4 1","pages":"212-225"},"PeriodicalIF":3.9,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44864013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Reproducible Research Publication Workflow: A Canonical Workflow Framework and FAIR Digital Object Approach to Quality Research Output"
Limor Peer, Claudia Biniossek, Dirk Betz, Thu-Mai Christian
Data Intelligence 4(1): 306–319 (published 2022-03-07). DOI: https://doi.org/10.1162/dint_a_00133

Abstract: In this paper we present the Reproducible Research Publication Workflow (RRPW) as an example of how generic canonical workflows can be applied to a specific context. The RRPW includes the essential steps between submission and final publication of the manuscript and the research artefacts (i.e., data, code, etc.) that underlie the scholarly claims in the manuscript. A key aspect of the RRPW is the inclusion of artefact review and metadata creation as part of the publication workflow. The paper discusses a formalized technical structure around a set of canonical steps which helps codify and standardize the process for researchers, curators, and publishers. The proposed application of canonical workflows can help achieve the goals of improved transparency and reproducibility, increase FAIR compliance of all research artefacts at all steps, and facilitate better exchange of annotated and machine-readable metadata.
{"title":"Using a Workflow Management Platform in Textual Data Management","authors":"T. Doan, S. Bingert, R. Yahyapour","doi":"10.1162/dint_a_00139","DOIUrl":"https://doi.org/10.1162/dint_a_00139","url":null,"abstract":"Abstract The paper gives a brief introduction about the workflow management platform, Flowable, and how it is used for textual-data management. It is relatively new with its first release on 13 October, 2016. Despite the short time on the market, it seems to be quickly well-noticed with 4.6 thousand stars on GitHub at the moment. The focus of our project is to build a platform for text analysis on a large scale by including many different text resources. Currently, we have successfully connected to four different text resources and obtained more than one million works. Some resources are dynamic, which means that they might add more data or modify their current data. Therefore, it is necessary to keep data, both the metadata and the raw data, from our side up to date with the resources. In addition, to comply with FAIR principles, each work is assigned a persistent identifier (PID) and indexed for searching purposes. In the last step, we perform some standard analyses on the data to enhance our search engine and to generate a knowledge graph. End-users can utilize our platform to search on our data or get access to the knowledge graph. Furthermore, they can submit their code for their analyses to the system. The code will be executed on a High-Performance Cluster (HPC) and users can receive the results later on. In this case, Flowable can take advantage of PIDs for digital objects identification and management to facilitate the communication with the HPC system. As one may already notice, the whole process can be expressed as a workflow. A workflow, including error handling and notification, has been created and deployed. Workflow execution can be triggered manually or after predefined time intervals. According to our evaluation, the Flowable platform proves to be powerful and flexible. Further usage of the platform is already planned or implemented for many of our projects.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"4 1","pages":"398-408"},"PeriodicalIF":3.9,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44504533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Canonical Workflows to Make Data FAIR"
P. Wittenburg, A. Hardisty, Yann Le Franc, A. Mozaffari, Limor Peer, N. Skvortsov, Zhiming Zhao, A. Spinuso
Data Intelligence 4(1): 286–305 (published 2022-03-07). DOI: https://doi.org/10.1162/dint_a_00132

Abstract: The FAIR principles have been accepted globally as guidelines for improving data-driven science and data management practices, yet the incentives for researchers to change their practices are presently weak. In addition, data-driven science has been slow to embrace workflow technology despite clear evidence of recurring practices. To overcome these challenges, the Canonical Workflow Frameworks for Research (CWFR) initiative suggests a large-scale introduction of self-documenting workflow scripts to automate recurring processes or fragments thereof. This standardised approach, with FAIR Digital Objects as anchors, will be a significant milestone in the transition to FAIR data without adding additional load onto the researchers who stand to benefit most from it. This paper describes the CWFR approach and the activities of the CWFR initiative over the course of the last year or so, highlights several projects that hold promise for the CWFR approaches, including Galaxy, Jupyter Notebook, and RO-Crate, and concludes with an assessment of the state of the field and the challenges ahead.
{"title":"S-ProvFlow. Storing and Exploring Lineage Data as a Service","authors":"A. Spinuso, M. Atkinson, F. Magnoni","doi":"10.1162/dint_a_00128","DOIUrl":"https://doi.org/10.1162/dint_a_00128","url":null,"abstract":"Abstract We present a set of configurable Web service and interactive tools, s-ProvFlow, for managing and exploiting records tracking data lineage during workflow runs. It facilitates detailed analysis of single executions. It helps users manage complex tasks by exposing the relationships between data, people, equipment and workflow runs intended to combine productively. Its logical model extends the PROV standard to precisely record parallel data-streaming applications. Its metadata handling encourages users to capture the application context by specifying how application attributes, often using standard vocabularies, should be added. These metadata records immediately help productivity as the interactive tools support their use in selection and bulk operations. Users rapidly appreciate the power of the encoded semantics as they reap the benefits. This improves the quality of provenance for users and management. Which in turn facilitates analysis of collections of runs, enabling users to manage results and validate procedures. It fosters reuse of data and methods and facilitates diagnostic investigations and optimisations. We present S-ProvFlow's use by scientists, research engineers and managers as part of the DARE hyper-platform as they create, validate and use their data-driven scientific workflows.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"4 1","pages":"226-242"},"PeriodicalIF":3.9,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41411525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"The Specimen Data Refinery: A Canonical Workflow Framework and FAIR Digital Object Approach to Speeding up Digital Mobilisation of Natural History Collections"
A. Hardisty, P. Brack, C. Goble, Laurence Livermore, Ben Scott, Q. Groom, S. Owen, S. Soiland-Reyes
Data Intelligence 4(1): 320–341 (published 2022-03-07). DOI: https://doi.org/10.1162/dint_a_00134

Abstract: A key limiting factor in organising and using information from physical specimens curated in natural science collections is making that information computable, with institutional digitization tending to focus more on imaging the specimens themselves than on efficiently capturing computable data about them. Label data are still transcribed manually, at high cost and low throughput, putting the task beyond reach for many collection-holding institutions at current funding levels. We show how computer vision, optical character recognition, handwriting recognition, named entity recognition, and language translation technologies can be implemented as canonical workflow component libraries with findable, accessible, interoperable, and reusable (FAIR) characteristics. These libraries are being developed in a cloud-based workflow platform, the 'Specimen Data Refinery' (SDR), founded on the Galaxy workflow engine, the Common Workflow Language, Research Object Crates (RO-Crate), and WorkflowHub technologies. The SDR can be applied to specimens' labels and other artefacts, offering the prospect of greatly accelerated and more accurate data capture in computable form. Two kinds of FAIR Digital Object (FDO) are created by packaging the outputs of SDR workflows and workflow components as digital objects with metadata, a persistent identifier, and a specific type definition. The first kind of FDO is the computable Digital Specimen (DS) object, which can be consumed and produced by workflows and other applications. A single DS is the input data structure submitted to a workflow; each workflow component modifies it in turn to produce a refined DS at the end. The Specimen Data Refinery provides a library of such components that can be used individually or in series. To co-function, each library component describes the fields it requires from the DS and the fields it will in turn populate or enrich. The second kind of FDO, the RO-Crate, gathers and archives the diverse set of digital and real-world resources, configurations, and actions (the provenance) contributing to a unit of research work, allowing that work to be faithfully recorded and reproduced. Here we describe the Specimen Data Refinery and its motivating requirements, focusing on what is essential in the creation of canonical workflow component libraries and on its conformance with the requirements of the emerging FDO Core Specification being developed by the FDO Forum.
"HPC-oriented Canonical Workflows for Machine Learning Applications in Climate and Weather Prediction"
A. Mozaffari, M. Langguth, Bing Gong, Jessica Ahring, Adrian Rojas Campos, Pascal Nieters, Otoniel José Campos Escobar, M. Wittenbrink, P. Baumann, M. Schultz
Data Intelligence 4(1): 271–285 (published 2022-03-07). DOI: https://doi.org/10.1162/dint_a_00131

Abstract: Machine learning (ML) applications in weather and climate are gaining momentum as big data and the immense increase in high-performance computing (HPC) power pave the way. Ensuring FAIR data and reproducible ML practices are significant challenges for Earth system researchers. Even though the FAIR principles are well known to many scientists, research communities have been slow to adopt them. The Canonical Workflow Framework for Research (CWFR) provides a platform to ensure the FAIRness and reproducibility of these practices without overwhelming researchers. This conceptual paper envisions a holistic CWFR approach to ML applications in weather and climate, focusing on HPC and big data. Specifically, we discuss FAIR Digital Objects (FDO) and Research Objects (RO) in the DeepRain project to achieve granular reproducibility. DeepRain is a project that aims to improve precipitation forecasting in Germany by using ML. Our concept envisages a raster datacube to provide data harmonization and fast, scalable data access. We suggest the Jupyter notebook as the unit of a single reproducible experiment. In addition, we envision JupyterHub as a scalable and distributed central platform that connects all these elements and the HPC resources to researchers via an easy-to-use graphical interface.
"A Workflow Demonstrator for Processing Catalysis Research Data"
A. N. L. Hidalga, Donato Decarolis, Shaojun Xu, S. Matam, Willinton Yesid Hernández Enciso, Joseph B. Goodall, B. Matthews, C. Catlow
Data Intelligence 4(1): 455–470 (published 2022-03-07). DOI: https://doi.org/10.1162/dint_a_00143

Abstract: The UK Catalysis Hub (UKCH) is designing a virtual research environment to support data processing and analysis, the Catalysis Research Workbench (CRW). Developing this platform requires identifying the processing and analysis needs of UKCH members and mapping them to potential solutions. This paper presents a proposal for a demonstrator to analyse the use of scientific workflows for large-scale data processing. The demonstrator provides a concrete target to promote further discussion of the processing and analysis needs of the UKCH community. We discuss the main data-processing requirements elicited, the proposed adaptations that will be incorporated into the design of the CRW, and how to integrate the proposed solutions with existing UKCH practices. The demonstrator has been used in discussions with researchers and in presentations to the UKCH community, generating increased interest and motivating further development.