Implementing and Managing a Data Curation Workflow in the Cloud

F. Rios, Chun Ly
{"title":"Implementing and Managing a Data Curation Workflow in the Cloud","authors":"F. Rios, Chun Ly","doi":"10.7191/jeslib.2021.1205","DOIUrl":null,"url":null,"abstract":"Objective: To increase data quality and ensure compliance with appropriate policies, many institutional data repositories curate data that is deposited into their systems. Here, we present our experience as an academic library implementing and managing a semi-automated, cloud-based data curation workflow for a recently launched institutional data repository. Based on our experiences we then present management observations intended for data repository managers and technical staff looking to move some or all of their curation services to the cloud.\n\nMethods: We implemented tooling for our curation workflow in a service-oriented manner, making significant use of our data repository platform’s application programming interface (API). With an eye towards sustainability, a guiding development philosophy has been to automate processes following industry best practices while avoiding solutions with high resource needs (e.g., maintenance), and minimizing the risk of becoming locked-in to specific tooling.\n\nResults: The initial barrier for implementing a data curation workflow in the cloud was high in comparison to on-premises curation, mainly due to the need to develop in-house cloud expertise. However, compared to the cost for on-premises servers and storage, infrastructure costs have been substantially lower. Furthermore, in our particular case, once the foundation had been established, a cloud approach resulted in increased agility allowing us to quickly automate our workflow as needed.\n\nConclusions: Workflow automation has put us on a path toward scaling the service and a cloud based-approach has helped with reduced initial costs. However, because cloud-based workflows and automation come with a maintenance overhead, it is important to build tooling that follows software development best practices and can be decoupled from curation workflows to avoid lock-in.","PeriodicalId":90214,"journal":{"name":"Journal of escience librarianship","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of escience librarianship","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.7191/jeslib.2021.1205","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Objective: To increase data quality and ensure compliance with appropriate policies, many institutional data repositories curate data that is deposited into their systems. Here, we present our experience as an academic library implementing and managing a semi-automated, cloud-based data curation workflow for a recently launched institutional data repository. Based on our experiences we then present management observations intended for data repository managers and technical staff looking to move some or all of their curation services to the cloud. Methods: We implemented tooling for our curation workflow in a service-oriented manner, making significant use of our data repository platform’s application programming interface (API). With an eye towards sustainability, a guiding development philosophy has been to automate processes following industry best practices while avoiding solutions with high resource needs (e.g., maintenance), and minimizing the risk of becoming locked-in to specific tooling. Results: The initial barrier for implementing a data curation workflow in the cloud was high in comparison to on-premises curation, mainly due to the need to develop in-house cloud expertise. However, compared to the cost for on-premises servers and storage, infrastructure costs have been substantially lower. Furthermore, in our particular case, once the foundation had been established, a cloud approach resulted in increased agility allowing us to quickly automate our workflow as needed. Conclusions: Workflow automation has put us on a path toward scaling the service and a cloud based-approach has helped with reduced initial costs. However, because cloud-based workflows and automation come with a maintenance overhead, it is important to build tooling that follows software development best practices and can be decoupled from curation workflows to avoid lock-in.
在云中实现和管理数据管理工作流
目的:为了提高数据质量并确保符合适当的政策,许多机构数据存储库管理存储到其系统中的数据。在这里,我们将介绍我们作为一个学术图书馆为最近推出的机构数据存储库实施和管理半自动、基于云的数据管理工作流的经验。然后,根据我们的经验,我们提出了一些管理观察,旨在为希望将部分或全部管理服务迁移到云上的数据存储库管理人员和技术人员提供参考。方法:我们以面向服务的方式实现了管理工作流的工具,充分利用了我们的数据存储库平台的应用程序编程接口(API)。着眼于可持续性,一个指导性的开发理念是遵循行业最佳实践自动化过程,同时避免高资源需求的解决方案(例如,维护),并最小化被锁定到特定工具的风险。结果:与本地管理相比,在云中实施数据管理工作流的最初障碍很高,主要是因为需要开发内部云专业知识。然而,与本地服务器和存储的成本相比,基础设施的成本要低得多。此外,在我们的特殊情况下,一旦建立了基础,云方法就会增加敏捷性,使我们能够根据需要快速自动化工作流程。结论:工作流自动化使我们走上了扩展服务的道路,基于云的方法有助于降低初始成本。然而,由于基于云的工作流和自动化带来了维护开销,因此构建遵循软件开发最佳实践的工具非常重要,并且可以与管理工作流解耦以避免锁定。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
审稿时长
16 weeks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信