Hunting Data Rogues at Scale: Data Quality Control for Observational Data in Research Infrastructures

G. Pastorello, D. Gunter, H. Chu, D. Christianson, C. Trotta, E. Canfora, B. Faybishenko, Y. Cheah, N. Beekwilder, S. Chan, S. Dengel, T. Keenan, F. O'Brien, Abdelrahman Elbashandy, C. Poindexter, M. Humphrey, D. Papale, D. Agarwal
Published in: 2017 IEEE 13th International Conference on e-Science (e-Science)
Publication date: 2017-10-01
DOI: 10.1109/ESCIENCE.2017.64 (https://doi.org/10.1109/ESCIENCE.2017.64)
Citations: 4

Abstract

Data quality control is one of the most time-consuming activities within Research Infrastructures (RIs), especially when it involves observational data and multiple data providers. In this work we report on our ongoing development of data rogues, a scalable approach to managing data quality issues for observational data within RIs. The motivation for this work started with the creation of the FLUXNET2015 dataset, which includes carbon, water, and energy fluxes plus micrometeorological and ancillary data measured at over 200 sites around the world. To create a uniform dataset, including derived data products, extensive work on data quality control was needed. The unpredictable nature of observational data quality issues makes the automation of data quality control inherently difficult. Developed based on this experience, the data rogues methodology allows for increased automation of quality control activities by systematically identifying, cataloging, and documenting implementations of solutions to data issues. We believe this methodology can be extended and applied to other domains and types of data, making the automation of data quality control a more tractable problem.