Data processing techniques to improve data integration from dairy farms

Jacquelyn P. Boerman, Luiz F. Brito, Maria E. Montes, Jacob M. Maskal, Jarrod Doucette, Kirby Kalbaugh
JDS Communications, volume 6, issue 3, pages 339-344. Published 2025-05-01. DOI: 10.3168/jdsc.2024-0723. URL: https://www.sciencedirect.com/science/article/pii/S2666910225000389

Abstract

Large-scale data generation on dairy cattle farms is expected to continue increasing due to more animals per farm and the adoption of on-farm sensors and technologies that generate additional information on individual animals at greater frequency. Siloed data and information, lacking interoperability, prevent end users from combining data from multiple sources and drawing more meaningful conclusions from the data generated on farm. Given these data challenges, the objective of this technical note is to describe the process of designing, developing, and documenting a data ecosystem that automatically collects, quality-controls, and integrates data from disparate sources used on experimental and commercial dairy farms. Integrated data can be queried to answer specific questions or to generate timed reports that provide more insight than any single data source can provide. Our objective was to develop a collaborative research data infrastructure that enables comprehensive data accessibility through an integrated computational ecosystem comprising the open-source technologies JupyterHub, Python, and Apache Spark. This shared, curated environment facilitates extensive dataset consumption, empowering users to leverage distributed computing resources and parallel processing capabilities for sophisticated multi-dataset analysis and integration. Before being made accessible to users, the farm data undergo a rigorous multistage preprocessing protocol designed to mitigate potential data integrity challenges. These comprehensive data curation steps systematically address complex variability arising from sources such as vendor-specific software modifications, intermittent data retrieval disruptions, and farm-level operational contingencies. Employing sophisticated data cleaning, transformation, and validation methodologies, the infrastructure ensures robust data standardization and quality assurance.
The integration of datasets from different data sources is paramount for improving dairy cattle welfare and production efficiency, which are complex management and breeding goals influenced by a multitude of traits that can be measured by different sensors. We identified research and further development needed in the field of dairy data science (e.g., data editing and quality control procedures, references and standards for novel sensor-based variables, and validation of obtained data across sensors), which is expected to continue playing a major role in the sustainability of the dairy industry.
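
As a rough illustration of the clean-then-integrate flow the abstract describes, the sketch below normalizes animal identifiers from two hypothetical sources, applies simple range checks, drops duplicates, and joins the records by cow and day. It is not the authors' implementation: the abstract names Python and Apache Spark, whereas this uses only the Python standard library so it can run anywhere. The field names, ID prefixes, and plausibility limits are all assumptions chosen for the example.

```python
from datetime import date

# Hypothetical vendor-specific ID prefixes; real systems will differ.
ID_PREFIXES = ("COW-", "ID:")


def normalize_cow_id(raw):
    """Strip whitespace, known prefixes, and leading zeros so IDs match across sources."""
    cleaned = raw.strip().upper()
    for prefix in ID_PREFIXES:
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
    return cleaned.lstrip("0") or "0"


def clean(records, low, high):
    """Return {(cow_id, day): value}, skipping missing or out-of-range values.

    Only the first record per (cow_id, day) is kept, which removes duplicates
    such as those produced by a repeated export.
    """
    cleaned = {}
    for raw_id, day, value in records:
        if value is None or not (low <= value <= high):
            continue
        cleaned.setdefault((normalize_cow_id(raw_id), day), value)
    return cleaned


def integrate(milk, rumination):
    """Inner-join the two cleaned sources on (cow_id, day)."""
    shared_keys = sorted(milk.keys() & rumination.keys())
    return [
        {
            "cow_id": cow_id,
            "day": day,
            "milk_kg": milk[(cow_id, day)],
            "rumination_min": rumination[(cow_id, day)],
        }
        for cow_id, day in shared_keys
    ]


# Made-up example records: (raw ID, day, value).
milk_raw = [
    ("COW-0012", date(2024, 1, 5), 38.2),
    ("COW-0012", date(2024, 1, 5), 38.2),   # duplicate row
    ("0047", date(2024, 1, 5), 412.0),      # implausible yield, dropped
    ("0047", date(2024, 1, 6), 41.5),
]
rumination_raw = [
    (" 12 ", date(2024, 1, 5), 495),
    ("ID:47", date(2024, 1, 6), 510),
    ("ID:47", date(2024, 1, 7), 2000),      # more minutes than a day has, dropped
]

# Assumed plausibility limits: 0-100 kg of milk per day, 0-1440 minutes per day.
milk = clean(milk_raw, low=0.0, high=100.0)
rumination = clean(rumination_raw, low=0, high=1440)
integrated = integrate(milk, rumination)

for row in integrated:
    print(row)
```

Cleaning each source before joining keeps the join itself simple: identifiers already agree and implausible values never reach the integrated table. In a Spark-based setup the same steps would presumably map onto distributed DataFrame operations, though the abstract does not give those details.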