Jacquelyn P. Boerman, Luiz F. Brito, Maria E. Montes, Jacob M. Maskal, Jarrod Doucette, Kirby Kalbaugh
JDS Communications 6(3), Pages 339-344. Published 2025-05-01. DOI: 10.3168/jdsc.2024-0723. https://www.sciencedirect.com/science/article/pii/S2666910225000389
Data processing techniques to improve data integration from dairy farms
Data generation on dairy cattle farms is expected to keep increasing as the number of animals per farm grows and as farms adopt sensors and technologies that record additional information on individual animals at greater frequency. Siloed data that lack interoperability prevent end users from combining multiple data sources and drawing more meaningful conclusions from the data generated on farm. In response to these challenges, this technical note describes the design and documents the development of a data ecosystem that automatically collects, quality-controls, and integrates data from disparate sources used on experimental and commercial dairy farms. Integrated data can be queried to answer specific questions or to generate timed reports that provide more insight than any single data source. Our objective was to develop a collaborative research data infrastructure that makes these data accessible through an integrated computational ecosystem built on the open-source technologies JupyterHub, Python, and Apache Spark. This shared, curated environment lets users work with large datasets and use distributed computing and parallel processing for multi-dataset analysis and integration. Before the farm data are made available to users, they undergo a multistage preprocessing protocol designed to mitigate data integrity problems. These curation steps address variability across sources, including vendor-specific software modifications, intermittent data retrieval disruptions, and farm-level operational contingencies. Through data cleaning, transformation, and validation, the infrastructure standardizes the data and supports quality assurance.
Integrating datasets from different sources is essential for improving dairy cattle welfare and production efficiency, which are complex management and breeding goals influenced by many traits that can be measured by different sensors. We identified areas where research and further development are needed in dairy data science (e.g., data editing and quality control procedures, references and standards for novel sensor-based variables, and validation of data across sensors), a field that is expected to continue playing a major role in the sustainability of the dairy industry.
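
The quality-control and integration steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline, which the abstract says is built on JupyterHub, Python, and Apache Spark. It uses plain Python so that it runs without a cluster, and every record, field name (`cow_id`, `milk_kg`, `rumination_min`), and plausibility range below is a hypothetical assumption chosen only for illustration.

```python
from datetime import date

# Hypothetical records from two independent farm systems (illustrative values only).
milk_records = [
    {"cow_id": "A101", "day": date(2024, 1, 1), "milk_kg": 38.2},
    {"cow_id": "A101", "day": date(2024, 1, 1), "milk_kg": 38.2},  # duplicate upload
    {"cow_id": "A102", "day": date(2024, 1, 1), "milk_kg": -4.0},  # impossible value
    {"cow_id": "A103", "day": date(2024, 1, 1), "milk_kg": 41.5},
]
activity_records = [
    {"cow_id": "A101", "day": date(2024, 1, 1), "rumination_min": 512},
    {"cow_id": "A103", "day": date(2024, 1, 1), "rumination_min": 478},
]


def quality_control(records, field, low, high):
    """Drop exact duplicates and values outside the assumed range [low, high]."""
    seen = set()
    clean = []
    for rec in records:
        key = (rec["cow_id"], rec["day"], rec[field])
        if key in seen or not (low <= rec[field] <= high):
            continue
        seen.add(key)
        clean.append(rec)
    return clean


def integrate(left, right):
    """Inner-join two record lists on the shared (cow_id, day) key."""
    index = {(r["cow_id"], r["day"]): r for r in right}
    merged = []
    for rec in left:
        match = index.get((rec["cow_id"], rec["day"]))
        if match is not None:
            merged.append({**rec, **match})
    return merged


# Assumed plausibility ranges; real thresholds would come from domain standards.
clean_milk = quality_control(milk_records, "milk_kg", 0.0, 100.0)
clean_activity = quality_control(activity_records, "rumination_min", 0, 1440)
integrated = integrate(clean_milk, clean_activity)

for row in integrated:
    print(row["cow_id"], row["day"], row["milk_kg"], row["rumination_min"])
```

Running quality control on each source before the join keeps a bad reading in one system from propagating into the integrated table. In a Spark-based environment the same two steps would typically be expressed as DataFrame filters and a join on the shared animal and date keys.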