Transformer-based audio-visual multimodal fusion for fine-grained recognition of individual sow nursing behaviour

IF 8.2 Q1 AGRICULTURE, MULTIDISCIPLINARY
Yuqing Yang, Chengguo Xu, Wenhao Hou, Alan G. McElligott, Kai Liu, Yueju Xue
DOI: 10.1016/j.aiia.2025.03.006
Journal: Artificial Intelligence in Agriculture, Volume 15, Issue 3, Pages 363-376
Published: 2025-04-08
Citations: 0

Abstract

Nursing behaviour and the calling-to-nurse sound are crucial indicators for assessing sow maternal behaviour and nursing status. However, accurately identifying these behaviours for individual sows in complex indoor pig housing is challenging due to factors such as variable lighting, rail obstructions, and interference from other sows' calls. Multimodal fusion, which integrates audio and visual data, has proven to be an effective approach for improving accuracy and robustness in complex scenarios. In this study, we designed an audio-visual data acquisition system that includes a camera for synchronised audio and video capture, along with a custom-developed sound source localisation system that leverages a sound sensor to track sound direction. Specifically, we proposed a novel transformer-based audio-visual multimodal fusion (TMF) framework for recognising fine-grained sow nursing behaviour with or without the calling-to-nurse sound. Initially, a unimodal self-attention enhancement (USE) module was employed to augment video and audio features with global contextual information. Subsequently, we developed an audio-visual interaction enhancement (AVIE) module to compress relevant information and reduce noise using the information bottleneck principle. Moreover, we presented an adaptive dynamic decision fusion strategy to optimise the model's performance by focusing on the most relevant features in each modality. Finally, we comprehensively identified fine-grained nursing behaviours by integrating audio and fused information, while incorporating angle information from the real-time sound source localisation system to accurately determine whether the sound cues originate from the target sow. Our results demonstrate that the proposed method achieves an accuracy of 98.42 % for general sow nursing behaviour and 94.37 % for fine-grained nursing behaviour, including nursing with and without the calling-to-nurse sound, and non-nursing behaviours. 
This fine-grained nursing information can provide a more nuanced understanding of the sow's health and lactation willingness, thereby enhancing management practices in pig farming.
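The pipeline the abstract describes — unimodal self-attention enhancement, late fusion of per-modality predictions, and an angle check from the sound source localisation system — can be sketched in miniature. This is an illustrative NumPy sketch only, not the paper's implementation: the feature sizes, the fixed fusion weight `alpha`, the class names, and the pen sector thresholds in `angle_gate` are all made-up stand-ins for quantities the paper learns or measures.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feats, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of frame features,
    adding global context to each frame (the role the USE module plays)."""
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return feats + scores @ v  # residual: original features plus context

d = 16                              # illustrative feature dimension
video = rng.normal(size=(8, d))     # 8 video-frame embeddings
audio = rng.normal(size=(8, d))     # 8 audio-frame embeddings

# Randomly initialised projections here; learned in a real model.
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
video_enh = self_attention(video, wq, wk, wv)
audio_enh = self_attention(audio, wq, wk, wv)

def decision_fusion(p_video, p_audio, alpha):
    """Late fusion: weight each modality's class probabilities by a relevance
    score alpha in [0, 1]. Fixed toy value here; adaptive in the paper."""
    return alpha * p_audio + (1 - alpha) * p_video

# Toy class probabilities over the three fine-grained classes.
p_video = np.array([0.2, 0.6, 0.2])
p_audio = np.array([0.7, 0.2, 0.1])
fused = decision_fusion(p_video, p_audio, alpha=0.5)

def angle_gate(angle_deg, sector=(30.0, 90.0)):
    """Accept the audio cue only if the localised sound direction falls
    inside the target sow's pen sector (thresholds are invented)."""
    return sector[0] <= angle_deg <= sector[1]

classes = ["nursing with call", "nursing without call", "non-nursing"]
label = classes[int(np.argmax(fused))]
if label == "nursing with call" and not angle_gate(120.0):
    label = "nursing without call"  # the call came from a neighbouring pen
print(label)
```

Gating the audio cue on the localised angle is what lets the system attribute a calling-to-nurse sound to the correct individual when several sows vocalise in the same room.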
Source journal: Artificial Intelligence in Agriculture (Engineering, miscellaneous)
CiteScore: 21.60
Self-citation rate: 0.00%
Articles per year: 18
Review time: 12 weeks