Yuqing Yang , Chengguo Xu , Wenhao Hou , Alan G. McElligott , Kai Liu , Yueju Xue
{"title":"基于变压器的视听多模态融合技术,用于精细识别母猪的哺乳行为","authors":"Yuqing Yang , Chengguo Xu , Wenhao Hou , Alan G. McElligott , Kai Liu , Yueju Xue","doi":"10.1016/j.aiia.2025.03.006","DOIUrl":null,"url":null,"abstract":"<div><div>Nursing behaviour and the calling-to-nurse sound are crucial indicators for assessing sow maternal behaviour and nursing status. However, accurately identifying these behaviours for individual sows in complex indoor pig housing is challenging due to factors such as variable lighting, rail obstructions, and interference from other sows' calls. Multimodal fusion, which integrates audio and visual data, has proven to be an effective approach for improving accuracy and robustness in complex scenarios. In this study, we designed an audio-visual data acquisition system that includes a camera for synchronised audio and video capture, along with a custom-developed sound source localisation system that leverages a sound sensor to track sound direction. Specifically, we proposed a novel transformer-based audio-visual multimodal fusion (TMF) framework for recognising fine-grained sow nursing behaviour with or without the calling-to-nurse sound. Initially, a unimodal self-attention enhancement (USE) module was employed to augment video and audio features with global contextual information. Subsequently, we developed an audio-visual interaction enhancement (AVIE) module to compress relevant information and reduce noise using the information bottleneck principle. Moreover, we presented an adaptive dynamic decision fusion strategy to optimise the model's performance by focusing on the most relevant features in each modality. Finally, we comprehensively identified fine-grained nursing behaviours by integrating audio and fused information, while incorporating angle information from the real-time sound source localisation system to accurately determine whether the sound cues originate from the target sow. Our results demonstrate that the proposed method achieves an accuracy of 98.42 % for general sow nursing behaviour and 94.37 % for fine-grained nursing behaviour, including nursing with and without the calling-to-nurse sound, and non-nursing behaviours. This fine-grained nursing information can provide a more nuanced understanding of the sow's health and lactation willingness, thereby enhancing management practices in pig farming.</div></div>","PeriodicalId":52814,"journal":{"name":"Artificial Intelligence in Agriculture","volume":"15 3","pages":"Pages 363-376"},"PeriodicalIF":8.2000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Transformer-based audio-visual multimodal fusion for fine-grained recognition of individual sow nursing behaviour\",\"authors\":\"Yuqing Yang , Chengguo Xu , Wenhao Hou , Alan G. McElligott , Kai Liu , Yueju Xue\",\"doi\":\"10.1016/j.aiia.2025.03.006\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Nursing behaviour and the calling-to-nurse sound are crucial indicators for assessing sow maternal behaviour and nursing status. However, accurately identifying these behaviours for individual sows in complex indoor pig housing is challenging due to factors such as variable lighting, rail obstructions, and interference from other sows' calls. Multimodal fusion, which integrates audio and visual data, has proven to be an effective approach for improving accuracy and robustness in complex scenarios. In this study, we designed an audio-visual data acquisition system that includes a camera for synchronised audio and video capture, along with a custom-developed sound source localisation system that leverages a sound sensor to track sound direction. Specifically, we proposed a novel transformer-based audio-visual multimodal fusion (TMF) framework for recognising fine-grained sow nursing behaviour with or without the calling-to-nurse sound. Initially, a unimodal self-attention enhancement (USE) module was employed to augment video and audio features with global contextual information. Subsequently, we developed an audio-visual interaction enhancement (AVIE) module to compress relevant information and reduce noise using the information bottleneck principle. Moreover, we presented an adaptive dynamic decision fusion strategy to optimise the model's performance by focusing on the most relevant features in each modality. Finally, we comprehensively identified fine-grained nursing behaviours by integrating audio and fused information, while incorporating angle information from the real-time sound source localisation system to accurately determine whether the sound cues originate from the target sow. Our results demonstrate that the proposed method achieves an accuracy of 98.42 % for general sow nursing behaviour and 94.37 % for fine-grained nursing behaviour, including nursing with and without the calling-to-nurse sound, and non-nursing behaviours. This fine-grained nursing information can provide a more nuanced understanding of the sow's health and lactation willingness, thereby enhancing management practices in pig farming.</div></div>\",\"PeriodicalId\":52814,\"journal\":{\"name\":\"Artificial Intelligence in Agriculture\",\"volume\":\"15 3\",\"pages\":\"Pages 363-376\"},\"PeriodicalIF\":8.2000,\"publicationDate\":\"2025-04-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Artificial Intelligence in Agriculture\",\"FirstCategoryId\":\"1087\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2589721725000376\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AGRICULTURE, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Artificial Intelligence in Agriculture","FirstCategoryId":"1087","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2589721725000376","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AGRICULTURE, MULTIDISCIPLINARY","Score":null,"Total":0}
Transformer-based audio-visual multimodal fusion for fine-grained recognition of individual sow nursing behaviour
Nursing behaviour and the calling-to-nurse sound are crucial indicators for assessing sow maternal behaviour and nursing status. However, accurately identifying these behaviours for individual sows in complex indoor pig housing is challenging due to factors such as variable lighting, rail obstructions, and interference from other sows' calls. Multimodal fusion, which integrates audio and visual data, has proven to be an effective approach for improving accuracy and robustness in complex scenarios. In this study, we designed an audio-visual data acquisition system that includes a camera for synchronised audio and video capture, along with a custom-developed sound source localisation system that leverages a sound sensor to track sound direction. Specifically, we proposed a novel transformer-based audio-visual multimodal fusion (TMF) framework for recognising fine-grained sow nursing behaviour with or without the calling-to-nurse sound. Initially, a unimodal self-attention enhancement (USE) module was employed to augment video and audio features with global contextual information. Subsequently, we developed an audio-visual interaction enhancement (AVIE) module to compress relevant information and reduce noise using the information bottleneck principle. Moreover, we presented an adaptive dynamic decision fusion strategy to optimise the model's performance by focusing on the most relevant features in each modality. Finally, we comprehensively identified fine-grained nursing behaviours by integrating audio and fused information, while incorporating angle information from the real-time sound source localisation system to accurately determine whether the sound cues originate from the target sow. Our results demonstrate that the proposed method achieves an accuracy of 98.42 % for general sow nursing behaviour and 94.37 % for fine-grained nursing behaviour, including nursing with and without the calling-to-nurse sound, and non-nursing behaviours. This fine-grained nursing information can provide a more nuanced understanding of the sow's health and lactation willingness, thereby enhancing management practices in pig farming.