Yuxin Yang, Hong Yu, Xin Zhang, Peng Zhang, Wan Tu, Lishuai Gu
Aquacultural Engineering, Volume 107, Article 102471 · DOI: 10.1016/j.aquaeng.2024.102471 · Published 2024-10-02 (Journal Article) · Impact Factor 3.6 · JCR Q2 (Agricultural Engineering)
Cited by: 0
Abstract
Fish behavior recognition based on an audio-visual multimodal interactive fusion network
Environmental noise and dim lighting in aquaculture environments pose challenges for fish behavior recognition and reduce the effectiveness of unimodal methods that rely on sound or visual cues alone. This paper therefore proposes Mul-SEResNet50, a fish behavior recognition model based on the fusion of audio and visual information. To address image blurring and indistinct sounds, which hinder effective multimodal fusion and cross-modal complementarity, a multimodal interaction fusion (MIF) module is introduced; it integrates the audio and visual modalities at multiple stages to produce a more comprehensive joint feature representation. To further strengthen complementarity during fusion, a U-shaped bilinear fusion structure is designed to fully exploit multimodal information, capture cross-modal associations, and extract high-level features. To mitigate the loss of key features, a temporal aggregation and pooling (TAP) layer preserves fine-grained detail by extracting both the maximum and the average value within each pooling region. Ablation and comparative experiments validate the proposed model: Mul-SEResNet50 improves accuracy by 5.04 % over SEResNet50 without sacrificing detection speed, and compared with the state-of-the-art U-FusionNet-ResNet50+SENet model it improves accuracy and F1 score by 0.47 % and 1.32 %, respectively. These findings confirm that the proposed model recognizes fish behavior accurately, facilitating the precise monitoring of fish behavior.
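The abstract does not specify the TAP layer's exact formulation. A minimal sketch of the core idea it describes, combining the maximum and the average within each pooling region so that peak responses and overall activity are both preserved, might look like the following. The non-overlapping region size, the sum used to combine the two statistics, and the function name `tap_pool` are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def tap_pool(features: np.ndarray, region: int = 2) -> np.ndarray:
    """Illustrative temporal aggregation and pooling (TAP) sketch.

    features: array of shape (time, channels).
    Pools over non-overlapping temporal regions of length `region`,
    combining the max and the mean of each region (combination by
    summation is an assumption here).
    """
    t, c = features.shape
    t_trim = (t // region) * region  # drop any trailing remainder frames
    x = features[:t_trim].reshape(t_trim // region, region, c)
    # keep both peak responses (max) and overall activity (mean)
    return x.max(axis=1) + x.mean(axis=1)

# toy input: 4 time steps, 2 channels
x = np.arange(8, dtype=float).reshape(4, 2)
out = tap_pool(x)
print(out.shape)  # (2, 2)
```

In contrast, plain max pooling would discard the average activity in each region and plain average pooling would blur out peaks; combining both statistics is one simple way to retain finer-grained information, which is the motivation the abstract gives for TAP.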
Journal description:
Aquacultural Engineering is concerned with the design and development of effective aquacultural systems for marine and freshwater facilities. The journal aims to apply knowledge gained from basic research that can potentially be translated into commercial operations.
Problems of scale-up and application of research data involve many parameters, both physical and biological, making it difficult to anticipate the interaction between the unit processes and the cultured animals. Aquacultural Engineering aims to develop this bioengineering interface for aquaculture and welcomes contributions in the following areas:
– Engineering and design of aquaculture facilities
– Engineering-based research studies
– Construction experience and techniques
– In-service experience, commissioning, operation
– Materials selection and their uses
– Quantification of biological data and constraints