Fish behavior recognition based on an audio-visual multimodal interactive fusion network
Yuxin Yang, Hong Yu, Xin Zhang, Peng Zhang, Wan Tu, Lishuai Gu
Aquacultural Engineering, Volume 107, Article 102471. Published 2024-10-02.
DOI: 10.1016/j.aquaeng.2024.102471
Citations: 0
Abstract
Fish behavior recognition in aquaculture is hampered by environmental noise and dim lighting, which degrade unimodal recognition methods based on sound or visual cues alone. This paper therefore proposes Mul-SEResNet50, a fish behavior recognition model that fuses audio and visual information. To counter image blurring and indistinct sounds in aquaculture environments, which weaken multimodal fusion and cross-modal complementarity, a multimodal interaction fusion (MIF) module is introduced; it integrates the audio and visual modalities at multiple stages to obtain a more comprehensive joint feature representation. To enhance complementarity during fusion, we designed a U-shaped bilinear fusion structure that fully exploits multimodal information, captures cross-modal associations, and extracts high-level features. Furthermore, to avoid losing key features, a temporal aggregation and pooling (TAP) layer preserves fine-grained detail by extracting both the maximum and the average value within each pooling region. Ablation and comparative experiments validate the proposed model: Mul-SEResNet50 improves accuracy by 5.04% over SEResNet50 without sacrificing detection speed, and compared with the state-of-the-art U-FusionNet-ResNet50+SENet model it improves accuracy and F1 score by 0.47% and 1.32%, respectively. These findings confirm that the proposed model recognizes fish behavior accurately, facilitating precise monitoring of fish behavior.
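The TAP idea described above — keeping both the maximum and the average within each pooling region so that sharp peaks and overall energy are both preserved — can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `tap_pool`, the `(frames, features)` layout, the window size, and the choice to concatenate the two statistics are all assumptions.

```python
import numpy as np

def tap_pool(features: np.ndarray, region: int = 2) -> np.ndarray:
    """Illustrative sketch of a temporal aggregation and pooling (TAP) layer.

    Within each temporal pooling region, extract BOTH the maximum and the
    average, then concatenate them, so fine-grained peaks are not discarded
    the way plain average pooling would discard them.

    features : (T, D) array of per-frame features
    region   : length of each pooling window along the time axis
    returns  : (T // region, 2 * D) array — [max || mean] per region
    """
    T, D = features.shape
    T_trim = (T // region) * region            # drop a ragged tail, if any
    x = features[:T_trim].reshape(T_trim // region, region, D)
    return np.concatenate([x.max(axis=1), x.mean(axis=1)], axis=-1)
```

For example, pooling a 4-frame, 2-dimensional sequence with `region=2` yields two output rows, each holding the window's per-dimension max followed by its per-dimension mean. In a real network the same statistics would typically be computed with paired max- and average-pooling layers whose outputs are concatenated before the classifier.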
Journal overview
Aquacultural Engineering is concerned with the design and development of effective aquacultural systems for marine and freshwater facilities. The journal aims to apply knowledge gained from basic research that can potentially be translated into commercial operations.
Scale-up and the application of research data involve many parameters, both physical and biological, making it difficult to anticipate the interactions between unit processes and the cultured animals. Aquacultural Engineering aims to develop this bioengineering interface for aquaculture and welcomes contributions in the following areas:
– Engineering and design of aquaculture facilities
– Engineering-based research studies
– Construction experience and techniques
– In-service experience, commissioning, operation
– Materials selection and their uses
– Quantification of biological data and constraints