RC-UMT: Multimodal fish feeding intensity classification network with enhanced feature extraction capability in aquaculture

Shantan Li, Xinting Yang, Zhitao Huang, Tingting Fu, Pingchuan Ma, Dejun Feng, Chao Zhou

Aquacultural Engineering, Volume 112, Article 102639 (2025). DOI: 10.1016/j.aquaeng.2025.102639
Abstract
Quantifying fish feeding intensity in real time is crucial for devising scientific feeding strategies and improving production efficiency. Unimodal approaches to feeding intensity recognition, based on audio or video alone, often fail to capture the global characteristics of feeding behaviour and can yield unreliable results. To address these limitations, a video-audio fusion model, the Reinforced Cross-modal Uni-Modal Teacher (RC-UMT), is introduced, which classifies feeding intensity into four categories. The model is built in three steps. First, the backbone of the original Uni-Modal Teacher (UMT) model is replaced with a Res2Net network; by weighting the spatial coordinates of input feature maps, the Coordinate Attention (CA) mechanism in this network improves the capture of key multi-frequency features in audio signals. Second, RepViT, a lightweight convolutional network inspired by Vision Transformer principles, is introduced into the UMT; it extracts global and local visual features simultaneously, improving multi-level semantic representation. Finally, an affine transformation matrix with manifold preservation is introduced to enhance the fusion stage of the original UMT; it preserves the manifold structure of the original modal features, enabling more efficient cross-modal fusion. Experimental results show that the RC-UMT model achieves a classification accuracy of 93 %, outperforming the original UMT model by 7 % while using 25.41 % fewer parameters. Compared with the audio-only and video-only modalities, accuracy improves by 7 % and 6.5 %, respectively. The proposed video-audio multimodal model therefore enables real-time, high-precision feeding intensity classification, providing technical support for improving smart feeding equipment.
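Two of the abstract's building blocks are concrete enough to sketch. The first is the Coordinate Attention mechanism in the audio branch; the module below is a minimal PyTorch sketch of the standard CA block (Hou et al., CVPR 2021), with the channel count, reduction ratio, and spectrogram shape as illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Weights feature maps along the two spatial axes separately, so
    frequency (height) and time (width) positions of an audio spectrogram
    can be emphasised independently."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool over height
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                       # (n, c, h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)   # (n, c, w, 1)
        y = torch.cat([x_h, x_w], dim=2)           # joint encoding of both axes
        y = self.act(self.bn(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * a_h * a_w   # axis-wise attention applied to the input

# Usage on a batch of log-mel spectrogram feature maps (shape is hypothetical):
feats = torch.randn(4, 64, 128, 256)  # (batch, channels, freq bins, time frames)
out = CoordinateAttention(64)(feats)
assert out.shape == feats.shape
```

The abstract's manifold-preserving affine fusion is described only at a high level; the loss below is one hedged reading of it, pairing a learned affine map between modality feature spaces with a pairwise-distance term that discourages distorting the source manifold. The function name, shapes, and the weight `alpha` are hypothetical, not the paper's formulation.

```python
def affine_fusion_loss(audio: torch.Tensor, video: torch.Tensor,
                       W: torch.Tensor, b: torch.Tensor,
                       alpha: float = 0.1) -> torch.Tensor:
    """audio, video: (batch, d) features; W: (d, d); b: (d,)."""
    mapped = audio @ W + b                          # affine transformation
    fuse_term = ((mapped - video) ** 2).mean()      # align the two modalities
    d_src = torch.cdist(audio, audio)               # pairwise distances before
    d_map = torch.cdist(mapped, mapped)             # ... and after the map
    manifold_term = ((d_src - d_map) ** 2).mean()   # preserve local geometry
    return fuse_term + alpha * manifold_term
```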
Journal Introduction:
Aquacultural Engineering is concerned with the design and development of effective aquacultural systems for marine and freshwater facilities. The journal aims to apply knowledge gained from basic research that can potentially be translated into commercial operations.
Problems of scale-up and application of research data involve many parameters, both physical and biological, making it difficult to anticipate the interaction between the unit processes and the cultured animals. Aquacultural Engineering aims to develop this bioengineering interface for aquaculture and welcomes contributions in the following areas:
– Engineering and design of aquaculture facilities
– Engineering-based research studies
– Construction experience and techniques
– In-service experience, commissioning, operation
– Materials selection and their uses
– Quantification of biological data and constraints