RC-UMT: Multimodal fish feeding intensity classification network with enhanced feature extraction capability in aquaculture

Shantan Li, Xinting Yang, Zhitao Huang, Tingting Fu, Pingchuan Ma, Dejun Feng, Chao Zhou

Aquacultural Engineering, Volume 112, Article 102639 (2025). DOI: 10.1016/j.aquaeng.2025.102639
Abstract
Quantifying fish feeding intensity in real time is crucial for devising scientific feeding strategies and improving production efficiency. Unimodal approaches to feeding intensity recognition, based on audio or video alone, often fail to capture the global characteristics of feeding behaviour and can yield unreliable results. To address these limitations, a video-audio fusion model, the Reinforced Cross-modal Uni-Modal Teacher (RC-UMT), is introduced, which classifies feeding intensity into four categories. The model is built in three steps. First, the backbone of the original Uni-Modal Teacher (UMT) model is replaced with a Res2Net network; by weighting the spatial coordinates of input feature maps, the Coordinate Attention (CA) mechanism in this network improves the capture of key multi-frequency features in audio signals. Second, RepViT, a lightweight convolutional network inspired by Vision Transformer principles, is introduced into the UMT; it extracts global and local visual features simultaneously, improving multi-level semantic representation. Finally, an affine transformation matrix with manifold preservation is introduced to enhance the fusion stage of the original UMT; it preserves the manifold structure of the original modal features, enabling more efficient cross-modal fusion. Experimental results show that the RC-UMT model achieves a classification accuracy of 93 %, outperforming the original UMT model by 7 % while using 25.41 % fewer parameters. Compared with the audio-only and video-only modalities, accuracy improves by 7 % and 6.5 %, respectively. The proposed video-audio multimodal model therefore enables real-time, high-precision feeding intensity classification, providing technical support for improving smart feeding equipment.
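Two of the abstract's building blocks are concrete enough to sketch. The first is the Coordinate Attention mechanism in the audio branch; the module below is a minimal PyTorch sketch of the standard CA block (Hou et al., CVPR 2021), with the channel count, reduction ratio, and spectrogram shape as illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Weights feature maps along the two spatial axes separately, so
    frequency (height) and time (width) positions of an audio spectrogram
    can be emphasised independently."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool over height
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                       # (n, c, h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)   # (n, c, w, 1)
        y = torch.cat([x_h, x_w], dim=2)           # joint encoding of both axes
        y = self.act(self.bn(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * a_h * a_w   # axis-wise attention applied to the input

# Usage on a batch of log-mel spectrogram feature maps (shape is hypothetical):
feats = torch.randn(4, 64, 128, 256)  # (batch, channels, freq bins, time frames)
out = CoordinateAttention(64)(feats)
assert out.shape == feats.shape
```

The abstract's manifold-preserving affine fusion is described only at a high level; the loss below is one hedged reading of it, pairing a learned affine map between modality feature spaces with a pairwise-distance term that discourages distorting the source manifold. The function name, shapes, and the weight `alpha` are hypothetical, not the paper's formulation.

```python
def affine_fusion_loss(audio: torch.Tensor, video: torch.Tensor,
                       W: torch.Tensor, b: torch.Tensor,
                       alpha: float = 0.1) -> torch.Tensor:
    """audio, video: (batch, d) features; W: (d, d); b: (d,)."""
    mapped = audio @ W + b                          # affine transformation
    fuse_term = ((mapped - video) ** 2).mean()      # align the two modalities
    d_src = torch.cdist(audio, audio)               # pairwise distances before
    d_map = torch.cdist(mapped, mapped)             # ... and after the map
    manifold_term = ((d_src - d_map) ** 2).mean()   # preserve local geometry
    return fuse_term + alpha * manifold_term
```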
Journal Introduction:
Aquacultural Engineering is concerned with the design and development of effective aquacultural systems for marine and freshwater facilities. The journal aims to apply knowledge gained from basic research that can potentially be translated into commercial operations.
Problems of scale-up and application of research data involve many parameters, both physical and biological, making it difficult to anticipate the interaction between the unit processes and the cultured animals. Aquacultural Engineering aims to develop this bioengineering interface for aquaculture and welcomes contributions in the following areas:
– Engineering and design of aquaculture facilities
– Engineering-based research studies
– Construction experience and techniques
– In-service experience, commissioning, operation
– Materials selection and their uses
– Quantification of biological data and constraints