Kanyuan Dai , Ji Shao , Bo Gong , Ling Jing , Yingyi Chen
{"title":"CLIP-FSSC:基于自然语言监督的可转移鱼虾物种分类视觉模型","authors":"Kanyuan Dai , Ji Shao , Bo Gong , Ling Jing , Yingyi Chen","doi":"10.1016/j.aquaeng.2024.102460","DOIUrl":null,"url":null,"abstract":"<div><p>Fish and shrimp species classification is a practical need in the field of aquaculture. Traditional classification method includes extracting modal features of the single image and training them on downstream datasets. However, this method has the disadvantage of requiring manual annotated image data and significant training time. To address these issues, this paper introduces a method named CLIP-FSSC (Contrastive Language–Image Pre-training for Fish and Shrimp Species Classification) for zero-shot prediction using a pre-trained model. The proposed method aims to classify fish and shrimp species in the field of aquaculture using a multimodal pre-trained model that utilizes semantic text description as an image supervision signal for transfer learning. In the downstream fish dataset, we use natural language labels for three types of fish - grass carp, common carp, and silver carp. We extract text category features using a transformer and compare the results obtained from three different CLIP-based backbones for the image modality - vision transformer, Resnet50, and Resnet101. We compare the performance of these models with previous methods that performed well. After performing zero-shot predictions on samples of the three types of fish, we achieve similar or even better classification accuracy than models trained on downstream fish datasets. Our experiment results show an accuracy of 98.77 %, and no new training process is required. This proves that using the semantic text modality as the label for the image modality can effectively classify fish species. To demonstrate the effectiveness of this method on other species in the field of aquaculture, we collected two sets of shrimp data - prawn and cambarus. Through zero-shot prediction, we achieve the highest classification accuracy of 92.00 % for these two types of shrimp datasets. Overall, our results demonstrate that using a multimodal pre-trained model with semantic text description as an image supervision signal for transfer learning can effectively classify fish and shrimp species with high accuracy, while reducing the need for manual annotation and training time.</p></div>","PeriodicalId":8120,"journal":{"name":"Aquacultural Engineering","volume":"107 ","pages":"Article 102460"},"PeriodicalIF":3.6000,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CLIP-FSSC: A transferable visual model for fish and shrimp species classification based on natural language supervision\",\"authors\":\"Kanyuan Dai , Ji Shao , Bo Gong , Ling Jing , Yingyi Chen\",\"doi\":\"10.1016/j.aquaeng.2024.102460\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Fish and shrimp species classification is a practical need in the field of aquaculture. Traditional classification method includes extracting modal features of the single image and training them on downstream datasets. However, this method has the disadvantage of requiring manual annotated image data and significant training time. To address these issues, this paper introduces a method named CLIP-FSSC (Contrastive Language–Image Pre-training for Fish and Shrimp Species Classification) for zero-shot prediction using a pre-trained model. 
The proposed method aims to classify fish and shrimp species in the field of aquaculture using a multimodal pre-trained model that utilizes semantic text description as an image supervision signal for transfer learning. In the downstream fish dataset, we use natural language labels for three types of fish - grass carp, common carp, and silver carp. We extract text category features using a transformer and compare the results obtained from three different CLIP-based backbones for the image modality - vision transformer, Resnet50, and Resnet101. We compare the performance of these models with previous methods that performed well. After performing zero-shot predictions on samples of the three types of fish, we achieve similar or even better classification accuracy than models trained on downstream fish datasets. Our experiment results show an accuracy of 98.77 %, and no new training process is required. This proves that using the semantic text modality as the label for the image modality can effectively classify fish species. To demonstrate the effectiveness of this method on other species in the field of aquaculture, we collected two sets of shrimp data - prawn and cambarus. Through zero-shot prediction, we achieve the highest classification accuracy of 92.00 % for these two types of shrimp datasets. Overall, our results demonstrate that using a multimodal pre-trained model with semantic text description as an image supervision signal for transfer learning can effectively classify fish and shrimp species with high accuracy, while reducing the need for manual annotation and training time.</p></div>\",\"PeriodicalId\":8120,\"journal\":{\"name\":\"Aquacultural Engineering\",\"volume\":\"107 \",\"pages\":\"Article 102460\"},\"PeriodicalIF\":3.6000,\"publicationDate\":\"2024-08-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Aquacultural Engineering\",\"FirstCategoryId\":\"97\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0144860924000712\",\"RegionNum\":2,\"RegionCategory\":\"农林科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"AGRICULTURAL ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Aquacultural Engineering","FirstCategoryId":"97","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0144860924000712","RegionNum":2,"RegionCategory":"农林科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"AGRICULTURAL ENGINEERING","Score":null,"Total":0}
CLIP-FSSC: A transferable visual model for fish and shrimp species classification based on natural language supervision
Fish and shrimp species classification is a practical need in aquaculture. Traditional classification methods extract visual features from individual images and train a model on downstream datasets, an approach that requires manually annotated image data and significant training time. To address these issues, this paper introduces CLIP-FSSC (Contrastive Language-Image Pre-training for Fish and Shrimp Species Classification), a method for zero-shot prediction using a pre-trained model. The proposed method classifies aquaculture fish and shrimp species with a multimodal pre-trained model that uses semantic text descriptions as the supervision signal for image transfer learning. On the downstream fish dataset, we use natural-language labels for three fish species: grass carp, common carp, and silver carp. We extract text category features with a transformer and compare three CLIP-based backbones for the image modality: Vision Transformer (ViT), ResNet50, and ResNet101. We also compare these models against previous methods that performed well on this task. Zero-shot prediction on samples of the three fish species achieves classification accuracy similar to, or better than, that of models trained on the downstream fish datasets: our experiments reach an accuracy of 98.77% with no additional training required. This demonstrates that using the semantic text modality as the label for the image modality can effectively classify fish species. To show the method's effectiveness on other aquaculture species, we collected two shrimp datasets, prawn and Cambarus, and achieve a highest zero-shot classification accuracy of 92.00% on them. Overall, our results demonstrate that a multimodal pre-trained model using semantic text descriptions as the image supervision signal for transfer learning can classify fish and shrimp species with high accuracy while reducing the need for manual annotation and training time.
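To make the zero-shot pipeline concrete, here is a minimal sketch of CLIP-based species classification in the spirit of CLIP-FSSC, written against the Hugging Face transformers CLIP implementation. The checkpoint name, prompt template, and image path are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of CLIP zero-shot species classification (not the paper's
# exact setup): natural-language labels supervise the image modality, so no
# fish-specific training is performed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Pre-trained CLIP with a ViT image backbone (one of the three backbones
# compared in the paper, alongside ResNet50 and ResNet101).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Natural-language category labels act as the supervision signal.
labels = ["grass carp", "common carp", "silver carp"]
prompts = [f"a photo of a {label}" for label in labels]  # assumed template

image = Image.open("fish_sample.jpg")  # hypothetical input image

inputs = processor(text=prompts, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax converts
# them into per-class probabilities for the zero-shot prediction.
probs = outputs.logits_per_image.softmax(dim=-1)
pred = labels[probs.argmax().item()]
print(f"predicted species: {pred} (p = {probs.max().item():.3f})")
```

The ResNet50 and ResNet101 comparisons in the paper correspond to the RN50 and RN101 checkpoints distributed with OpenAI's original `clip` package; swapping the backbone changes only the image encoder, since the text labels stay fixed.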
Journal introduction:
Aquacultural Engineering is concerned with the design and development of effective aquacultural systems for marine and freshwater facilities. The journal aims to apply knowledge gained from basic research that can potentially be translated into commercial operations.
Problems of scale-up and application of research data involve many parameters, both physical and biological, making it difficult to anticipate the interaction between the unit processes and the cultured animals. Aquacultural Engineering aims to develop this bioengineering interface for aquaculture and welcomes contributions in the following areas:
– Engineering and design of aquaculture facilities
– Engineering-based research studies
– Construction experience and techniques
– In-service experience, commissioning, operation
– Materials selection and their uses
– Quantification of biological data and constraints