Kanyuan Dai , Ji Shao , Bo Gong , Ling Jing , Yingyi Chen
{"title":"CLIP-FSSC:基于自然语言监督的可转移鱼虾物种分类视觉模型","authors":"Kanyuan Dai , Ji Shao , Bo Gong , Ling Jing , Yingyi Chen","doi":"10.1016/j.aquaeng.2024.102460","DOIUrl":null,"url":null,"abstract":"<div><p>Fish and shrimp species classification is a practical need in the field of aquaculture. Traditional classification method includes extracting modal features of the single image and training them on downstream datasets. However, this method has the disadvantage of requiring manual annotated image data and significant training time. To address these issues, this paper introduces a method named CLIP-FSSC (Contrastive Language–Image Pre-training for Fish and Shrimp Species Classification) for zero-shot prediction using a pre-trained model. The proposed method aims to classify fish and shrimp species in the field of aquaculture using a multimodal pre-trained model that utilizes semantic text description as an image supervision signal for transfer learning. In the downstream fish dataset, we use natural language labels for three types of fish - grass carp, common carp, and silver carp. We extract text category features using a transformer and compare the results obtained from three different CLIP-based backbones for the image modality - vision transformer, Resnet50, and Resnet101. We compare the performance of these models with previous methods that performed well. After performing zero-shot predictions on samples of the three types of fish, we achieve similar or even better classification accuracy than models trained on downstream fish datasets. Our experiment results show an accuracy of 98.77 %, and no new training process is required. This proves that using the semantic text modality as the label for the image modality can effectively classify fish species. To demonstrate the effectiveness of this method on other species in the field of aquaculture, we collected two sets of shrimp data - prawn and cambarus. Through zero-shot prediction, we achieve the highest classification accuracy of 92.00 % for these two types of shrimp datasets. Overall, our results demonstrate that using a multimodal pre-trained model with semantic text description as an image supervision signal for transfer learning can effectively classify fish and shrimp species with high accuracy, while reducing the need for manual annotation and training time.</p></div>","PeriodicalId":8120,"journal":{"name":"Aquacultural Engineering","volume":"107 ","pages":"Article 102460"},"PeriodicalIF":3.6000,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CLIP-FSSC: A transferable visual model for fish and shrimp species classification based on natural language supervision\",\"authors\":\"Kanyuan Dai , Ji Shao , Bo Gong , Ling Jing , Yingyi Chen\",\"doi\":\"10.1016/j.aquaeng.2024.102460\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Fish and shrimp species classification is a practical need in the field of aquaculture. Traditional classification method includes extracting modal features of the single image and training them on downstream datasets. However, this method has the disadvantage of requiring manual annotated image data and significant training time. To address these issues, this paper introduces a method named CLIP-FSSC (Contrastive Language–Image Pre-training for Fish and Shrimp Species Classification) for zero-shot prediction using a pre-trained model. 
The proposed method aims to classify fish and shrimp species in the field of aquaculture using a multimodal pre-trained model that utilizes semantic text description as an image supervision signal for transfer learning. In the downstream fish dataset, we use natural language labels for three types of fish - grass carp, common carp, and silver carp. We extract text category features using a transformer and compare the results obtained from three different CLIP-based backbones for the image modality - vision transformer, Resnet50, and Resnet101. We compare the performance of these models with previous methods that performed well. After performing zero-shot predictions on samples of the three types of fish, we achieve similar or even better classification accuracy than models trained on downstream fish datasets. Our experiment results show an accuracy of 98.77 %, and no new training process is required. This proves that using the semantic text modality as the label for the image modality can effectively classify fish species. To demonstrate the effectiveness of this method on other species in the field of aquaculture, we collected two sets of shrimp data - prawn and cambarus. Through zero-shot prediction, we achieve the highest classification accuracy of 92.00 % for these two types of shrimp datasets. Overall, our results demonstrate that using a multimodal pre-trained model with semantic text description as an image supervision signal for transfer learning can effectively classify fish and shrimp species with high accuracy, while reducing the need for manual annotation and training time.</p></div>\",\"PeriodicalId\":8120,\"journal\":{\"name\":\"Aquacultural Engineering\",\"volume\":\"107 \",\"pages\":\"Article 102460\"},\"PeriodicalIF\":3.6000,\"publicationDate\":\"2024-08-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Aquacultural Engineering\",\"FirstCategoryId\":\"97\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0144860924000712\",\"RegionNum\":2,\"RegionCategory\":\"农林科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"AGRICULTURAL ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Aquacultural Engineering","FirstCategoryId":"97","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0144860924000712","RegionNum":2,"RegionCategory":"农林科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"AGRICULTURAL ENGINEERING","Score":null,"Total":0}
CLIP-FSSC: A transferable visual model for fish and shrimp species classification based on natural language supervision
Fish and shrimp species classification is a practical need in aquaculture. Traditional classification methods extract visual features from individual images and train a model on downstream datasets, an approach that requires manually annotated image data and significant training time. To address these issues, this paper introduces CLIP-FSSC (Contrastive Language-Image Pre-training for Fish and Shrimp Species Classification), a method for zero-shot prediction using a pre-trained model. The proposed method classifies aquaculture fish and shrimp species with a multimodal pre-trained model that uses semantic text descriptions as the supervision signal for image transfer learning. On the downstream fish dataset, we use natural-language labels for three fish species: grass carp, common carp, and silver carp. We extract text category features with a transformer and compare three CLIP-based backbones for the image modality: Vision Transformer (ViT), ResNet50, and ResNet101. We also compare these models against previous methods that performed well on this task. Zero-shot prediction on samples of the three fish species achieves classification accuracy similar to, or better than, that of models trained on the downstream fish datasets: our experiments reach an accuracy of 98.77% with no additional training required. This demonstrates that using the semantic text modality as the label for the image modality can effectively classify fish species. To show the method's effectiveness on other aquaculture species, we collected two shrimp datasets, prawn and Cambarus, and achieve a highest zero-shot classification accuracy of 92.00% on them. Overall, our results demonstrate that a multimodal pre-trained model using semantic text descriptions as the image supervision signal for transfer learning can classify fish and shrimp species with high accuracy while reducing the need for manual annotation and training time.
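To make the zero-shot pipeline concrete, here is a minimal sketch of CLIP-based species classification in the spirit of CLIP-FSSC, written against the Hugging Face transformers CLIP implementation. The checkpoint name, prompt template, and image path are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of CLIP zero-shot species classification (not the paper's
# exact setup): natural-language labels supervise the image modality, so no
# fish-specific training is performed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Pre-trained CLIP with a ViT image backbone (one of the three backbones
# compared in the paper, alongside ResNet50 and ResNet101).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Natural-language category labels act as the supervision signal.
labels = ["grass carp", "common carp", "silver carp"]
prompts = [f"a photo of a {label}" for label in labels]  # assumed template

image = Image.open("fish_sample.jpg")  # hypothetical input image

inputs = processor(text=prompts, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax converts
# them into per-class probabilities for the zero-shot prediction.
probs = outputs.logits_per_image.softmax(dim=-1)
pred = labels[probs.argmax().item()]
print(f"predicted species: {pred} (p = {probs.max().item():.3f})")
```

The ResNet50 and ResNet101 comparisons in the paper correspond to the RN50 and RN101 checkpoints distributed with OpenAI's original `clip` package; swapping the backbone changes only the image encoder, since the text labels stay fixed.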
Journal introduction:
Aquacultural Engineering is concerned with the design and development of effective aquacultural systems for marine and freshwater facilities. The journal aims to apply knowledge gained from basic research that can potentially be translated into commercial operations.
Problems of scale-up and application of research data involve many parameters, both physical and biological, making it difficult to anticipate the interaction between the unit processes and the cultured animals. Aquacultural Engineering aims to develop this bioengineering interface for aquaculture and welcomes contributions in the following areas:
– Engineering and design of aquaculture facilities
– Engineering-based research studies
– Construction experience and techniques
– In-service experience, commissioning, operation
– Materials selection and their uses
– Quantification of biological data and constraints