Jianping Liu, Jialu Xing, Guomin Zhou, Jian Wang, Lulu Sun, Xi Chen
Computers and Electronics in Agriculture, Volume 237, Article 110661. Published 2025-06-27 (journal article). DOI: 10.1016/j.compag.2025.110661. Impact Factor: 8.9; JCR Q1 (Agriculture, Multidisciplinary); CAS Region 1 (Agricultural and Forestry Sciences). Article page: https://www.sciencedirect.com/science/article/pii/S0168169925007677
Transfer large models to crop pest recognition—A cross-modal unified framework for parameters efficient fine-tuning
Crop pest recognition is an important direction in agricultural research and is of great significance for improving crop yield and for scientifically classifying pests in precision agriculture. Traditional deep-learning pest recognition typically trains dedicated models on single categories, single scenes, and unimodal information, and can achieve excellent performance in those narrow settings. However, such models have a weak foundation of general knowledge and transfer poorly, and unimodal information is of limited use for recognizing pests against varied backgrounds and across different life stages. In recent years, transferring the general knowledge of large pre-trained models (LPTMs) to specific domains through full fine-tuning has become an effective solution; however, full fine-tuning requires massive data and computational resources to adapt all parameters effectively. This paper therefore proposes a cross-modal parameter-efficient fine-tuning (PEFT) unified framework for crop pest recognition, with the multimodal large model CLIP as the pre-trained backbone. The proposed method employs CLIP as the encoder for both the image and text modalities, introducing the Dual-(PAL)^G model. First, learnable prompt sequences are embedded in the input or hidden layers of the encoder. Second, multimodal LoRA modules are placed in parallel with the dimension-expansion fully connected layers. Then, a Gate unit integrates the three PEFT methods (Prompt, Adapter, and LoRA) to enhance learning ability. We also designed the GSC-Adapter and the parameter-efficient Light-GCS-Adapter for cross-modal semantic information fusion.
To verify the effectiveness of the method, we conducted extensive experiments on public crop pest recognition datasets. On the public fine-grained dataset IP102, the method surpasses ViT and Swin Transformer using only 66% of the training samples. On the wolfberry pest dataset WPIT9K, using only about 15% of the samples, it surpasses the previous state-of-the-art model ITF-WPI, achieving 98% accuracy. It also performs strongly on eight general tasks. This study provides a new technical solution for agricultural pest recognition: it efficiently transfers the general knowledge of a multimodal LPTM to the specific pest recognition domain under few-sample conditions, introducing only a minimal number of additional parameters, and the method generalizes to cross-modal recognition tasks. The code for this study will be posted on GitHub (https://github.com/VcRenOne/Dual--PAL-G).
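The abstract names two of the framework's ingredients without giving their exact form: a LoRA branch running in parallel with a frozen fully connected layer, and a gate that blends the outputs of several PEFT branches. As a rough illustration of those two ideas only (the shapes, the zero-initialization, and the softmax gate are common conventions assumed here, not the authors' Dual-(PAL)^G design), a minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 16, 2  # r << min(d_in, d_out): low-rank bottleneck

# Frozen pretrained weight (kept fixed during fine-tuning).
W = rng.standard_normal((d_out, d_in))

# LoRA: a trainable low-rank update B @ A added in parallel with W.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))  # zero init: adapted layer starts identical to W

def lora_linear(x):
    """y = W x + (B A) x; only A and B would receive gradients."""
    return W @ x + B @ (A @ x)

def gate(branches, logits):
    """Blend branch outputs with softmax weights (a simple gating scheme)."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return sum(wi * bi for wi, bi in zip(w, branches))

x = rng.standard_normal(d_in)
# Because B is zero-initialized, the LoRA branch contributes nothing yet,
# so the adapted layer reproduces the frozen layer exactly at step 0.
assert np.allclose(lora_linear(x), W @ x)
```

In a real PEFT setup the branches fed to `gate` would be the Prompt, Adapter, and LoRA outputs, and the gate logits would be learned jointly with them while the backbone stays frozen.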
About the journal:
Computers and Electronics in Agriculture provides international coverage of advancements in computer hardware, software, electronic instrumentation, and control systems applied to agricultural challenges. Encompassing agronomy, horticulture, forestry, aquaculture, and animal farming, the journal publishes original papers, reviews, and applications notes. It explores the use of computers and electronics in plant or animal agricultural production, covering topics like agricultural soils, water, pests, controlled environments, and waste. The scope extends to on-farm post-harvest operations and relevant technologies, including artificial intelligence, sensors, machine vision, robotics, networking, and simulation modeling. Its companion journal, Smart Agricultural Technology, continues the focus on smart applications in production agriculture.