Jianping Liu, Jialu Xing, Guomin Zhou, Jian Wang, Lulu Sun, Xi Chen
Computers and Electronics in Agriculture, Volume 237, Article 110661. Published 2025-06-27 (journal article). DOI: 10.1016/j.compag.2025.110661. Impact Factor: 8.9; JCR Q1 (Agriculture, Multidisciplinary); CAS Region 1 (Agricultural and Forestry Sciences). Article page: https://www.sciencedirect.com/science/article/pii/S0168169925007677
Transfer large models to crop pest recognition—A cross-modal unified framework for parameters efficient fine-tuning
Crop pest recognition is an important direction in agricultural research and is of great significance for improving crop yield and for scientifically classifying pests in precision agriculture. Traditional deep-learning pest recognition typically trains dedicated models on single categories, single scenes, and unimodal information, and can achieve excellent performance in those narrow settings. However, such models have a weak foundation of general knowledge and transfer poorly, and unimodal information is of limited use for recognizing pests against varied backgrounds and across different life stages. In recent years, transferring the general knowledge of large pre-trained models (LPTMs) to specific domains through full fine-tuning has become an effective solution; however, full fine-tuning requires massive data and computational resources to adapt all parameters effectively. This paper therefore proposes a cross-modal parameter-efficient fine-tuning (PEFT) unified framework for crop pest recognition, with the multimodal large model CLIP as the pre-trained backbone. The proposed method employs CLIP as the encoder for both the image and text modalities, introducing the Dual-(PAL)^G model. First, learnable prompt sequences are embedded in the input or hidden layers of the encoder. Second, multimodal LoRA modules are placed in parallel with the dimension-expansion fully connected layers. Then, a Gate unit integrates the three PEFT methods (Prompt, Adapter, and LoRA) to enhance learning ability. We also designed the GSC-Adapter and the parameter-efficient Light-GCS-Adapter for cross-modal semantic information fusion.
To verify the effectiveness of the method, we conducted extensive experiments on public crop pest recognition datasets. On the public fine-grained dataset IP102, the method surpasses ViT and Swin Transformer using only 66% of the training samples. On the wolfberry pest dataset WPIT9K, using only about 15% of the samples, it surpasses the previous state-of-the-art model ITF-WPI, achieving 98% accuracy. It also performs strongly on eight general tasks. This study provides a new technical solution for agricultural pest recognition: it efficiently transfers the general knowledge of a multimodal LPTM to the specific pest recognition domain under few-sample conditions, introducing only a minimal number of additional parameters, and the method generalizes to cross-modal recognition tasks. The code for this study will be posted on GitHub (https://github.com/VcRenOne/Dual--PAL-G).
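The abstract names two of the framework's ingredients without giving their exact form: a LoRA branch running in parallel with a frozen fully connected layer, and a gate that blends the outputs of several PEFT branches. As a rough illustration of those two ideas only (the shapes, the zero-initialization, and the softmax gate are common conventions assumed here, not the authors' Dual-(PAL)^G design), a minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 16, 2  # r << min(d_in, d_out): low-rank bottleneck

# Frozen pretrained weight (kept fixed during fine-tuning).
W = rng.standard_normal((d_out, d_in))

# LoRA: a trainable low-rank update B @ A added in parallel with W.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))  # zero init: adapted layer starts identical to W

def lora_linear(x):
    """y = W x + (B A) x; only A and B would receive gradients."""
    return W @ x + B @ (A @ x)

def gate(branches, logits):
    """Blend branch outputs with softmax weights (a simple gating scheme)."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return sum(wi * bi for wi, bi in zip(w, branches))

x = rng.standard_normal(d_in)
# Because B is zero-initialized, the LoRA branch contributes nothing yet,
# so the adapted layer reproduces the frozen layer exactly at step 0.
assert np.allclose(lora_linear(x), W @ x)
```

In a real PEFT setup the branches fed to `gate` would be the Prompt, Adapter, and LoRA outputs, and the gate logits would be learned jointly with them while the backbone stays frozen.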
About the journal:
Computers and Electronics in Agriculture provides international coverage of advancements in computer hardware, software, electronic instrumentation, and control systems applied to agricultural challenges. Encompassing agronomy, horticulture, forestry, aquaculture, and animal farming, the journal publishes original papers, reviews, and applications notes. It explores the use of computers and electronics in plant or animal agricultural production, covering topics like agricultural soils, water, pests, controlled environments, and waste. The scope extends to on-farm post-harvest operations and relevant technologies, including artificial intelligence, sensors, machine vision, robotics, networking, and simulation modeling. Its companion journal, Smart Agricultural Technology, continues the focus on smart applications in production agriculture.