{"title":"TDA-GLM: text data augmentation for aquaculture disease prevention and control via a small model-guided ChatGLM","authors":"Shihan Qiao, Hong Yu, Jing Song, Lixin Zhang, Huiyuan Zhao, Wei Huang","doi":"10.1007/s10499-025-01945-6","DOIUrl":null,"url":null,"abstract":"<div><p>Disease prevention and control is crucial for the healthy development of the aquaculture industry, and high-quality data are the foundation for intelligent disease management. However, textual data in this field are scarce and variable in quality. The direct use of large language models (LLMs) for data augmentation often results in poor-quality outputs. In this paper, we propose a small-sample data augmentation framework called TDA-GLM. This framework employs a strategy that combines a high-quality corpus, pretrained large model, and efficient prompts, which improves data augmentation by integrating high-quality corpora and refining prompt techniques. To address the poor adaptability of LLMs in aquaculture disease prevention and control, which leads to low-quality data augmentation, we introduce a supervised learning model. This model guides the understanding of domain knowledge and optimizes data augmentation tasks within the LLM, enabling the generation of data that are both highly similar to the original professional material and rich in content. Additionally, we design a noise removal module that filters out noise by analyzing the consistency between the augmented data and the data augmentation target, thereby increasing the overall quality of the augmented data. To verify the domain reliability of our data augmentation approach, we conducted a few-shot learning classification task experiment. The results demonstrated significant improvements by our model over existing advanced text data augmentation techniques, with the key performance indicators Acc, P, R, and F1 reaching 94.86%, 95.55%, 94.86%, and 94.62%, respectively. Furthermore, in the experiments assessing the quality of the augmented samples, the similarity and enrichment degrees reached 94.95% and 64.41%, respectively. These results indicate that our augmented samples are of very high quality, ensuring that the core semantics of the original data are preserved while increasing data diversity through appropriate variations.</p></div>","PeriodicalId":8122,"journal":{"name":"Aquaculture International","volume":"33 4","pages":""},"PeriodicalIF":2.2000,"publicationDate":"2025-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Aquaculture International","FirstCategoryId":"97","ListUrlMain":"https://link.springer.com/article/10.1007/s10499-025-01945-6","RegionNum":3,"RegionCategory":"农林科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"FISHERIES","Score":null,"Total":0}
Citations: 0
Abstract
Disease prevention and control is crucial for the healthy development of the aquaculture industry, and high-quality data are the foundation of intelligent disease management. However, textual data in this field are scarce and variable in quality, and directly using large language models (LLMs) for data augmentation often yields poor-quality outputs. In this paper, we propose a small-sample data augmentation framework called TDA-GLM. The framework combines a high-quality corpus, a pretrained large model, and efficient prompts, improving data augmentation by integrating high-quality corpora and refining prompt techniques. To address the poor adaptability of LLMs to aquaculture disease prevention and control, which leads to low-quality augmentation, we introduce a supervised learning model that guides the LLM's understanding of domain knowledge and optimizes the data augmentation task, enabling the generation of data that are both highly similar to the original professional material and rich in content. Additionally, we design a noise removal module that filters out noise by analyzing the consistency between the augmented data and the data augmentation target, thereby increasing the overall quality of the augmented data. To verify the domain reliability of our data augmentation approach, we conducted a few-shot learning classification experiment. The results show that our model significantly outperforms existing advanced text data augmentation techniques, with the key performance indicators Acc, P, R, and F1 reaching 94.86%, 95.55%, 94.86%, and 94.62%, respectively. Furthermore, in experiments assessing the quality of the augmented samples, the similarity and enrichment degrees reached 94.95% and 64.41%, respectively. These results indicate that our augmented samples are of very high quality: the core semantics of the original data are preserved while data diversity is increased through appropriate variations.
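The abstract does not specify how the noise removal module measures consistency between an augmented sample and the augmentation target, so the following is only a minimal illustrative sketch, not the authors' method: it assumes cosine similarity over TF-IDF vectors as the consistency measure and a fixed threshold for filtering, both of which are hypothetical choices.

```python
# Minimal sketch of a consistency-based noise filter for augmented text.
# Assumptions (not from the paper): TF-IDF cosine similarity stands in for the
# consistency measure, and candidates below a fixed threshold are discarded.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def filter_augmented(original: str, augmented: list[str], threshold: float = 0.6) -> list[str]:
    """Keep augmented texts whose similarity to the original exceeds the threshold."""
    vectorizer = TfidfVectorizer()
    # Fit on the original plus all candidates so they share one vocabulary.
    vectors = vectorizer.fit_transform([original] + augmented)
    similarities = cosine_similarity(vectors[0], vectors[1:]).ravel()
    return [text for text, score in zip(augmented, similarities) if score >= threshold]


if __name__ == "__main__":
    source = "White spot disease in shrimp is prevented by maintaining stable water temperature."
    candidates = [
        "Keeping water temperature stable helps prevent white spot disease in farmed shrimp.",
        "Our new restaurant serves grilled shrimp every weekend.",
    ]
    print(filter_augmented(source, candidates))  # the off-topic sentence is filtered out
```

In practice, a domain-adapted embedding model (or the paper's supervised small model) would likely replace TF-IDF, but the filtering logic, scoring each augmented candidate against the augmentation target and discarding low-consistency samples, follows the same pattern.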
Journal description:
Aquaculture International is an international journal publishing original research papers, short communications, technical notes and review papers on all aspects of aquaculture.
The Journal covers topics such as the biology, physiology, pathology and genetics of cultured fish, crustaceans, molluscs and plants, especially new species; water quality of supply systems, fluctuations in water quality within farms and the environmental impacts of aquacultural operations; nutrition, feeding and stocking practices, especially as they affect the health and growth rates of cultured species; sustainable production techniques; bioengineering studies on the design and management of offshore and land-based systems; the improvement of quality and marketing of farmed products; sociological and societal impacts of aquaculture, and more.
This is the official Journal of the European Aquaculture Society.