{"title":"TDA-GLM: text data augmentation for aquaculture disease prevention and control via a small model-guided ChatGLM","authors":"Shihan Qiao, Hong Yu, Jing Song, Lixin Zhang, Huiyuan Zhao, Wei Huang","doi":"10.1007/s10499-025-01945-6","DOIUrl":null,"url":null,"abstract":"<div><p>Disease prevention and control is crucial for the healthy development of the aquaculture industry, and high-quality data are the foundation for intelligent disease management. However, textual data in this field are scarce and variable in quality. The direct use of large language models (LLMs) for data augmentation often results in poor-quality outputs. In this paper, we propose a small-sample data augmentation framework called TDA-GLM. This framework employs a strategy that combines a high-quality corpus, pretrained large model, and efficient prompts, which improves data augmentation by integrating high-quality corpora and refining prompt techniques. To address the poor adaptability of LLMs in aquaculture disease prevention and control, which leads to low-quality data augmentation, we introduce a supervised learning model. This model guides the understanding of domain knowledge and optimizes data augmentation tasks within the LLM, enabling the generation of data that are both highly similar to the original professional material and rich in content. Additionally, we design a noise removal module that filters out noise by analyzing the consistency between the augmented data and the data augmentation target, thereby increasing the overall quality of the augmented data. To verify the domain reliability of our data augmentation approach, we conducted a few-shot learning classification task experiment. The results demonstrated significant improvements by our model over existing advanced text data augmentation techniques, with the key performance indicators Acc, P, R, and F1 reaching 94.86%, 95.55%, 94.86%, and 94.62%, respectively. Furthermore, in the experiments assessing the quality of the augmented samples, the similarity and enrichment degrees reached 94.95% and 64.41%, respectively. These results indicate that our augmented samples are of very high quality, ensuring that the core semantics of the original data are preserved while increasing data diversity through appropriate variations.</p></div>","PeriodicalId":8122,"journal":{"name":"Aquaculture International","volume":"33 4","pages":""},"PeriodicalIF":2.2000,"publicationDate":"2025-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Aquaculture International","FirstCategoryId":"97","ListUrlMain":"https://link.springer.com/article/10.1007/s10499-025-01945-6","RegionNum":3,"RegionCategory":"农林科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"FISHERIES","Score":null,"Total":0}
Citations: 0
Abstract
Disease prevention and control is crucial for the healthy development of the aquaculture industry, and high-quality data are the foundation of intelligent disease management. However, textual data in this field are scarce and variable in quality, and directly using large language models (LLMs) for data augmentation often yields poor-quality outputs. In this paper, we propose a small-sample data augmentation framework called TDA-GLM. The framework combines a high-quality corpus, a pretrained large model, and efficient prompts, improving data augmentation by integrating high-quality corpora and refining prompt techniques. To address the poor adaptability of LLMs to aquaculture disease prevention and control, which leads to low-quality augmentation, we introduce a supervised learning model that guides the LLM's understanding of domain knowledge and optimizes the data augmentation task, enabling the generation of data that are both highly similar to the original professional material and rich in content. Additionally, we design a noise removal module that filters out noise by analyzing the consistency between the augmented data and the data augmentation target, thereby increasing the overall quality of the augmented data. To verify the domain reliability of our data augmentation approach, we conducted a few-shot learning classification experiment. The results show that our model significantly outperforms existing advanced text data augmentation techniques, with the key performance indicators Acc, P, R, and F1 reaching 94.86%, 95.55%, 94.86%, and 94.62%, respectively. Furthermore, in experiments assessing the quality of the augmented samples, the similarity and enrichment degrees reached 94.95% and 64.41%, respectively. These results indicate that our augmented samples are of very high quality: the core semantics of the original data are preserved while data diversity is increased through appropriate variations.
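The abstract does not specify how the noise removal module measures consistency between an augmented sample and the augmentation target, so the following is only a minimal illustrative sketch, not the authors' method: it assumes cosine similarity over TF-IDF vectors as the consistency measure and a fixed threshold for filtering, both of which are hypothetical choices.

```python
# Minimal sketch of a consistency-based noise filter for augmented text.
# Assumptions (not from the paper): TF-IDF cosine similarity stands in for the
# consistency measure, and candidates below a fixed threshold are discarded.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def filter_augmented(original: str, augmented: list[str], threshold: float = 0.6) -> list[str]:
    """Keep augmented texts whose similarity to the original exceeds the threshold."""
    vectorizer = TfidfVectorizer()
    # Fit on the original plus all candidates so they share one vocabulary.
    vectors = vectorizer.fit_transform([original] + augmented)
    similarities = cosine_similarity(vectors[0], vectors[1:]).ravel()
    return [text for text, score in zip(augmented, similarities) if score >= threshold]


if __name__ == "__main__":
    source = "White spot disease in shrimp is prevented by maintaining stable water temperature."
    candidates = [
        "Keeping water temperature stable helps prevent white spot disease in farmed shrimp.",
        "Our new restaurant serves grilled shrimp every weekend.",
    ]
    print(filter_augmented(source, candidates))  # the off-topic sentence is filtered out
```

In practice, a domain-adapted embedding model (or the paper's supervised small model) would likely replace TF-IDF, but the filtering logic, scoring each augmented candidate against the augmentation target and discarding low-consistency samples, follows the same pattern.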
Journal description:
Aquaculture International is an international journal publishing original research papers, short communications, technical notes and review papers on all aspects of aquaculture.
The Journal covers topics such as the biology, physiology, pathology and genetics of cultured fish, crustaceans, molluscs and plants, especially new species; water quality of supply systems, fluctuations in water quality within farms and the environmental impacts of aquacultural operations; nutrition, feeding and stocking practices, especially as they affect the health and growth rates of cultured species; sustainable production techniques; bioengineering studies on the design and management of offshore and land-based systems; the improvement of quality and marketing of farmed products; sociological and societal impacts of aquaculture, and more.
This is the official Journal of the European Aquaculture Society.