Laura Gómez-Zamanillo, Nagore Portilla, Artzai Picón, Itziar Egusquiza, Ramón Navarra-Mestre, Andoni Elola, Arantza Bereciartua-Perez
Frontiers in Plant Science, vol. 16, article 1546756. Published 2025-05-14 (eCollection 2025). DOI: 10.3389/fpls.2025.1546756. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12116677/pdf/
Reducing annotation effort in agricultural data: simple and fast unsupervised coreset selection with DINOv2 and K-means.
The need for large amounts of annotated data is a major obstacle to adopting deep learning in agricultural applications, where annotation is typically time-consuming and requires expert knowledge. To address this issue, methods have been developed to select data for manual annotation that represent the variability present in the dataset, thereby avoiding redundant information. Coreset selection methods aim to choose a small subset of samples that best represents the entire dataset; they can therefore be used to select a reduced set of samples for annotation, optimizing the training of a deep learning model for the best possible performance. In this work, we propose a simple yet effective coreset selection method that combines the recent foundation model DINOv2, used as a powerful feature extractor, with the well-known K-means clustering algorithm. Samples are drawn from each computed cluster to form the final coreset. The proposed method is validated by comparing the performance metrics of a multiclass classification model trained on datasets reduced either randomly or with the proposed method. This validation is conducted on two different datasets; in both cases the proposed method achieves better results, with improvements of up to 0.15 in F1 score under significant reductions of the training data. Additionally, the importance of using DINOv2 as the feature extractor in achieving these results is studied.
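The pipeline the abstract describes (embed images, cluster the embeddings with K-means, pick representatives per cluster) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses scikit-learn's KMeans on synthetic placeholder vectors standing in for DINOv2 embeddings, and it assumes a simple selection rule (samples nearest each centroid); the paper's exact per-cluster selection strategy may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def coreset_select(features, n_clusters, per_cluster=1, seed=0):
    """Cluster feature vectors with K-means and, from each cluster, keep the
    samples closest to the centroid. Returns sorted indices into `features`."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)
    selected = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]          # members of cluster c
        if idx.size == 0:
            continue
        # Distance of each member to its centroid; keep the closest ones.
        dists = np.linalg.norm(features[idx] - km.cluster_centers_[c], axis=1)
        selected.extend(idx[np.argsort(dists)[:per_cluster]])
    return sorted(selected)

# Synthetic stand-in for DINOv2 embeddings: three well-separated blobs of 50
# vectors each (in practice these would come from a frozen DINOv2 backbone).
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(loc=m, scale=0.1, size=(50, 8))
                   for m in (0.0, 5.0, 10.0)])
core = coreset_select(feats, n_clusters=3)
```

The selected indices would then be the only images sent for manual annotation; with `per_cluster` controlling the annotation budget per mode of variability in the data.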
Journal introduction:
In an ever-changing world, plant science is of the utmost importance for securing the future well-being of humankind. Plants provide oxygen, food, feed, fibers, and building materials. In addition, they are a diverse source of industrial and pharmaceutical chemicals. Plants are centrally important to the health of ecosystems, and understanding them is critical for learning how to manage and maintain a sustainable biosphere. Plant science is extremely interdisciplinary, reaching from agricultural science to paleobotany, and from molecular physiology to ecology. It uses the latest developments in computer science, optics, molecular biology, and genomics to address challenges in model systems, agricultural crops, and ecosystems. Plant science research inquires into the form, function, development, diversity, reproduction, evolution, and uses of both higher and lower plants and their interactions with other organisms throughout the biosphere. Frontiers in Plant Science welcomes outstanding contributions in any field of plant science, from basic to applied research, from organismal to molecular studies, from single-plant analysis to studies of populations and whole ecosystems, and from molecular to biophysical to computational approaches.
Frontiers in Plant Science publishes articles on the most outstanding discoveries across a wide research spectrum of Plant Science. The mission of Frontiers in Plant Science is to bring all relevant Plant Science areas together on a single platform.