The Role of the CLIP Model in Analysing Herbarium Specimen Images

Vamsi Krishna Kommineni, Jens Kattge, Jitendra Gaikwad, Susanne Tautenhahn, Birgitta Koenig-Ries
{"title":"The Role of the CLIP Model in Analysing Herbarium Specimen Images","authors":"Vamsi Krishna Kommineni, Jens Kattge, Jitendra Gaikwad, Susanne Tautenhahn, Birgitta Koenig-ries","doi":"10.3897/biss.7.112566","DOIUrl":null,"url":null,"abstract":"The number of openly-accessible digital plant specimen images is growing tremendously and available through data aggregators: Global Biodiversity Information Facility (GBIF) contains 43.2 million images, and Intergrated Digitized Biocollections (iDigBio) contains 32.4 million images (Accessed on 29.06.2023). All these images contain great ecological (morphological, phenological, taxonomic etc.) information, which has the potential to facilitate the conduct of large-scale analyses. However, extracting this information from these images and making it available to analysis tools remains challenging and requires more advanced computer vision algorithms. With the latest advancements in the natural language processing field, it is becoming possible to analyse images with text prompts. For example, with the Contrastive Language-Image Pre-Training (CLIP) model, which was trained on 400 million image-text pairs, it is feasible to classify day-to-day life images by providing different text prompts and an image as an input to the model, then the model can predict the most suitable text prompt for the input image. We explored the feasibility of using the CLIP model to analyse digital plant specimen images. A particular focus of this study was on the generation of appropriate text prompts. This is important as the prompt has a large influence on the results of the model. We experimented with three different methods: a) automatic text prompt based on metadata of the specific image or other datasets, b) automatic generic text prompt of the image (describing what is in the image) and c) manual text prompt by annotating the image. We investigated the suitability of these prompts with an experiment, where we tested whether the CLIP model could recognize a herbarium specimen image using digital plant specimen images and semantically disparate text prompts. Our ultimate goal is to filter the digital plant specimen images based on the availability of intact leaves and measurement scale to reduce the number of specimens that reach the downstream pipeline, for instance, the segmentation task for the leaf trait extraction process. To achieve the goal, we are fine-tuning the CLIP model with a dataset of around 20,000 digital plant specimen image-text prompt pairs, where the text prompts were generated using different datasets, metadata and generic text prompt methods. Since the text prompts can be created automatically, it is possible to eradicate the laborious manual annotating process. In conclusion, we present our experimental testing of the CLIP model on digital plant specimen images with varied settings and how the CLIP model can act as a potential filtering tool. 
In future, we plan to investigate the possibility of using text prompts to do the instance segmentation to extract leaf trait information using Large Language Models (LLMs).","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biodiversity Information Science and Standards","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3897/biss.7.112566","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The number of openly accessible digital plant specimen images is growing tremendously, and these images are available through data aggregators: the Global Biodiversity Information Facility (GBIF) contains 43.2 million images and Integrated Digitized Biocollections (iDigBio) contains 32.4 million images (accessed on 29.06.2023). These images hold rich ecological (morphological, phenological, taxonomic, etc.) information with the potential to facilitate large-scale analyses. However, extracting this information from the images and making it available to analysis tools remains challenging and requires advanced computer vision algorithms. With the latest advancements in natural language processing, it is becoming possible to analyse images with text prompts. For example, the Contrastive Language-Image Pre-Training (CLIP) model, which was trained on 400 million image-text pairs, can classify everyday images: given an image and a set of candidate text prompts as input, the model predicts the prompt that best matches the image. We explored the feasibility of using the CLIP model to analyse digital plant specimen images. A particular focus of this study was the generation of appropriate text prompts, since the prompt strongly influences the model's results. We experimented with three methods: a) automatic text prompts based on the metadata of the specific image or other datasets, b) automatic generic text prompts describing what is in the image, and c) manual text prompts created by annotating the image. We investigated the suitability of these prompts in an experiment that tested whether the CLIP model could recognize a herbarium specimen image, using digital plant specimen images and semantically disparate text prompts. Our ultimate goal is to filter digital plant specimen images by the presence of intact leaves and a measurement scale, reducing the number of specimens that reach downstream steps such as the segmentation task in the leaf trait extraction process. To achieve this goal, we are fine-tuning the CLIP model on a dataset of around 20,000 digital plant specimen image-text prompt pairs, with the text prompts generated using the dataset-, metadata- and generic-prompt methods above. Since these text prompts can be created automatically, the laborious manual annotation process can be eliminated. In conclusion, we present our experimental testing of the CLIP model on digital plant specimen images under varied settings and show how the CLIP model can act as a potential filtering tool. In future, we plan to investigate the possibility of using text prompts for instance segmentation to extract leaf trait information using Large Language Models (LLMs).
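To make the zero-shot setup concrete, the following is a minimal sketch (not the authors' code) of CLIP-based prompt scoring using the Hugging Face transformers implementation of CLIP. The checkpoint, file name, candidate prompts, and the 0.5 probability threshold are illustrative assumptions.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # A publicly released CLIP checkpoint; the abstract does not state which variant was used.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    model.eval()

    # Hypothetical specimen image and candidate prompts, in the spirit of the
    # filtering task (intact leaves, measurement scale) described above.
    image = Image.open("specimen.jpg")
    prompts = [
        "a herbarium specimen with intact leaves and a measurement scale",
        "a herbarium specimen with damaged or missing leaves",
        "a photograph that is not a herbarium specimen",
    ]

    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # One similarity score per (image, prompt) pair; softmax turns the scores
    # into probabilities over the candidate prompts.
    probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
    best = int(probs.argmax())
    print(f"best prompt: {prompts[best]!r} (p={probs[best]:.2f})")

    # A simple filter: keep the specimen only if the "intact leaves and scale"
    # prompt wins with sufficient confidence (threshold chosen arbitrarily here).
    keep = best == 0 and probs[best] > 0.5

In this setting the prompts act as class labels, so swapping in prompts generated from metadata, generic descriptions, or manual annotations requires no change to the model itself.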
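The fine-tuning step could likewise look roughly as follows, again under stated assumptions: the pairs list is a hypothetical stand-in for the ~20,000 image-text prompt pairs, and the batch size and learning rate are illustrative, not the authors' configuration.

    import torch
    from typing import List, Tuple
    from PIL import Image
    from torch.utils.data import DataLoader
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

    # Placeholder for the image-text prompt pairs, e.g. prompts generated
    # automatically from specimen metadata as described above.
    pairs: List[Tuple[Image.Image, str]] = []

    def collate(batch):
        images, texts = zip(*batch)
        return processor(text=list(texts), images=list(images),
                         return_tensors="pt", padding=True)

    loader = DataLoader(pairs, batch_size=32, shuffle=True, collate_fn=collate)

    model.train()
    for batch in loader:
        # return_loss=True makes CLIP compute its symmetric image-text
        # contrastive loss over the in-batch pairs.
        loss = model(**batch, return_loss=True).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()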