Leveraging Multimodality for Biodiversity Data: Exploring joint representations of species descriptions and specimen images using CLIP

Maya Sahraoui, Youcef Sklab, Marc Pignal, Régine Vignes Lebbe, Vincent Guigue

Biodiversity Information Science and Standards | Published 2023-09-14 | https://doi.org/10.3897/biss.7.112666
Abstract
In recent years, the field of biodiversity data analysis has witnessed significant advancements, with a number of models emerging to process and extract valuable insights from various data sources. One notable area of progress is the analysis of species descriptions, where structured knowledge extraction techniques have gained prominence. These techniques aim to automatically extract relevant information, such as taxonomic classifications and morphological traits, from unstructured text (Sahraoui et al. 2022, Sahraoui et al. 2023). By applying natural language processing (NLP) and machine learning methods, structured knowledge extraction converts textual species descriptions into a structured format, facilitating the integration, searchability, and analysis of biodiversity data.

Furthermore, object detection on specimen images has emerged as a powerful tool in biodiversity research. By leveraging computer vision algorithms (Triki et al. 2020, Triki et al. 2021, Ott et al. 2020), researchers can automatically identify and classify objects of interest within specimen images, such as organs, anatomical features, or specific taxa. Object detection techniques allow for the efficient and accurate extraction of valuable information, contributing to tasks such as species identification, morphological trait analysis, and biodiversity monitoring. These advances are particularly significant for herbarium collections and digitization efforts, where large volumes of specimen images need to be processed and analyzed.

Multimodal learning, an emerging field in artificial intelligence (AI), focuses on developing models that can effectively process and learn from multiple modalities, such as text and images (Li et al. 2020, Li et al. 2021, Li et al. 2019, Radford et al. 2021, Sun et al. 2021, Chen et al. 2022). By incorporating information from different modalities, multimodal learning aims to capture the rich and complementary characteristics present in diverse data sources, allowing a model to exploit the strengths of each modality and yielding better understanding, improved performance, and more comprehensive representations. Structured knowledge extraction from species descriptions and object detection on specimen images are therefore complementary: structured information extracted from descriptions improves the search, classification, and correlation of biodiversity data, while object detection enriches textual descriptions with visual evidence for verifying and validating species characteristics.

To tackle the challenges posed by the massive volume of specimen images available at the Herbarium of the National Museum of Natural History in Paris, we have chosen to use the CLIP (Contrastive Language-Image Pretraining) model (Radford et al. 2021) developed by OpenAI. CLIP uses a contrastive learning framework to learn joint representations of text and images. The model is pretrained on a large-scale dataset of text-image pairs from the internet, enabling it to capture the semantic relationships between textual descriptions and visual content. Fine-tuning CLIP on our dataset of species descriptions and specimen images is crucial for adapting it to our domain: by exposing the model to our data, we enhance its ability to understand and represent biodiversity characteristics. A sketch of this fine-tuning step is given below.
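As a concrete illustration, the following Python sketch shows one way such domain fine-tuning could be set up with the Hugging Face `transformers` implementation of CLIP. The checkpoint name, the toy dataset wrapper, the file names, and the hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch: fine-tuning CLIP on (specimen image, species description) pairs
# with the symmetric contrastive (InfoNCE) loss. All paths and texts are placeholders.
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed input format: a list of (image_path, description) pairs.
records = [
    ("P00612345.jpg", "Leaves opposite, lanceolate, margin entire; petiole winged."),
]

class SpecimenPairs(Dataset):
    """Pairs a specimen image with its corresponding textual description."""
    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        path, text = self.records[idx]
        return Image.open(path).convert("RGB"), text

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def collate(batch):
    images, texts = zip(*batch)
    # Descriptions are truncated to CLIP's 77-token text context window.
    text_inputs = processor.tokenizer(list(texts), padding=True,
                                      truncation=True, return_tensors="pt")
    image_inputs = processor.image_processor(list(images), return_tensors="pt")
    return {**text_inputs, **image_inputs}

loader = DataLoader(SpecimenPairs(records), batch_size=32,
                    shuffle=True, collate_fn=collate)

model.train()
for batch in loader:
    # return_loss=True computes the contrastive loss that pulls matched
    # image/text embeddings together and pushes mismatched pairs apart.
    outputs = model(**batch, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice, long herbarium descriptions would need to be split or summarised to fit CLIP's short text context, which is one of the adaptation questions raised by this kind of domain fine-tuning.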
Concretely, fine-tuning involves training the model on our labeled dataset, allowing it to refine its representations and adapt to biodiversity patterns. Using the fine-tuned CLIP model, we aim to develop an efficient search engine for the Herbarium's vast biodiversity collection: users query the engine with morphological keywords, and it matches these textual queries against specimen images to return relevant results. This research aligns with the current trajectory of AI for biodiversity data, paving the way for innovative approaches to the conservation and understanding of our planet's biodiversity.
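To make the retrieval step concrete, the sketch below shows one way the fine-tuned encoders could back such a search engine: specimen images are embedded once, and a free-text morphological query is embedded at search time and ranked by cosine similarity. The checkpoint, file names, and query string are hypothetical placeholders, not the Herbarium's actual index.

```python
# Minimal sketch: text-to-image retrieval over precomputed CLIP image embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Swap in the fine-tuned checkpoint here; the base model is used as a placeholder.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def index_images(paths):
    """Embed specimen images once and L2-normalise for cosine similarity."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor.image_processor(images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def search(query, image_feats, paths, k=5):
    """Rank indexed specimen images against a morphological keyword query."""
    inputs = processor.tokenizer([query], padding=True, truncation=True,
                                 return_tensors="pt")
    q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    scores = (q @ image_feats.T).squeeze(0)   # cosine similarities
    top = scores.topk(k=min(k, len(paths)))
    return [(paths[i], scores[i].item()) for i in top.indices]

# Hypothetical usage:
# paths = ["P00612345.jpg", "P00612346.jpg"]
# feats = index_images(paths)
# print(search("opposite lanceolate leaves, winged petiole", feats, paths))
```

For a collection at the scale of the Paris Herbarium, the normalised image embeddings would typically be stored in an approximate nearest-neighbour index rather than compared by brute force as above.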