Maria Mora-Cross, William Ulate, Brandon Retana Chacón, María Biarreta Portillo, Josué David Castro Ramírez, Jose Chavarria Madriz
Biodiversity Information Science and Standards. Published 2023-09-21. DOI: https://doi.org/10.3897/biss.7.113055
Structuring Information from Plant Morphological Descriptions using Open Information Extraction
Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for research and sustainable management. The number of publications generated is quite large: the corpus of biodiversity literature includes tens of millions of figures and taxonomic treatments. Unfortunately, most taxonomic descriptions exist only as text in scientific publications. Although the Biodiversity Heritage Library (BHL) holds more than 61 million digitized pages, only 467,265 taxonomic treatments are available in the Biodiversity Literature Repository. Obtaining highly structured text from digitized text has been shown to be complex and very expensive (Cui et al. 2021). The scientific community has described over 1.2 million species, but studies suggest that 86% of existing species on Earth and 91% of species in the ocean still await description (Mora et al. 2011).

The published descriptions synthesize observations made by taxonomists over centuries of research and include detailed morphological aspects (i.e., shape and structure) of species. These details are useful for identifying specimens, improving information search mechanisms, analyzing species that share particular characteristics, and comparing species descriptions. To take full advantage of this information and to work towards integrating it with repositories of biodiversity knowledge, the biodiversity informatics community first needs to convert plain text into a machine-processable format. More precisely, there is a need to identify the names of structures and substructures and the characters that describe them (Fig. 1).

Open information extraction (OIE) is a research area of Natural Language Processing (NLP) that aims to automatically extract structured, machine-readable representations of data available in unstructured text; the result is usually handled as n-ary propositions, for instance, triples of the form <noun phrase, relation phrase, noun phrase> (Shen et al. 2022).
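To make the triple form concrete, here is a minimal sketch of how such propositions could be held in memory and grouped by structure. The example sentence, the relation labels, and the names `Triple` and `group_by_structure` are illustrative assumptions; the abstract does not specify the project's actual schema.

```python
from typing import Dict, List, NamedTuple, Tuple


class Triple(NamedTuple):
    """One OIE proposition: <noun phrase, relation phrase, noun phrase>."""
    subject: str   # structure or substructure name
    relation: str  # relation phrase (hypothetical labels below)
    obj: str       # character value describing the structure


# Hypothetical sentence and hand-written extraction, not taken from the project's data.
sentence = "Hojas alternas, láminas elípticas."
triples = [
    Triple("hojas", "disposición", "alternas"),
    Triple("láminas", "forma", "elípticas"),
]


def group_by_structure(items: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Collect the (relation, value) pairs extracted for each structure."""
    grouped: Dict[str, List[Tuple[str, str]]] = {}
    for t in items:
        grouped.setdefault(t.subject, []).append((t.relation, t.obj))
    return grouped
```

Grouping by subject gives one record per plant structure, which is one possible step towards the machine-processable format discussed above.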
OIE is continuously evolving with advancements in NLP and machine learning techniques. The state of the art in OIE involves neural approaches, pre-trained language models, and the integration of dependency parsing and semantic role labeling. Neural solutions mainly formulate OIE as a sequence tagging problem or a sequence generation problem. Ongoing research focuses on improving extraction accuracy; handling complex linguistic phenomena, such as coreference resolution; and making information extraction more open, because most existing neural solutions work on English texts (Zhou et al. 2022).

The main objective of this project is to evaluate and compare the results of automatic data extraction from plant morphological descriptions using pre-trained language models (PLMs) and a language model trained on plant morphological descriptions written in Spanish. The research data were sourced from the species records database of the National Biodiversity Institute of Costa Rica (INBio); specifically, the project selected records of morphological descriptions of plant species written in Spanish.

The system processes the morphological descriptions using a workflow with the following phases: data selection and pre-processing, feature extraction, testing the PLMs, local language model training, and testing and evaluating the results. Fig. 2 shows the general workflow used in this research.

Pre-processing and Annotation: Descriptions were standardized by removing special characters such as double and single quotes, replacing abbreviations, tokenizing text, and applying other transformations. Some records of the dataset were annotated with ground-truth structured information in the form of triples extracted from each paragraph. Additionally, structured data from the project carried out by Mora and Araya (2018) were included in the dataset.
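The standardization described in the pre-processing step could look roughly like the sketch below. The abbreviation table, the tokenization pattern, and the function name `standardize` are assumptions made for illustration; the project's actual rules are not given in the text.

```python
import re
from typing import List

# Hypothetical abbreviation table; the project's real list is not stated.
ABBREVIATIONS = {"ca.": "cerca de", "diám.": "diámetro"}

# Straight and curly double and single quotes to strip.
QUOTES = "\"'“”‘’"


def standardize(text: str) -> List[str]:
    """Remove quotes, expand abbreviations, and split the text into tokens."""
    text = text.translate(str.maketrans("", "", QUOTES))
    for abbr, full in ABBREVIATIONS.items():
        # The lookbehind keeps "ca." from matching inside words such as "blanca."
        text = re.sub(r"(?<!\w)" + re.escape(abbr), full, text)
    # Tokens: numbers (with optional decimal part), words, single punctuation marks.
    return re.findall(r"\d+(?:[.,]\d+)?|\w+|[^\w\s]", text)
```

For example, `standardize('Hojas "simples", ca. 5 cm de diám.')` drops the quotes and expands both abbreviations before tokenizing. In practice a language model's own tokenizer would likely replace the final regular expression.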
Feature extraction: Token vectorization was performed directly by the language models using word embeddings.

Test PLM: The PLMs were evaluated with a zero-shot approach: the models were applied to the test dataset, information was extracted, and the output was compared with the annotated ground truth.

Local Language Model Training: The annotated data were split into 80% training data and 20% test data. A language model based on the Transformers architecture was trained on the training data.

Evaluate results: Evaluation metrics such as precision, recall, and F1 (the harmonic mean of precision and recall, used as a measure of the model's accuracy) were calculated by comparing the extracted information with the ground truth. The results were analyzed to understand the models' performance, identify strengths and weaknesses, and gain insight into their ability to extract accurate and relevant information. Based on this analysis, the models' results were improved iteratively.

The main contributions of this project are: (1) a Transformers-based language model to extract information from morphological descriptions of plants written in Spanish, available on the project website; (2) a corpus of morphological descriptions of plants, written in Spanish, labeled for information extraction, and made available on the project website; and (3) the results of the project, the first of its kind applied to morphological descriptions of plants written in Spanish, published on the project website.
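The 80/20 split and the evaluation metrics mentioned in the workflow can be sketched as follows. This is a minimal illustration, assuming that a predicted triple counts as correct only when it matches a ground-truth triple exactly; the abstract does not state the matching criterion, and the function names and seed are hypothetical.

```python
import random
from typing import Iterable, List, Sequence, Tuple

Triple = Tuple[str, str, str]


def split_80_20(records: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Shuffle the annotated records and split them 80% train / 20% test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


def precision_recall_f1(predicted: Iterable[Triple],
                        gold: Iterable[Triple]) -> Tuple[float, float, float]:
    """Score extracted triples against ground truth using exact matching (an assumption)."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Exact matching is the strictest choice; a softer criterion, such as partial overlap between phrases, would give higher scores on the same output, so the criterion should be reported alongside the numbers.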