Roselyn Gabud, Nelson Pampolina, Vladimir Mariano, Riza Batista-Navarro
{"title":"基于变换的信息提取管道从文本中提取生殖条件和栖息地信息","authors":"Roselyn Gabud, Nelson Pampolina, Vladimir Mariano, Riza Batista-Navarro","doi":"10.3897/biss.7.112505","DOIUrl":null,"url":null,"abstract":"Understanding the biology underpinning the natural regeneration of plant species in order to make plans for effective reforestation is a complex task. This can be aided by providing access to databases that contain long-term and wide-scale geographical information on species distribution, habitat, and reproduction. Although there exists widely-used biodiversity databases that contain structured information on species and their occurrences, such as the Global Biodiversity Information Facility (GBIF) and the Atlas of Living Australia (ALA), the bulk of knowledge about biodiversity still remains embedded in textual documents. Unstructured information can be made more accessible and useful for large-scale studies if there are tools and services that automatically extract meaningful information from text and store it in structured formats, e.g., open biodiversity databases, ready to be consumed for analysis (Thessen et al. 2022). We aim to enrich biodiversity occurrence databases with information on species reproductive condition and habitat, derived from text. In previous work, we developed unsupervised approaches to extract related habitats and their locations, and related reproductive condition and temporal expressions (Gabud and Batista-Navarro 2018). We built a new unsupervised hybrid approach for relation extraction (RE), which is a combination of classical rule-based pattern-matching methods and transformer-based language models that framed our RE task as a natural language inference (NLI) task. Using our hybrid approach for RE, we were able to extract related biodiversity entities from text even without a large training dataset. In this work, we implement an information extraction (IE) pipeline comprised of a named entity recognition (NER) tool and our hybrid relation extraction (RE) tool. The NER tool is a transformer-based language model that was pretrained on scientific text and then fine-tuned using COPIOUS (Conserving Philippine Biodiversity by Understanding big data; Nguyen et al. 2019), a gold standard corpus containing named entities relevant to species occurrence. We applied the NER tool to automatically annotate geographical location, temporal expression and habitat information contained within sentences. A dictionary-based approach is then used to identify mentions of reproductive conditions in text (e.g., phrases such as \"fruited heavily\" and \"mass flowering\"). We then use our hybrid RE tool to extract reproductive condition - temporal expression and habitat - geographical location entity pairs. We test our IE pipeline on the forestry compendium available in the CABI Digital Library (Centre for Agricultural and Biosciences International), and show that our work enables the enrichment of descriptive information on reproductive and habitat conditions of species. This work is a step towards enhancing a biodiversity database with the inclusion of habitat and reproductive condition information extracted from text.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Extracting Reproductive Condition and Habitat Information from Text Using a Transformer-based Information Extraction Pipeline\",\"authors\":\"Roselyn Gabud, Nelson Pampolina, Vladimir Mariano, Riza Batista-Navarro\",\"doi\":\"10.3897/biss.7.112505\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Understanding the biology underpinning the natural regeneration of plant species in order to make plans for effective reforestation is a complex task. This can be aided by providing access to databases that contain long-term and wide-scale geographical information on species distribution, habitat, and reproduction. Although there exists widely-used biodiversity databases that contain structured information on species and their occurrences, such as the Global Biodiversity Information Facility (GBIF) and the Atlas of Living Australia (ALA), the bulk of knowledge about biodiversity still remains embedded in textual documents. Unstructured information can be made more accessible and useful for large-scale studies if there are tools and services that automatically extract meaningful information from text and store it in structured formats, e.g., open biodiversity databases, ready to be consumed for analysis (Thessen et al. 2022). We aim to enrich biodiversity occurrence databases with information on species reproductive condition and habitat, derived from text. In previous work, we developed unsupervised approaches to extract related habitats and their locations, and related reproductive condition and temporal expressions (Gabud and Batista-Navarro 2018). We built a new unsupervised hybrid approach for relation extraction (RE), which is a combination of classical rule-based pattern-matching methods and transformer-based language models that framed our RE task as a natural language inference (NLI) task. Using our hybrid approach for RE, we were able to extract related biodiversity entities from text even without a large training dataset. In this work, we implement an information extraction (IE) pipeline comprised of a named entity recognition (NER) tool and our hybrid relation extraction (RE) tool. The NER tool is a transformer-based language model that was pretrained on scientific text and then fine-tuned using COPIOUS (Conserving Philippine Biodiversity by Understanding big data; Nguyen et al. 2019), a gold standard corpus containing named entities relevant to species occurrence. We applied the NER tool to automatically annotate geographical location, temporal expression and habitat information contained within sentences. A dictionary-based approach is then used to identify mentions of reproductive conditions in text (e.g., phrases such as \\\"fruited heavily\\\" and \\\"mass flowering\\\"). We then use our hybrid RE tool to extract reproductive condition - temporal expression and habitat - geographical location entity pairs. We test our IE pipeline on the forestry compendium available in the CABI Digital Library (Centre for Agricultural and Biosciences International), and show that our work enables the enrichment of descriptive information on reproductive and habitat conditions of species. This work is a step towards enhancing a biodiversity database with the inclusion of habitat and reproductive condition information extracted from text.\",\"PeriodicalId\":9011,\"journal\":{\"name\":\"Biodiversity Information Science and Standards\",\"volume\":\"25 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Biodiversity Information Science and Standards\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3897/biss.7.112505\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biodiversity Information Science and Standards","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3897/biss.7.112505","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
了解植物物种自然再生的生物学基础,以便制定有效的再造林计划是一项复杂的任务。这可以通过提供对包含物种分布、栖息地和繁殖的长期和大规模地理信息的数据库的访问来帮助实现。尽管存在广泛使用的生物多样性数据库,其中包含物种及其发生的结构化信息,如全球生物多样性信息设施(GBIF)和澳大利亚生活地图集(ALA),但关于生物多样性的大部分知识仍然嵌入文本文件中。如果有工具和服务可以自动从文本中提取有意义的信息,并将其存储为结构化格式,例如开放的生物多样性数据库,准备用于分析,那么非结构化信息可以更容易获取,对大规模研究也更有用(Thessen et al. 2022)。我们的目标是丰富生物多样性发生数据库与物种的生殖条件和生境信息,源自文本。在之前的工作中,我们开发了无监督的方法来提取相关的栖息地及其位置,以及相关的生殖状况和时间表达(Gabud and Batista-Navarro 2018)。我们构建了一种新的无监督混合方法用于关系提取(RE),该方法结合了经典的基于规则的模式匹配方法和基于转换器的语言模型,将我们的关系提取任务框架为自然语言推理(NLI)任务。使用我们的混合方法,即使没有大型训练数据集,我们也能够从文本中提取相关的生物多样性实体。在这项工作中,我们实现了一个由命名实体识别(NER)工具和我们的混合关系提取(RE)工具组成的信息提取(IE)管道。NER工具是一种基于转换器的语言模型,该模型在科学文本上进行预训练,然后使用COPIOUS(通过理解大数据保护菲律宾生物多样性;Nguyen et al. 2019),这是一个包含与物种发生相关的命名实体的金标准语料库。应用NER工具对句子中包含的地理位置、时态表达和栖息地信息进行自动标注。然后使用基于词典的方法来识别文本中提到的生殖条件(例如,短语,如“大量开花”和“大量开花”)。然后,我们使用混合RE工具提取生殖条件-时间表达和栖息地-地理位置实体对。我们在CABI数字图书馆(国际农业和生物科学中心)提供的林业纲要上测试了我们的IE管道,并表明我们的工作能够丰富物种繁殖和栖息地条件的描述性信息。这项工作是朝着加强生物多样性数据库迈出的一步,其中包括从文本中提取的生境和生殖条件信息。
Extracting Reproductive Condition and Habitat Information from Text Using a Transformer-based Information Extraction Pipeline
Understanding the biology underpinning the natural regeneration of plant species in order to make plans for effective reforestation is a complex task. This can be aided by providing access to databases that contain long-term and wide-scale geographical information on species distribution, habitat, and reproduction. Although there exists widely-used biodiversity databases that contain structured information on species and their occurrences, such as the Global Biodiversity Information Facility (GBIF) and the Atlas of Living Australia (ALA), the bulk of knowledge about biodiversity still remains embedded in textual documents. Unstructured information can be made more accessible and useful for large-scale studies if there are tools and services that automatically extract meaningful information from text and store it in structured formats, e.g., open biodiversity databases, ready to be consumed for analysis (Thessen et al. 2022). We aim to enrich biodiversity occurrence databases with information on species reproductive condition and habitat, derived from text. In previous work, we developed unsupervised approaches to extract related habitats and their locations, and related reproductive condition and temporal expressions (Gabud and Batista-Navarro 2018). We built a new unsupervised hybrid approach for relation extraction (RE), which is a combination of classical rule-based pattern-matching methods and transformer-based language models that framed our RE task as a natural language inference (NLI) task. Using our hybrid approach for RE, we were able to extract related biodiversity entities from text even without a large training dataset. In this work, we implement an information extraction (IE) pipeline comprised of a named entity recognition (NER) tool and our hybrid relation extraction (RE) tool. The NER tool is a transformer-based language model that was pretrained on scientific text and then fine-tuned using COPIOUS (Conserving Philippine Biodiversity by Understanding big data; Nguyen et al. 2019), a gold standard corpus containing named entities relevant to species occurrence. We applied the NER tool to automatically annotate geographical location, temporal expression and habitat information contained within sentences. A dictionary-based approach is then used to identify mentions of reproductive conditions in text (e.g., phrases such as "fruited heavily" and "mass flowering"). We then use our hybrid RE tool to extract reproductive condition - temporal expression and habitat - geographical location entity pairs. We test our IE pipeline on the forestry compendium available in the CABI Digital Library (Centre for Agricultural and Biosciences International), and show that our work enables the enrichment of descriptive information on reproductive and habitat conditions of species. This work is a step towards enhancing a biodiversity database with the inclusion of habitat and reproductive condition information extracted from text.