Potential of natural language processing for metadata extraction from environmental scientific publications

Zone 4 (Agricultural and Forestry Sciences); JCR Q2 (Agricultural and Biological Sciences)

Soil Science. Pub Date: 2023-03-14. DOI: 10.5194/soil-9-155-2023

G. Blanchy, Lukas Albrecht, J. Koestel, S. Garré


Citations: 1

Abstract

Summarizing information from large bodies of scientific literature is an essential but work-intensive task. This is especially true in environmental studies, where multiple factors (e.g., soil, climate, vegetation) can contribute to the effects observed. Meta-analyses, studies that quantitatively summarize the findings of a large body of literature, rely on manually curated databases built upon primary publications. However, given the increasing amount of literature, this manual work is likely to require more and more effort in the future. Natural language processing (NLP) facilitates this task, but it is not yet clear to what extent the extraction process is reliable or complete. In this work, we explore three NLP techniques that can help support this task: topic modeling, tailored regular expressions and the shortest dependency path method. We apply these techniques in a practical and reproducible workflow on two corpora of documents: the Open Tension-disk Infiltrometer Meta-database (OTIM) corpus and the Meta corpus. The OTIM corpus contains the source publications of the entries of the OTIM database of near-saturated hydraulic conductivity from tension-disk infiltrometer measurements (https://github.com/climasoma/otim-db, last access: 1 March 2023). The Meta corpus consists of all primary studies from 36 selected meta-analyses on the impact of agricultural practices on sustainable water management in Europe. As a first step of our practical workflow, we identified different topics from the individual source publications of the Meta corpus using topic modeling. This enabled us to distinguish well-researched topics (e.g., conventional tillage, cover crops), where meta-analysis would be useful, from neglected topics (e.g., effect of irrigation on soil properties), showing potential knowledge gaps. Then, we used tailored regular expressions to extract coordinates, soil texture, soil type, rainfall, disk diameter and tensions from the OTIM corpus to build a quantitative database. We retrieved between 56 % and 100 % of all relevant information (recall), with a precision between 83 % and 100 %. Finally, we extracted relationships between a set of drivers corresponding to different soil management practices or amendments (e.g., “biochar”, “zero tillage”) and target variables (e.g., “soil aggregate”, “hydraulic conductivity”, “crop yield”) from the abstracts of the Meta corpus source publications using the shortest dependency path between them. These relationships were further classified as positive, negative or absent correlations between the driver and the target variable. This quickly provided an overview of the different driver–variable relationships and their abundance for an entire body of literature. Overall, we found that all three tested NLP techniques were able to support evidence synthesis tasks. While human supervision remains essential, NLP methods have the potential to support automated evidence synthesis which can be continuously updated as new publications become available.
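To make the regular-expression step and its evaluation more concrete, the following is a minimal Python sketch of how a tailored pattern might pull annual rainfall values out of text and how recall and precision could then be computed against a manually curated reference. The pattern, the function names, the example sentences and the curated values are all illustrative assumptions; they are not the expressions or data used by the authors.

```python
import re

# Illustrative pattern (an assumption, not the authors' expression):
# "annual rainfall/precipitation", up to 20 non-digit characters,
# then a number followed by the unit "mm".
RAINFALL = re.compile(
    r"annual\s+(?:rainfall|precipitation)\D{0,20}?(\d+(?:\.\d+)?)\s*mm",
    re.IGNORECASE,
)


def extract_rainfall(text):
    """Return all annual rainfall values (in mm) matched in the text."""
    return [float(m.group(1)) for m in RAINFALL.finditer(text)]


def precision_recall(extracted, relevant):
    """Compare extracted items with manually curated ones, treated as sets."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up sentences standing in for publication text.
docs = {
    "doc1": "The mean annual rainfall is 650 mm and the soil is a silt loam.",
    "doc2": "Average annual precipitation of 812.5 mm was recorded at the site.",
    "doc3": "The site receives about 900 mm of rain per year.",  # phrasing the pattern misses
}
# Hypothetical manually curated reference values.
curated = {("doc1", 650.0), ("doc2", 812.5), ("doc3", 900.0)}

extracted = {
    (name, value) for name, text in docs.items() for value in extract_rainfall(text)
}
precision, recall = precision_recall(extracted, curated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy example every extracted value is correct, but one of the three curated values is phrased in a way the pattern does not anticipate, so precision is high while recall is lower. This is the kind of trade-off the recall and precision ranges in the abstract describe.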

Source journal

Soil Science (Agricultural and Forestry Sciences, Soil Science)

CiteScore: 2.70

Self-citation rate: 0.00%

Review time: 4.4 months

Journal description: Cessation. Soil Science satisfies the professional needs of all scientists and laboratory personnel involved in soil and plant research by publishing primary research reports and critical reviews of basic and applied soil science, especially as it relates to soil and plant studies and general environmental soil science. Each month, Soil Science presents authoritative research articles from an impressive array of disciplines: soil chemistry and biochemistry, physics, fertility and nutrition, soil genesis and morphology, soil microbiology and mineralogy. Of immediate relevance to soil scientists, both industrial and academic, this unique publication also has long-range value for agronomists and environmental scientists.