{"title":"Potential of natural language processing for metadata extraction from environmental scientific publications","authors":"G. Blanchy, Lukas Albrecht, J. Koestel, S. Garré","doi":"10.5194/soil-9-155-2023","DOIUrl":null,"url":null,"abstract":"Abstract. Summarizing information from large bodies of scientific literature is an\nessential but work-intensive task. This is especially true in environmental\nstudies where multiple factors (e.g., soil, climate, vegetation) can\ncontribute to the effects observed. Meta-analyses, studies that\nquantitatively summarize findings of a large body of literature, rely on\nmanually curated databases built upon primary publications. However, given\nthe increasing amount of literature, this manual work is likely to require\nmore and more effort in the future. Natural language processing (NLP)\nfacilitates this task, but it is not clear yet to which extent the\nextraction process is reliable or complete. In this work, we explore three\nNLP techniques that can help support this task: topic modeling, tailored\nregular expressions and the shortest dependency path method. We apply these\ntechniques in a practical and reproducible workflow on two corpora of\ndocuments: the Open Tension-disk\nInfiltrometer Meta-database (OTIM) and the Meta corpus. The OTIM corpus contains the source\npublications of the entries of the OTIM database of near-saturated hydraulic\nconductivity from tension-disk infiltrometer measurements\n(https://github.com/climasoma/otim-db, last access: 1 March 2023). The Meta corpus is constituted of\nall primary studies from 36 selected meta-analyses on the impact of\nagricultural practices on sustainable water management in Europe. 
As a first\nstep of our practical workflow, we identified different topics from the\nindividual source publications of the Meta corpus using topic modeling.\nThis enabled us to distinguish well-researched topics (e.g., conventional\ntillage, cover crops), where meta-analysis would be useful, from neglected\ntopics (e.g., effect of irrigation on soil properties), showing potential\nknowledge gaps. Then, we used tailored regular expressions to extract\ncoordinates, soil texture, soil type, rainfall, disk diameter and tensions\nfrom the OTIM corpus to build a quantitative database. We were able to\nretrieve the respective information with 56 % up to 100 % of all\nrelevant information (recall) and with a precision between 83 % and\n100 %. Finally, we extracted relationships between a set of drivers\ncorresponding to different soil management practices or amendments (e.g.,\n“biochar”, “zero tillage”) and target variables (e.g., “soil\naggregate”, “hydraulic conductivity”, “crop yield”) from the\nsource publications' abstracts of the Meta corpus using the shortest\ndependency path between them. These relationships were further classified\naccording to positive, negative or absent correlations between the driver\nand the target variable. This quickly provided an overview of the different\ndriver–variable relationships and their abundance for an entire body of\nliterature. Overall, we found that all three tested NLP techniques were able\nto support evidence synthesis tasks. 
While human supervision remains\nessential, NLP methods have the potential to support automated evidence\nsynthesis which can be continuously updated as new publications become\navailable.\n","PeriodicalId":22015,"journal":{"name":"Soil Science","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Soil Science","FirstCategoryId":"97","ListUrlMain":"https://doi.org/10.5194/soil-9-155-2023","RegionNum":4,"RegionCategory":"农林科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Agricultural and Biological Sciences","Score":null,"Total":0}
Citations: 1
Abstract
Summarizing information from large bodies of scientific literature is an
essential but work-intensive task. This is especially true in environmental
studies where multiple factors (e.g., soil, climate, vegetation) can
contribute to the effects observed. Meta-analyses, studies that
quantitatively summarize findings of a large body of literature, rely on
manually curated databases built upon primary publications. However, given
the increasing amount of literature, this manual work is likely to require
more and more effort in the future. Natural language processing (NLP)
can facilitate this task, but it is not yet clear to what extent the
extraction process is reliable or complete. In this work, we explore three
NLP techniques that can help support this task: topic modeling, tailored
regular expressions and the shortest dependency path method. We apply these
techniques in a practical and reproducible workflow on two corpora of
documents: the Open Tension-disk Infiltrometer Meta-database (OTIM)
and the Meta corpus. The OTIM corpus contains the source
publications of the entries of the OTIM database of near-saturated hydraulic
conductivity from tension-disk infiltrometer measurements
(https://github.com/climasoma/otim-db, last access: 1 March 2023). The Meta corpus consists of
all primary studies from 36 selected meta-analyses on the impact of
agricultural practices on sustainable water management in Europe. As a first
step of our practical workflow, we identified different topics from the
individual source publications of the Meta corpus using topic modeling.
This enabled us to distinguish well-researched topics (e.g., conventional
tillage, cover crops), where meta-analysis would be useful, from neglected
topics (e.g., effect of irrigation on soil properties), showing potential
knowledge gaps. Then, we used tailored regular expressions to extract
coordinates, soil texture, soil type, rainfall, disk diameter and tensions
from the OTIM corpus to build a quantitative database. We were able to
retrieve between 56 % and 100 % of all relevant information (recall),
with a precision between 83 % and
100 %. Finally, we extracted relationships between a set of drivers
corresponding to different soil management practices or amendments (e.g.,
“biochar”, “zero tillage”) and target variables (e.g., “soil
aggregate”, “hydraulic conductivity”, “crop yield”) from the
abstracts of the source publications in the Meta corpus using the shortest
dependency path between them. These relationships were further classified
according to positive, negative or absent correlations between the driver
and the target variable. This quickly provided an overview of the different
driver–variable relationships and their abundance for an entire body of
literature. Overall, we found that all three tested NLP techniques were able
to support evidence synthesis tasks. While human supervision remains
essential, NLP methods have the potential to support automated evidence
synthesis which can be continuously updated as new publications become
available.
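
The regular-expression step and its recall and precision scores can be illustrated with a minimal Python sketch. This is not the authors' code: the pattern `DISK_DIAMETER`, the helper names, the sample sentences and the curated values are all hypothetical, chosen only to show how a tailored expression can be precise while still missing some phrasings.

```python
import re

# Hypothetical pattern (an assumption, not taken from the paper):
# a number followed by "cm", optionally "diameter", then "disk" or "disc".
DISK_DIAMETER = re.compile(
    r"(\d+(?:\.\d+)?)\s*cm\s+(?:diameter\s+)?dis[kc]", re.IGNORECASE
)


def extract_diameters(text):
    """Return the set of disk diameters (in cm) matched in the text."""
    return {float(m.group(1)) for m in DISK_DIAMETER.finditer(text)}


def precision_recall(extracted, expected):
    """Score extracted values against manually curated values."""
    true_pos = len(extracted & expected)
    # Precision: share of extracted values that are correct.
    precision = true_pos / len(extracted) if extracted else 1.0
    # Recall: share of curated values that were found.
    recall = true_pos / len(expected) if expected else 1.0
    return precision, recall


# Made-up sentences standing in for publication text.
sample = (
    "Infiltration was measured with a 20 cm diameter disk infiltrometer. "
    "A second device with a disc of 4.5 cm was used on the slopes."
)
expected = {20.0, 4.5}  # values a human curator would record (illustrative)

extracted = extract_diameters(sample)
precision, recall = precision_recall(extracted, expected)
print(f"extracted={extracted} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the pattern finds only the first phrasing, so every extracted value is correct but half of the curated values are missed. Widening the pattern to cover more phrasings would raise recall, possibly at the cost of false matches.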
About the journal:
Cessation. Soil Science satisfies the professional needs of all scientists and laboratory personnel involved in soil and plant research by publishing primary research reports and critical reviews of basic and applied soil science, especially as it relates to soil and plant studies and general environmental soil science.
Each month, Soil Science presents authoritative research articles from an impressive array of disciplines: soil chemistry and biochemistry, physics, fertility and nutrition, soil genesis and morphology, soil microbiology and mineralogy. Of immediate relevance to soil scientists, both industrial and academic, this unique publication also has long-range value for agronomists and environmental scientists.