Classifying literature mentions of biological pathogens as experimentally studied using natural language processing.

IF 1.6, Zone 3 (Engineering Technology), JCR Q3, Mathematical & Computational Biology
Antonio Jose Jimeno Yepes, Karin Verspoor
Journal of Biomedical Semantics, vol. 14, no. 1, p. 1. Published 2023-01-31. DOI: 10.1186/s13326-023-00282-y (https://doi.org/10.1186/s13326-023-00282-y). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9889128/pdf/
Citations: 2

Abstract

Background: Information pertaining to the mechanisms, management and treatment of disease-causing pathogens, including viruses and bacteria, is readily available from research publications indexed in MEDLINE. However, identifying the literature that specifically characterises these pathogens and their properties based on experimental research, which is important for understanding the molecular basis of the diseases these agents cause, requires sifting through a large number of articles to exclude incidental mentions of the pathogens and references to pathogens in non-experimental contexts such as public health.

Objective: In this work, we lay the foundations for the development of automatic methods for characterising mentions of pathogens in scientific literature, focusing on the task of identifying research that involves the experimental study of a pathogen. No manually annotated pathogen corpora are available for this purpose, although such resources are necessary to support the development of machine learning-based models. We therefore aim to fill this gap by producing a large data set automatically from MEDLINE, under some simplifying assumptions for the task definition, and using it to explore automatic methods that specifically support the detection of experimentally studied pathogen mentions in research publications.

Methods: We automatically developed a literature data set for pathogen mention characterisation, READBiomed-Pathogens, using NCBI resources, and we make it available. Resources such as the NCBI Taxonomy, MeSH and GenBank can be used effectively to identify relevant literature about experimentally researched pathogens; more specifically, MeSH is used to link experimentally researched pathogens to MEDLINE citations, including their titles and abstracts. We experiment with several machine learning-based natural language processing (NLP) algorithms, using this data set as training data, to model the task of detecting papers that specifically describe the experimental study of a pathogen.
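
To make the idea of automatically derived labels concrete, the following is a minimal sketch of one plausible weak-labelling rule, not the authors' documented procedure. It assumes that a pathogen mentioned in a citation's text is treated as experimentally studied when the citation is also indexed with that pathogen's MeSH descriptor, and as incidental otherwise. The `Citation` class, the `weak_labels` function, the `pathogen_terms` lookup and the example citation are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    pmid: str            # hypothetical identifier, not a real PMID
    text: str            # title and abstract concatenated
    mesh_headings: set   # MeSH descriptors assigned to the citation


def weak_labels(citation, pathogen_terms):
    """Return {descriptor: label} for every pathogen mentioned in the text.

    pathogen_terms maps a pathogen's MeSH descriptor to surface names
    used to spot it in text (a toy stand-in for a taxonomy lookup).
    Label 1: mentioned and indexed with the descriptor (assumed studied).
    Label 0: mentioned but not indexed (assumed incidental).
    """
    text = citation.text.lower()
    labels = {}
    for descriptor, names in pathogen_terms.items():
        # Simple case-insensitive substring match; real systems would
        # need proper entity recognition and normalisation.
        if any(name.lower() in text for name in names):
            labels[descriptor] = int(descriptor in citation.mesh_headings)
    return labels


# Toy lookup table (assumption): descriptor -> names to search for.
pathogen_terms = {
    "Escherichia coli": ["Escherichia coli", "E. coli"],
    "Zika Virus": ["Zika virus", "ZIKV"],
}

citation = Citation(
    pmid="example-1",
    text="We characterised an E. coli efflux pump. Unlike Zika virus, ...",
    mesh_headings={"Escherichia coli", "Membrane Transport Proteins"},
)

labels = weak_labels(citation, pathogen_terms)
print(labels)  # {'Escherichia coli': 1, 'Zika Virus': 0}
```

Each (citation, pathogen, label) triple produced this way could then serve as a training example for a text classifier. The labels are noisy by construction, which is why a small manually annotated set remains useful for evaluation.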

Results: We show that our data set, READBiomed-Pathogens, can be used to explore natural language processing configurations for experimental pathogen mention characterisation. READBiomed-Pathogens includes citations related to organisms such as bacteria and viruses, as well as a small number of toxins and other disease-causing agents.

Conclusions: We studied the characterisation of experimentally studied pathogens in scientific literature, developing several natural language processing methods supported by an automatically developed data set. As a core contribution of the work, we presented a methodology to automatically construct a data set for pathogen identification using existing biomedical resources. The data set and the annotation code are made publicly available. The performance of the pathogen mention identification and characterisation algorithms was additionally evaluated on a small manually annotated data set; this evaluation shows that the data set we have generated allows characterising pathogens of interest.
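
As an illustration of how predictions might be scored against a small manually annotated set, the sketch below computes precision, recall and F1 for binary labels. The abstract does not state which metrics were reported, so the choice of metrics, the function name `precision_recall_f1` and the toy labels are assumptions.

```python
def precision_recall_f1(gold, predicted):
    """Score binary predictions (1 = experimentally studied) against gold labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy example (made-up labels): manual annotations vs. classifier output.
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three predicted positives are correct and two of the three gold positives are found, so precision, recall and F1 all equal 2/3.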

Trial registration: N/A.

Source journal
Journal of Biomedical Semantics (Mathematical & Computational Biology)
CiteScore: 4.20
Self-citation rate: 5.30%
Articles published: 28
Review time: 30 weeks
About the journal: Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas. Infrastructure for biomedical semantics: focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability. Semantic mining, annotation, and analysis: focusing on approaches and applications of semantic resources, and on tools for investigation, reasoning, prediction, and discoveries in biomedicine.