Automatic classification of experimental models in biomedical literature to support searching for alternative methods to animal experiments.

IF 1.6, Zone 3 (Engineering Technology), Q3, MATHEMATICAL & COMPUTATIONAL BIOLOGY
Mariana Neves, Antonina Klippert, Fanny Knöspel, Juliane Rudeck, Ailine Stolz, Zsofia Ban, Markus Becker, Kai Diederich, Barbara Grune, Pia Kahnau, Nils Ohnesorge, Johannes Pucher, Gilbert Schönfelder, Bettina Bert, Daniel Butzke
Journal of Biomedical Semantics, vol. 14, no. 1, p. 13. Published 2023-09-01. DOI: https://doi.org/10.1186/s13326-023-00292-w
Citations: 0

Abstract

Current animal protection laws require that animal experiments be replaced with alternative methods whenever such methods are suitable for reaching the intended scientific objective. However, searching for alternative methods in the scientific literature is a time-consuming task that requires careful screening of a very large number of experimental biomedical publications. The identification of potentially relevant methods, e.g. organ or cell culture models, or computer simulations, can be supported with text mining tools built specifically for this purpose. Such tools are trained (or fine-tuned) on relevant data sets labeled by human experts. We developed the GoldHamster corpus, composed of 1,600 PubMed (Medline) articles (titles and abstracts), in which we manually identified the experimental model used according to a set of eight labels, namely: "in vivo", "organs", "primary cells", "immortal cell lines", "invertebrates", "humans", "in silico" and "other" (models). We recruited 13 annotators with expertise in the biomedical domain and assigned each article to two individuals. Four additional rounds of annotation aimed to improve the quality of the annotations on which the annotators had disagreed in the first round. Furthermore, we conducted various supervised machine learning experiments to evaluate the corpus for our classification task. We obtained more than 7,000 document-level annotations for the above labels. After the first round of annotation, the inter-annotator agreement (kappa coefficient) varied among labels, ranging from 0.42 (for "other") to 0.82 (for "invertebrates"), with an overall score of 0.62. All disagreements were resolved in the subsequent rounds of annotation. The best-performing machine learning experiment fine-tuned the pre-trained PubMedBERT model on our corpus and achieved an overall F-score of 0.83.
We obtained a corpus with high agreement for all labels, and our evaluation demonstrated that the corpus is suitable for training reliable predictive models for the automatic classification of biomedical literature according to the experimental models used. Our SMAFIRA ("Smart feature-based interactive") search tool ( https://smafira.bf3r.de ) will employ this classifier to support the retrieval of alternative methods to animal experiments. The corpus is available for download ( https://doi.org/10.5281/zenodo.7152295 ), as well as the source code ( https://github.com/mariananeves/goldhamster ) and the model ( https://huggingface.co/SMAFIRA/goldhamster ).
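
The per-label agreement figures above can be illustrated with a small sketch. The abstract does not say which kappa variant was computed, so this example assumes Cohen's kappa for two annotators who each decide whether a single label (for example "in vivo") applies to each article. The function name `cohen_kappa` and the toy annotation lists are hypothetical and are not taken from the GoldHamster source code or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators rating the same items.

    Assumption: one categorical decision per item per annotator,
    e.g. 1 if the label applies to the article and 0 otherwise.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must rate the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: fraction of items with identical decisions.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal frequencies,
    # summed over all categories either annotator used.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical decisions of two annotators for one label on ten articles.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy case the annotators agree on 8 of 10 articles (observed agreement 0.80), while the agreement expected by chance is 0.52, so kappa is (0.80 - 0.52) / (1 - 0.52), roughly 0.58. Kappa discounts the agreement that would arise by chance, which is one reason a rarely assigned label can score lower than its raw agreement rate would suggest.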


Source journal
Journal of Biomedical Semantics (MATHEMATICAL & COMPUTATIONAL BIOLOGY)
CiteScore: 4.20
Self-citation rate: 5.30%
Articles published: 28
Review time: 30 weeks
About the journal: Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas. Infrastructure for biomedical semantics: semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability. Semantic mining, annotation, and analysis: approaches and applications of semantic resources, and tools for investigation, reasoning, prediction, and discoveries in biomedicine.