Publication Type Tagging using Transformer Models and Multi-Label Classification.

AMIA ... Annual Symposium proceedings. AMIA Symposium. Pub Date: 2025-05-22. eCollection Date: 2024-01-01.

Joe D Menke, Halil Kilicoglu, Neil R Smalheiser

Volume 2024, pages 818-827. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12099436/pdf/


Abstract




Indexing articles by publication type and study design is essential for efficient search and filtering of the biomedical literature, but it is understudied compared to indexing by MeSH topical terms. In this study, we leveraged the human-curated publication types and study designs in PubMed to generate a dataset of more than 1.2M articles (titles and abstracts) and used state-of-the-art Transformer-based models to tag publication types and study designs automatically. Specifically, we trained PubMedBERT-based models using a multi-label classification approach and explored undersampling, feature verbalization, and contrastive learning to improve model performance. Our results show that PubMedBERT provides a strong baseline for publication type and study design indexing; undersampling, feature verbalization, and unsupervised contrastive loss have a positive impact on performance, whereas supervised contrastive learning degrades it. We obtained the best overall performance with 80% undersampling and feature verbalization (0.632 macro-F1, 0.969 macro-AUC). The model outperformed previous models (MultiTagger) across all metrics, and the performance difference was statistically significant (p < 0.001). Despite its stronger performance, the model still has room for improvement, and future work could explore features based on full text as well as model interpretability. We make our data and code available at https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/AMIA.
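For readers unfamiliar with the evaluation setup, the following is a minimal sketch of what multi-label tagging and macro-F1 amount to. It is not the authors' code: the three-label layout, the toy logits, the 0.5 threshold, and the function names `predict_labels` and `macro_f1` are all assumptions made for illustration. Each label gets an independent sigmoid decision, so an article can carry several tags at once, and macro-F1 averages the per-label F1 scores with equal weight, so rare publication types count as much as common ones.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Multi-label decision: threshold each label independently.

    The 0.5 threshold is an illustrative assumption, not a value
    taken from the paper.
    """
    return [[1 if sigmoid(z) >= threshold else 0 for z in row] for row in logits]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # A label with no positives and no predictions scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Hypothetical toy data: 4 articles, 3 candidate tags per article.
y_true = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
logits = [
    [2.0, -1.0, -3.0],
    [-2.0, 1.5, -1.0],
    [0.5, -0.5, -0.2],
    [1.0, -2.0, 3.0],
]

y_pred = predict_labels(logits)
score = macro_f1(y_true, y_pred)
print(y_pred)
print(round(score, 4))
```

In this toy example the per-label F1 scores are 0.8, 1.0, and about 0.667, which is why the macro average is sensitive to the weakest label.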

