Investigating Cross-Domain Binary Relation Classification in Biomedical Natural Language Processing
Alberto Purpura, Natasha Mulligan, Uri Kartoun, Eileen Koski, Vibha Anand, Joao Bettencourt-Silva
AMIA Joint Summits on Translational Science Proceedings, published 2024-05-31 (eCollection 2024)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11141837/pdf/
Abstract
This paper addresses the challenge of binary relation classification in biomedical Natural Language Processing (NLP), focusing on diverse domains including gene-disease associations, compound-protein interactions, and social determinants of health (SDOH). We evaluate different approaches, including fine-tuning Bidirectional Encoder Representations from Transformers (BERT) models and generative Large Language Models (LLMs), and examine their performance in zero- and few-shot settings. We also introduce a novel dataset of biomedical text annotated with social and clinical entities to facilitate research into relation classification. Our results underscore the continued complexity of this task for both humans and models. BERT-based models trained on domain-specific data excelled in certain domains and achieved performance and generalization power comparable to generative LLMs in others. Despite these encouraging results, these models still fall far short of human-level performance. We also highlight the importance of high-quality training data and domain-specific fine-tuning for the performance of all the models considered.
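To illustrate the kind of approach the abstract describes, the sketch below fine-tunes a BERT-style encoder for binary relation classification using Hugging Face Transformers. This is not the authors' code: the base model, the entity-marker convention, the toy examples, and the hyperparameters are all illustrative assumptions, not details from the paper.

```python
# Minimal sketch of binary relation classification via BERT fine-tuning.
# Assumptions (not from the paper): bert-base-uncased as the encoder,
# [E1]/[E2] entity markers, a two-example toy batch, lr=2e-5, 3 steps.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # a biomedical encoder could be swapped in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Each sentence mentions two marked entities; the label indicates whether the
# target relation (e.g., a gene-disease association) holds between them.
texts = [
    "Mutations in [E1] BRCA1 [/E1] increase the risk of [E2] breast cancer [/E2].",
    "[E1] Aspirin [/E1] is mentioned near [E2] hypertension [/E2] without any link.",
]
labels = torch.tensor([1, 0])  # 1 = relation present, 0 = no relation

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few optimization steps on the toy batch
    out = model(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)
print(probs)  # per-example probabilities over {no relation, relation}
```

In practice, training would iterate over a full annotated corpus with a held-out evaluation split; the toy batch here only demonstrates the sentence-pair-to-binary-label setup.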