Comparison of pipelines, seq2seq models, and LLMs for rare disease information extraction.

Shashank Gupta, Xuguang Ai, Yuhang Jiang, Ramakanth Kavuluru
DOI: 10.1007/978-3-031-97141-9_4
Journal: Natural language processing and information systems: International Conference on Applications of Natural Language to Information Systems (NLDB), revised papers
Volume 15836, pages 49-63
Published: 2026-01-01 (Epub 2025-07-01)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12367198/pdf/
Citations: 0

Abstract

End-to-end relation extraction (E2ERE) is an important application of natural language processing (NLP) in biomedicine. The extracted relations populate knowledge graphs and drive higher-level applications in knowledge discovery and information retrieval. E2ERE is frequently handled at the sentence level and involves continuous entities. A more complex setting is document-level E2ERE with discontinuous and overlapping/nested entities. We identified a recently introduced RE dataset for rare diseases (RareDis) that has these complex traits. Among current E2ERE methods, we see three well-known paradigms: (1) pipeline-based approaches, where a named entity recognition (NER) model's output is input to a relation classification (RC) model; (2) joint sequence-to-sequence models, where the raw input text is directly transformed into relations through linearization schemas; and (3) generative large language models (LLMs), where prompts, fine-tuning, and in-context learning are leveraged for RE. While LLMs are becoming popular because of tools such as ChatGPT, the biomedical NLP community needs to carefully evaluate which paradigm is more suitable for E2ERE. In this effort, using the RareDis dataset as a complex use case, we evaluate the best representative models from each of the three paradigms for E2ERE. Our findings reveal that pipeline models are still the best, while sequence-to-sequence models are not far behind. We verify these findings on a second E2ERE dataset for chemical-protein interactions. Although LLMs are more suitable for zero-shot settings, our results show that it is better to work with more conventional models trained and tailored for E2ERE when training data is available. Ours is also the first effort to conduct E2ERE on the RareDis dataset.
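To make the pipeline paradigm from the abstract concrete, the following is a minimal sketch of its control flow: an NER step proposes typed entity spans, and an RC step labels candidate entity pairs. Both components here are stubbed with hypothetical rule-based placeholders (the gazetteer, entity types, and the `produces` relation label are illustrative inventions, not taken from the paper or the RareDis schema); real systems would use trained neural models for each stage.

```python
from itertools import permutations

def ner(text):
    """Stub NER: return (mention, entity_type) pairs found in the text.

    A real pipeline would run a trained NER model; here a tiny
    hypothetical gazetteer stands in for it.
    """
    gazetteer = {"cystic fibrosis": "DISEASE", "chronic cough": "SYMPTOM"}
    lowered = text.lower()
    return [(m, t) for m, t in gazetteer.items() if m in lowered]

def relation_classifier(head, tail):
    """Stub RC: label a (head, tail) entity pair, or None for no relation."""
    if head[1] == "DISEASE" and tail[1] == "SYMPTOM":
        return "produces"  # illustrative relation label
    return None

def extract_relations(text):
    """Pipeline E2ERE: NER output feeds the relation classifier."""
    entities = ner(text)
    triples = []
    # Consider every ordered pair of distinct entities as a candidate.
    for head, tail in permutations(entities, 2):
        label = relation_classifier(head, tail)
        if label is not None:
            triples.append((head[0], label, tail[0]))
    return triples

print(extract_relations("Cystic fibrosis often produces a chronic cough."))
# [('cystic fibrosis', 'produces', 'chronic cough')]
```

The key property this sketch shows is the paradigm's error propagation: any mention the NER stage misses can never surface as a relation, which is one reason the paper evaluates pipelines against joint seq2seq models that emit linearized relations directly from raw text.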
