Comparison of pipelines, seq2seq models, and LLMs for rare disease information extraction
Shashank Gupta, Xuguang Ai, Yuhang Jiang, Ramakanth Kavuluru
Natural Language Processing and Information Systems (NLDB), vol. 15836, pp. 49-63
DOI: 10.1007/978-3-031-97141-9_4 | Epub: July 1, 2025
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12367198/pdf/
Abstract
End-to-end relation extraction (E2ERE) is an important application of natural language processing (NLP) in biomedicine. The extracted relations populate knowledge graphs and drive higher-level applications in knowledge discovery and information retrieval. E2ERE is frequently handled at the sentence level and involves contiguous entities. A more complex setting is document-level E2ERE with discontinuous and overlapping/nested entities. We identified a recently introduced RE dataset for rare diseases (RareDis) that has these complex traits. Among current E2ERE methods, we see three well-known paradigms: (1) pipeline-based approaches, where a named entity recognition (NER) model's output is input to a relation classification (RC) model; (2) joint sequence-to-sequence models, where the raw input text is directly transformed into relations through linearization schemas; and (3) generative large language models (LLMs), where prompting, fine-tuning, and in-context learning are leveraged for RE. While LLMs are becoming popular thanks to tools such as ChatGPT, the biomedical NLP community needs to carefully evaluate which paradigm is more suitable for E2ERE. In this effort, using the RareDis dataset as a complex use case, we evaluate the best representative models from each of the three paradigms. Our findings reveal that pipeline models are still the best, while sequence-to-sequence models are not far behind. We verify these findings on a second E2ERE dataset for chemical-protein interactions. Although LLMs are more suitable for zero-shot settings, our results show that when training data is available, it is better to work with more conventional models trained and tailored for E2ERE. Ours is also the first effort to conduct E2ERE on the RareDis dataset.
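To make the pipeline paradigm concrete, the sketch below shows the control flow the abstract describes: an NER step proposes entity mentions, and an RC step labels candidate entity pairs. Everything here is illustrative, not the paper's method — the gazetteer lookup, the rule-based classifier, and the example entity/relation labels (`DISEASE`, `SIGN`, `produces`) are stand-ins for trained models.

```python
# Minimal, hypothetical sketch of pipeline-based E2ERE:
# NER output feeds a relation classification (RC) step.
# Real systems replace both stubs with trained neural models.

from itertools import permutations

# Toy "NER model": a gazetteer lookup standing in for a trained tagger.
GAZETTEER = {
    "Marfan syndrome": "DISEASE",
    "aortic aneurysm": "SIGN",
}

def ner(text):
    """Return (mention, entity_type) pairs found in the text."""
    return [(m, t) for m, t in GAZETTEER.items() if m in text]

# Toy "RC model": assign a relation label to an ordered entity pair,
# or None if no relation is predicted.
def relation_classifier(head, tail):
    if head[1] == "DISEASE" and tail[1] == "SIGN":
        return "produces"
    return None

def pipeline_e2ere(text):
    """End-to-end extraction: enumerate entity pairs from the NER
    output and keep the pairs the RC step assigns a label to."""
    entities = ner(text)
    triples = []
    for head, tail in permutations(entities, 2):
        label = relation_classifier(head, tail)
        if label:
            triples.append((head[0], label, tail[0]))
    return triples

print(pipeline_e2ere("Marfan syndrome may lead to an aortic aneurysm."))
# → [('Marfan syndrome', 'produces', 'aortic aneurysm')]
```

The pairwise enumeration also shows why NER errors compound in pipelines: a missed or spurious mention changes the whole candidate-pair set handed to the RC step, which is the usual argument for the joint seq2seq alternative.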