Enhancing Oncology-Specific Question Answering With Large Language Models Through Fine-Tuned Embeddings With Synthetic Data

Kun-Han Lu, Sina Mehdinia, Kingson Man, Chi Wah Wong, Allen Mao, Zahra Eftekhari

JCO Clinical Cancer Informatics, vol. 9, e2500011 (Epub 2025-09-05). DOI: 10.1200/CCI-25-00011
Citations: 0
Abstract
Purpose: Recent advances in retrieval-augmented generation (RAG) and large language models (LLMs) have revolutionized the extraction of real-world evidence from unstructured electronic health records (EHRs) in oncology. This study aims to enhance RAG's effectiveness by implementing a retriever encoder specifically designed for oncology EHRs, with the goal of improving the precision and relevance of retrieved clinical notes for oncology-related queries.
Methods: Our model was pretrained on more than six million oncology notes from 209,135 patients at City of Hope. The model was subsequently fine-tuned into a sentence transformer model using 12,371 query-passage training pairs. Specifically, the passages were obtained from actual patient notes, whereas the queries were synthesized by an LLM. We evaluated the retrieval performance of our model by comparing it with six widely used embedding models on 50 oncology questions across 10 categories, based on Normalized Discounted Cumulative Gain (NDCG), Precision, and Recall.
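The abstract does not specify the training objective, but sentence transformers fine-tuned on query-passage pairs are commonly trained with an in-batch-negatives contrastive loss, where each synthetic query's source passage is the positive and the other passages in the batch act as negatives. The sketch below (an assumption about the setup, not the authors' actual implementation) computes that loss for a small batch of precomputed embeddings:

```python
import numpy as np

def in_batch_negatives_loss(query_emb, passage_emb, scale=20.0):
    """Contrastive loss over a batch of (query, passage) pairs.

    Row i of query_emb matches row i of passage_emb (the positive);
    every other passage in the batch serves as an in-batch negative.
    """
    # Cosine similarity matrix: queries x passages
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    sim = scale * (q @ p.T)
    # Cross-entropy with the diagonal (matching pair) as the target class
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

With perfectly aligned embeddings the loss approaches zero; mismatched pairs drive it up, which is what pushes synthetic queries toward their source passages during fine-tuning.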
Results: In our test data set comprising 53 patients, our model exceeded the performance of the runner-up model by 9% for NDCG, 7% for Precision, and 6% for Recall (all evaluated at the top 10 results). Our model showed exceptional retrieval performance across all metrics for oncology-specific categories, including biomarkers assessed, current diagnosis, disease status, laboratory results, tumor characteristics, and tumor staging.
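The three evaluation metrics above are standard for ranked retrieval with binary relevance judgments. As a reference for how they are typically computed at a cutoff k (the exact evaluation code used in the study is not given in the abstract), a minimal sketch:

```python
import math

def precision_at_k(ranked, relevant, k=10):
    # Fraction of the top-k retrieved notes that are relevant
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k=10):
    # Fraction of all relevant notes that appear in the top k
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    # Discounted gain rewards placing relevant notes near the top;
    # normalized by the best achievable (ideal) ordering
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0
```

Unlike Precision and Recall, NDCG is rank-sensitive: two systems retrieving the same notes score differently if one places the relevant notes higher.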
Conclusion: Our findings highlight the effectiveness of pretrained contextual embeddings and sentence transformers in retrieving pertinent notes from oncology EHRs. The innovative use of LLM-synthesized query-passage pairs for data augmentation proved effective. This fine-tuning approach holds significant promise for specialized fields like health care, where acquiring annotated data is challenging.