Benchmarking large language models GPT-4o, Llama 3.1, and Qwen 2.5 for cancer genetic variant classification.

IF 6.8, Zone 1 (Medicine), JCR Q1 (ONCOLOGY)

NPJ Precision Oncology, Pub Date: 2025-05-15, DOI: 10.1038/s41698-025-00935-4

Kuan-Hsun Lin, Tzu-Hang Kao, Lei-Chi Wang, Chen-Tsung Kuo, Paul Chih-Hsueh Chen, Yuan-Chia Chu, Yi-Chen Yeh

NPJ Precision Oncology, volume 9, issue 1, article 141. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12078457/pdf/

Citations: 0



Benchmarking large language models GPT-4o, Llama 3.1, and Qwen 2.5 for cancer genetic variant classification.

Classifying cancer genetic variants based on clinical actionability is crucial yet challenging in precision oncology. Large language models (LLMs) offer potential solutions, but their performance remains underexplored. This study evaluates GPT-4o, Llama 3.1, and Qwen 2.5 in classifying genetic variants from the OncoKB and CIViC databases, as well as a real-world dataset derived from FoundationOne CDx reports. GPT-4o achieved the highest accuracy (0.7318) in distinguishing clinically relevant variants from variants of unknown clinical significance (VUS), outperforming Qwen 2.5 (0.5731) and Llama 3.1 (0.4976). LLMs demonstrated better concordance with expert annotations for variants with strong clinical evidence but exhibited greater inconsistencies for those with weaker evidence. All three models showed a tendency to assign variants to higher evidence levels, suggesting a propensity for overclassification. Prompt engineering significantly improved accuracy, while retrieval-augmented generation (RAG) further enhanced performance. Stability analysis across 100 iterations revealed greater consistency with the CIViC system than with OncoKB. These findings highlight the promise of LLMs in cancer genetic variant classification while underscoring the need for further optimization to improve accuracy, consistency, and clinical applicability.
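The abstract reports three kinds of measurement: accuracy against expert annotations, a tendency to assign stronger evidence levels than the experts did, and stability across repeated runs. The following Python sketch shows one way such quantities could be computed. It is not the authors' code: the function names, the integer encoding of evidence levels (smaller number means stronger evidence), the use of modal agreement as a stability measure, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def accuracy(predicted, expert):
    """Fraction of variants whose predicted label matches the expert label."""
    if len(predicted) != len(expert):
        raise ValueError("label lists must have the same length")
    return sum(p == e for p, e in zip(predicted, expert)) / len(expert)


def overclassification_rate(predicted_levels, expert_levels):
    """Share of variants given a stronger evidence level than the expert gave.

    Assumption: levels are integers and a smaller number means stronger
    evidence (for example, 1 is the strongest level).
    """
    if len(predicted_levels) != len(expert_levels):
        raise ValueError("level lists must have the same length")
    over = sum(p < e for p, e in zip(predicted_levels, expert_levels))
    return over / len(expert_levels)


def modal_agreement(run_labels):
    """For one variant, the share of repeated runs that return the modal label."""
    counts = Counter(run_labels)
    return counts.most_common(1)[0][1] / len(run_labels)


# Toy labels, invented for illustration only (not data from the study).
expert = ["relevant", "VUS", "relevant", "VUS"]
predicted = ["relevant", "relevant", "relevant", "VUS"]
print(accuracy(predicted, expert))  # 0.75

print(overclassification_rate([1, 1, 2, 3], [1, 2, 2, 3]))  # 0.25

# Ten hypothetical repeated runs on a single variant.
runs = ["relevant"] * 8 + ["VUS"] * 2
print(modal_agreement(runs))  # 0.8
```

Averaging `modal_agreement` over all variants would give a single stability score per model and annotation system; the study's own stability metric may be defined differently.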


Source journal

NPJ Precision Oncology (ONCOLOGY)

CiteScore

9.90

Self-citation rate

1.30%

Articles published

Review time

18 weeks

About the journal: Online-only and open access, npj Precision Oncology is an international, peer-reviewed journal dedicated to showcasing cutting-edge scientific research in all facets of precision oncology, spanning from fundamental science to translational applications and clinical medicine.