Mike Zack, Ioan Slobodchikov, Danil Stupichev, Alex Moore, David Sokolov, Igor Trifonov, Allan Gobbs
Benchmarking large language models for replication of guideline-based PGx recommendations
Pharmacogenomics Journal 25(4):23. Journal Article. Published 2025-07-26.
DOI: 10.1038/s41397-025-00383-0 (https://doi.org/10.1038/s41397-025-00383-0)
Journal impact factor: 2.9. JCR: Q2, Genetics & Heredity.
Citation count: 0
Abstract
We evaluated the ability of large language models (LLMs) to generate clinically accurate pharmacogenomic (PGx) recommendations aligned with CPIC guidelines. Using a benchmark of 599 curated gene-drug-phenotype scenarios, we compared five leading models, including GPT-4o and fine-tuned LLaMA variants, through both standard lexical metrics and a novel semantic evaluation framework (LLM Score) validated by expert review. General-purpose models frequently produced incomplete or unsafe outputs, while our domain-adapted model achieved superior performance, with an LLM Score of 0.92 and significantly faster inference speed. Results highlight the importance of fine-tuning and structured prompting over model scale alone. This work establishes a robust framework for evaluating PGx-specific LLMs and demonstrates the feasibility of safer, AI-driven personalized medicine.
About the journal:
The Pharmacogenomics Journal is a print and electronic journal dedicated to the rapid publication of original research on pharmacogenomics and its clinical applications.
Key areas of coverage include:
Personalized medicine
Effects of genetic variability on drug toxicity and efficacy
Identification and functional characterization of polymorphisms relevant to drug action
Pharmacodynamic and pharmacokinetic variations and drug efficacy
Integration of new developments in the genome project and proteomics into clinical medicine, pharmacology, and therapeutics
Clinical applications of genomic science
Identification of novel genomic targets for drug development
Potential benefits of pharmacogenomics.