Evaluating the positive predictive value of code-based identification of cirrhosis and its complications utilizing GPT-4
Aryana Far, Asal Bastani, Albert Lee, Oksana Gologorskaya, Chiung-Yu Huang, Mark J. Pletcher, Jennifer C. Lai, Jin Ge
Journal: Hepatology (Impact Factor 12.9; JCR Q1, Gastroenterology & Hepatology)
Publication date: 2024-10-08
DOI: https://doi.org/10.1097/hep.0000000000001115
Citation count: 0
Abstract
Background: Diagnosis code classification is a common method for cohort identification in cirrhosis research, but it is often inaccurate and is augmented by labor-intensive chart review. Natural language processing (NLP) using large language models (LLMs) is a potentially more accurate method. To assess LLMs’ potential for cirrhosis cohort identification, we compared code-based versus LLM-based classification with chart review as a “gold standard.”

Methods: We extracted and conducted a limited chart review of 3,788 discharge summaries of cirrhosis admissions. We engineered zero-shot prompts using Generative Pre-trained Transformer (GPT)-4 to determine whether cirrhosis and its complications were active hospitalization problems. We calculated positive predictive values (PPVs) of LLM-based classification versus limited chart review, and PPVs of code-based versus LLM-based classification as a “silver standard” in all 3,788 summaries.

Results: Versus gold standard chart review, code-based classification achieved PPVs of 82.2% for identifying cirrhosis, 41.7% for hepatic encephalopathy, 72.8% for ascites, 59.8% for gastrointestinal bleeding, and 48.8% for spontaneous bacterial peritonitis. Compared to chart review, GPT-4 achieved accuracies of 87.8% to 98.8% for identifying cirrhosis and its complications. Using LLM as a silver standard, code-based classification achieved PPVs of 79.8% for identifying cirrhosis, 53.9% for hepatic encephalopathy, 55.3% for ascites, 67.6% for gastrointestinal bleeding, and 65.5% for spontaneous bacterial peritonitis.

Conclusions: LLM-based classification was highly accurate versus manual chart review in identifying cirrhosis and its complications, which allowed us to assess the performance of code-based classification at scale using LLMs as a silver standard. These results suggest LLMs could augment or replace code-based cohort classification and raise questions regarding the necessity of chart review.
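The two metrics reported in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions, PPV = TP / (TP + FP) and accuracy = (correct calls) / (all calls), computed against a reference standard (chart review or LLM output). The function names and the toy labels are hypothetical and are not taken from the study.

```python
from typing import Sequence


def positive_predictive_value(flagged: Sequence[bool], reference: Sequence[bool]) -> float:
    """Fraction of positive calls that the reference standard confirms: TP / (TP + FP)."""
    if len(flagged) != len(reference):
        raise ValueError("flagged and reference must have the same length")
    true_pos = sum(1 for f, r in zip(flagged, reference) if f and r)
    false_pos = sum(1 for f, r in zip(flagged, reference) if f and not r)
    if true_pos + false_pos == 0:
        raise ValueError("no positive calls, so PPV is undefined")
    return true_pos / (true_pos + false_pos)


def accuracy(predicted: Sequence[bool], reference: Sequence[bool]) -> float:
    """Fraction of all calls (positive or negative) that agree with the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("empty input")
    agree = sum(1 for p, r in zip(predicted, reference) if p == r)
    return agree / len(reference)


# Hypothetical toy labels for five discharge summaries (illustration only).
code_flags = [True, True, True, True, False]     # diagnosis code present
chart_review = [True, True, True, False, False]  # reference standard
llm_calls = [True, True, True, False, True]      # LLM classification

ppv_codes = positive_predictive_value(code_flags, chart_review)
acc_llm = accuracy(llm_calls, chart_review)
print(f"PPV of codes vs chart review: {ppv_codes:.2f}")
print(f"Accuracy of LLM vs chart review: {acc_llm:.2f}")
```

Swapping `chart_review` for `llm_calls` as the reference in `positive_predictive_value` corresponds to the "silver standard" comparison described in the Methods, under the same assumed definitions.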
About the journal:
HEPATOLOGY is recognized as the leading publication in the field of liver disease. It features original, peer-reviewed articles covering various aspects of liver structure, function, and disease. The journal's distinguished Editorial Board carefully selects the best articles each month, focusing on topics including immunology, chronic hepatitis, viral hepatitis, cirrhosis, genetic and metabolic liver diseases, liver cancer, and drug metabolism.