Harry Collin, Matthew J. Roberts, Kandice Keogh, Amila Siriwardana, Marnique Basto
{"title":"Improving clinical efficiency using retrieval-augmented generation in urologic oncology: A guideline-enhanced artificial intelligence approach","authors":"Harry Collin, Matthew J. Roberts, Kandice Keogh, Amila Siriwardana, Marnique Basto","doi":"10.1002/bco2.427","DOIUrl":null,"url":null,"abstract":"<p>Artificial intelligence (AI) in urology is evolving and has rapidly expanded since the release of ChatGPT and other large language models (LLMs). Early studies have found that AI-generated patient information is moderate to high quality for patient questions across multiple uro-oncology domains.<span><sup>1</sup></span> Extension into clinical decision-making suggests that ChatGPT can make decisions aligned with evidence-based medicine.<span><sup>2</sup></span></p><p>The key limitation of the publicly available ChatGPT (version 3.5) has been a reliance on knowledge confined to data published prior to September 2021.<span><sup>3</sup></span> Subscription-based ChatGPT Plus has since released ChatGPT 4.0, which is capable of web browsing. Equipped with the European Association of Urology (EAU) Guidelines, responses to urological queries are of higher quality,<span><sup>4</sup></span> and ChatGPT 4.0 has been shown to make complex medical decisions concordant with those discussed in multidisciplinary team meetings.<span><sup>5</sup></span> Furthermore, ChatGPT 4.0 allows LLMs to be curated for a specific task with retrieval-augmented generation (RAG), whereby a Generative Pre-trained Transformer (GPT) can use additional context provided by specialised materials to improve the accuracy of responses. Such advancements may realise the potential of AI systems to further urology practice via incorporation of up-to-date and highly specialised knowledge.</p><p>For instance, post-treatment imaging surveillance for renal cell carcinoma (RCC) is a common yet challenging clinical scenario due to disconcordance between histopathological diversity and guideline algorithms. 
The EAU Guidelines offer structured recommendations for follow-up imaging surveillance, which is pivotal for timely detection of cancer recurrence. However, effective clinical application of these guidelines requires specialised understanding of histopathology and can often be repetitive and time-consuming particularly when assigned to junior doctors.</p><p>This study aimed to test a GPT customised with RAG to interpret post-nephrectomy RCC histopathology reports and determine recommended follow-up surveillance imaging according to EAU Guidelines.</p><p>A RAG system was created using ChatGPT 4.0 (OpenAI via ChatGPT Plus, https://chat.openai.com/gpts/editor). The 2023 EAU Guidelines on RCC were uploaded to the GPT including Chapter 3 (Epidemiology, Aetiology and Pathology), Chapter 4 (Staging and Classification Systems) and Chapter 8 (Follow-Up in RCC). Code Interpreter capabilities were enabled, which allows the GPT to retrieve uploaded files and analyse data. Web browsing was disabled. No formal coding training or experience was required.</p><p>All instructions for the GPT were in free text (shown in Table S1). Instructions were written as three steps: interpret histopathology, determine surveillance regimen and output. The GPT was provided with clear directives to determine the risk profile (low, intermediate, or high risk) according to the EAU Guidelines, which uses Leibovich score for clear cell RCC (ccRCC) or histopathological stage and grade for non-ccRCC. The GPT was then instructed to recommend a follow-up imaging surveillance regimen, based on a template relative to risk profile (EAU Guidelines, tab. 8.1).</p><p>Simulated histopathology reports were created to represent all possible risk profiles (low, intermediate and high) relevant to the EAU Guidelines for the most common RCC subtypes—ccRCC, papillary (pRCC) and chromophobe (chRCC) tumours. Reports were structured according to International Society of Urological Pathology (ISUP) guidelines. 
Per the NHMRC National Statement on Ethical Conduct in Human Research 2023, this study did not require ethics committee approval as it utilised theoretical cases and does not meet the definition of human or animal research. Each report was input to the custom GPT on 15 January 2024. Responses were reviewed by two board-certified urologists for concordance with the EAU Guidelines.</p><p>Full histopathology reports and their raw outputs are shown in Table S2. Results are summarised in Figure 1, and concordance of custom GPT outputs with the EAU Guidelines for each simulated histopathology report is shown in Figure S3.</p><p>The custom GPT correctly determined all RCC risk profiles. All but one Leibovich score was correct (a 72 mm lesion was incorrectly defined as a tumour greater than 1 cm). All outputs recommended surveillance scans at the post-treatment intervals outlined in the EAU Guidelines. Three of the eight surveillance regimens (38%) were precisely concordant, while the remaining five contained additional imaging (3-month imaging for intermediate risk and 30-month imaging for high risk). The custom GPT proposed 2-yearly scans beyond 5 years for low-risk pRCC, contrary to the EAU Guidelines that suggest no further surveillance.</p><p>Most (6/8) recommendations specified imaging modality and body region (CT Chest and Abdomen). All outputs stated a guideline basis for the recommendation. Some outputs recommended renal and cardiovascular monitoring, which is mentioned in Chapter 8 of the EAU Guidelines, despite no specific prompting.</p><p>This study demonstrates the initial potential of RAG AI systems with integrated clinical guidelines to interpret results and make recommendations. This novel approach indicates the capacity of custom GPTs to handle complex and algorithmic tasks, which are often time-consuming and prone to human error. 
The surveillance regimens (38% concordance) generated under tailored instructions may show improvement over non-specialised ChatGPT 4 outputs, which lack focused access to specific guidelines and resulted in only 26% guideline concordance for prostate cancer.<span><sup>6</sup></span> Our results also compare favourably to studies using web-enabled ChatGPT 4.0 where 27% of responses to questions on kidney cancer, adapted from the EAU Guidelines, were of excellent quality.<span><sup>7</sup></span> Surveillance regimens were consistently safe, with no missing interval scans and additional scans in 62% of regimens, indicating a cautious approach.</p><p>AI supplementation of specialised urology knowledge has previously required programming skills.<span><sup>4, 8</sup></span> In this novel approach with a RAG design, we showed accurate, safe outputs from free text instructions without prior coding training. Consequently, ChatGPT 4.0 enables medical professionals to combine highly specialised knowledge with AI to enhance their clinical practice to their needs.</p><p>The inevitable introduction of AI to clinical settings must be met with close oversight from clinicians, especially when nuanced clinical judgements are involved. Here, the custom GPT miscalculated the Leibovich score from one theoretical report and could not consistently translate risk profile into a precisely concordant surveillance regimen. Conversely, unprompted recommendations promoted individualised care, such as renal and cardiovascular monitoring, so insights into comprehensive interpretation to enhance clinical practice were present.</p><p>A limitation of this study was partial inclusion of a single international guideline, which, despite being endorsed by 75 international societies, may limit GPT comprehensive integration. Future studies and AI developments could consider other guidelines (e.g., American Urological Association, National Institute for Health and Care Excellence). 
Ideally, development of a multi-guideline framework that utilises a decision-tree methodology could harness the power of AI to select and synthesise the most pertinent guidelines relevant to the local jurisdiction (limiting conflicts) or patient preferences for individualised care. Furthermore, web browsing was disabled to focus the AI on the uploaded guidelines but may have limited wider information access and integration, potentially affecting its handling of complex queries. Additionally, while the small series of histopathology report aimed to mitigate the custom GPT analysis complexity, future expansion of source number and content, as well as a training set, may further assess and enhance the custom GPT capabilities.</p><p>In conclusion, this focused evaluation of a GPT with integrated clinical guidelines illustrated the potential of AI, particularly RAG systems, for decision-making accuracy across the most common histopathological subtypes. Future incorporation may streamline clinical workflows and decision-making, but only with further evaluation, and cautious integration to ensure that these systems augment, not replace, clinician-directed personalised evidence-based care.</p><p>There are no conflicts of interest.</p>","PeriodicalId":72420,"journal":{"name":"BJUI compass","volume":"6 1","pages":""},"PeriodicalIF":1.6000,"publicationDate":"2024-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11771495/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BJUI compass","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/bco2.427","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"UROLOGY & NEPHROLOGY","Score":null,"Total":0}
Abstract
Artificial intelligence (AI) in urology is evolving and has rapidly expanded since the release of ChatGPT and other large language models (LLMs). Early studies have found that AI-generated patient information is moderate to high quality for patient questions across multiple uro-oncology domains.1 Extension into clinical decision-making suggests that ChatGPT can make decisions aligned with evidence-based medicine.2
The key limitation of the publicly available ChatGPT (version 3.5) has been a reliance on knowledge confined to data published prior to September 2021.3 OpenAI has since released ChatGPT 4.0 through the subscription-based ChatGPT Plus, which is capable of web browsing. When equipped with the European Association of Urology (EAU) Guidelines, it produces higher-quality responses to urological queries,4 and ChatGPT 4.0 has been shown to make complex medical decisions concordant with those discussed in multidisciplinary team meetings.5 Furthermore, ChatGPT 4.0 allows LLMs to be curated for a specific task with retrieval-augmented generation (RAG), whereby a Generative Pre-trained Transformer (GPT) can use additional context provided by specialised materials to improve the accuracy of responses. Such advancements may realise the potential of AI systems to further urology practice through the incorporation of up-to-date and highly specialised knowledge.
For instance, post-treatment imaging surveillance for renal cell carcinoma (RCC) is a common yet challenging clinical scenario due to discordance between histopathological diversity and guideline algorithms. The EAU Guidelines offer structured recommendations for follow-up imaging surveillance, which is pivotal for the timely detection of cancer recurrence. However, effective clinical application of these guidelines requires specialised understanding of histopathology and can often be repetitive and time-consuming, particularly when assigned to junior doctors.
This study aimed to test a GPT customised with RAG to interpret post-nephrectomy RCC histopathology reports and determine recommended follow-up surveillance imaging according to EAU Guidelines.
A RAG system was created using ChatGPT 4.0 (OpenAI via ChatGPT Plus, https://chat.openai.com/gpts/editor). The 2023 EAU Guidelines on RCC were uploaded to the GPT, including Chapter 3 (Epidemiology, Aetiology and Pathology), Chapter 4 (Staging and Classification Systems) and Chapter 8 (Follow-Up in RCC). The Code Interpreter capability was enabled, allowing the GPT to retrieve uploaded files and analyse data. Web browsing was disabled. No formal coding training or experience was required.
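Although the custom GPT required no code, the retrieval-augmented pattern it relies on can be sketched in a few lines. The toy retriever below uses plain keyword overlap rather than the embedding-based retrieval a production system would use; all function names are illustrative, not part of the study's setup.

```python
def chunk(text, size=40):
    """Split a guideline document into overlapping word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size // 2)]

def retrieve(query, chunks, k=2):
    """Rank chunks by shared-word count with the query and keep the top k.
    A real RAG system would use vector embeddings here."""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query, chunks):
    """Prepend retrieved guideline context so the model answers from it."""
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Use only this guideline context:\n{context}\n\nQuestion: {query}"
```

In ChatGPT's GPT editor, the uploaded chapters and free-text instructions play the roles of `chunks` and `build_prompt`, with retrieval handled internally by the platform.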
All instructions for the GPT were in free text (shown in Table S1). Instructions were written as three steps: interpret histopathology, determine surveillance regimen and output. The GPT was provided with clear directives to determine the risk profile (low, intermediate or high risk) according to the EAU Guidelines, which use the Leibovich score for clear cell RCC (ccRCC) and histopathological stage and grade for non-ccRCC. The GPT was then instructed to recommend a follow-up imaging surveillance regimen, based on a template relative to risk profile (EAU Guidelines, Table 8.1).
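The ccRCC branch of this risk-stratification step can be expressed as simple threshold logic. The cut-offs below (0–2 low, 3–5 intermediate, ≥6 high) follow the published Leibovich model and should be verified against the current EAU Guidelines; this is a sketch of the instructed logic, not the study's implementation.

```python
def leibovich_risk(score: int) -> str:
    """Map a total Leibovich score to a risk group for ccRCC.
    Cut-offs (0-2 low, 3-5 intermediate, >=6 high) follow the Leibovich
    model; verify against the current EAU Guidelines before any use."""
    if score <= 2:
        return "low"
    if score <= 5:
        return "intermediate"
    return "high"
```

For non-ccRCC, the GPT was instead directed to stage and grade, for which the EAU Guidelines define the corresponding groupings.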
Simulated histopathology reports were created to represent all possible risk profiles (low, intermediate and high) relevant to the EAU Guidelines for the most common RCC subtypes—ccRCC, papillary (pRCC) and chromophobe (chRCC) tumours. Reports were structured according to International Society of Urological Pathology (ISUP) guidelines. Per the NHMRC National Statement on Ethical Conduct in Human Research 2023, this study did not require ethics committee approval as it utilised theoretical cases and does not meet the definition of human or animal research. Each report was input to the custom GPT on 15 January 2024. Responses were reviewed by two board-certified urologists for concordance with the EAU Guidelines.
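For illustration, a simulated report structured around ISUP reporting elements might be represented as follows; the field names are assumptions for this sketch, not the study's exact schema.

```python
from dataclasses import dataclass

@dataclass
class HistopathologyReport:
    """Illustrative fields drawn from ISUP structured reporting elements."""
    subtype: str          # "ccRCC", "pRCC" or "chRCC"
    tumour_size_mm: int   # maximum tumour dimension
    pt_stage: str         # e.g. "pT1b"
    isup_grade: int       # 1-4 (not applied to chRCC)
    necrosis: bool        # coagulative tumour necrosis present
    margins_clear: bool   # surgical margin status
```

Each such report, rendered as free text, was the unit of input to the custom GPT.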
Full histopathology reports and their raw outputs are shown in Table S2. Results are summarised in Figure 1, and concordance of custom GPT outputs with the EAU Guidelines for each simulated histopathology report is shown in Figure S3.
The custom GPT correctly determined all RCC risk profiles. All but one Leibovich score was correct (a 72 mm lesion was incorrectly classified as a tumour greater than 10 cm). All outputs recommended surveillance scans at the post-treatment intervals outlined in the EAU Guidelines. Three of the eight surveillance regimens (38%) were precisely concordant, while the remaining five contained additional imaging (3-month imaging for intermediate risk and 30-month imaging for high risk). The custom GPT also proposed 2-yearly scans beyond 5 years for low-risk pRCC, contrary to the EAU Guidelines, which suggest no further surveillance.
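Concordance of this kind reduces to a set comparison of scan timepoints. The review in this study was performed manually by two urologists; the sketch below only illustrates the distinction between precise concordance, additional imaging and missing scans.

```python
def compare_regimen(proposed, guideline):
    """Compare proposed vs guideline scan timepoints (months post-treatment).
    Extra scans are cautious but non-concordant; missing scans are unsafe."""
    proposed, guideline = set(proposed), set(guideline)
    return {
        "concordant": proposed == guideline,
        "additional": sorted(proposed - guideline),   # scans beyond guideline
        "missing": sorted(guideline - proposed),      # guideline scans omitted
    }
```

A regimen with an unprompted 3-month scan, for example, would show `additional` but no `missing` entries, matching the "safe but over-cautious" pattern observed here.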
Most (6/8) recommendations specified imaging modality and body region (CT Chest and Abdomen). All outputs stated a guideline basis for the recommendation. Some outputs recommended renal and cardiovascular monitoring, which is mentioned in Chapter 8 of the EAU Guidelines, despite no specific prompting.
This study demonstrates the initial potential of RAG AI systems with integrated clinical guidelines to interpret results and make recommendations. This novel approach indicates the capacity of custom GPTs to handle complex and algorithmic tasks, which are often time-consuming and prone to human error. The surveillance regimens (38% concordance) generated under tailored instructions may show improvement over non-specialised ChatGPT 4.0 outputs, which lack focused access to specific guidelines and achieved only 26% guideline concordance for prostate cancer.6 Our results also compare favourably with studies using web-enabled ChatGPT 4.0, in which 27% of responses to questions on kidney cancer, adapted from the EAU Guidelines, were of excellent quality.7 Surveillance regimens were consistently safe, with no missing interval scans and additional scans in 62% of regimens, indicating a cautious approach.
AI supplementation of specialised urology knowledge has previously required programming skills.4, 8 In this novel approach with a RAG design, we showed accurate, safe outputs from free-text instructions without prior coding training. Consequently, ChatGPT 4.0 enables medical professionals to combine highly specialised knowledge with AI and tailor it to the needs of their clinical practice.
The inevitable introduction of AI to clinical settings must be met with close oversight from clinicians, especially when nuanced clinical judgements are involved. Here, the custom GPT miscalculated the Leibovich score from one theoretical report and could not consistently translate risk profile into a precisely concordant surveillance regimen. Conversely, unprompted recommendations such as renal and cardiovascular monitoring promoted individualised care, suggesting a capacity for comprehensive interpretation that could enhance clinical practice.
A limitation of this study was the partial inclusion of a single international guideline, which, despite being endorsed by 75 international societies, may have limited the comprehensiveness of the GPT's integration. Future studies and AI developments could consider other guidelines (e.g., American Urological Association, National Institute for Health and Care Excellence). Ideally, development of a multi-guideline framework that utilises a decision-tree methodology could harness the power of AI to select and synthesise the guidelines most pertinent to the local jurisdiction (limiting conflicts) or to patient preferences for individualised care. Furthermore, web browsing was disabled to focus the AI on the uploaded guidelines, but this may have limited wider information access and integration, potentially affecting its handling of complex queries. Additionally, while the small series of histopathology reports was intended to limit the complexity of the custom GPT's analysis, future expansion of source number and content, together with a dedicated training set, may further assess and enhance its capabilities.
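As a purely hypothetical sketch of such a multi-guideline framework, the first branch of the decision tree (guideline selection by jurisdiction, with a safe default) could look like the following; the mapping and fallback rules are illustrative assumptions, not a proposed standard.

```python
# Illustrative guideline registry; organisation names are real, but the
# jurisdiction mapping and precedence rules are assumptions for this sketch.
GUIDELINES = {
    "EU": "EAU Guidelines on RCC",
    "US": "AUA Guidelines",
    "UK": "NICE Guidance",
}

def select_guideline(jurisdiction, patient_preference=None):
    """Prefer an explicit patient choice, then the local guideline,
    then fall back to the EAU Guidelines as the default source."""
    if patient_preference in GUIDELINES.values():
        return patient_preference
    return GUIDELINES.get(jurisdiction, GUIDELINES["EU"])
```

Downstream branches of the tree would then resolve conflicts between the selected guidelines for the specific clinical question.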
In conclusion, this focused evaluation of a GPT with integrated clinical guidelines illustrated the potential of AI, particularly RAG systems, for decision-making accuracy across the most common histopathological subtypes. Future incorporation may streamline clinical workflows and decision-making, but only with further evaluation and cautious integration to ensure that these systems augment, not replace, clinician-directed, personalised, evidence-based care.