Evaluation of Artificial Intelligence-Based Gleason Grading Algorithms “in the Wild”

Khrystyna Faryna, Leslie Tessier, Juan Retamero, Saikiran Bonthu, Pranab Samanta, Nitin Singhal, Solene-Florence Kammerer-Jacquet, Camelia Radulescu, Vittorio Agosti, Alexandre Collin, Xavier Farré, Jacqueline Fontugne, Rainer Grobholz, Agnes Marije Hoogland, Katia Ramos Moreira Leite, Murat Oktay, Antonio Polonia, Paromita Roy, Paulo Guilherme Salles, Theodorus H. van der Kwast, Geert Litjens

Modern Pathology, Volume 37, Issue 11, Article 100563. Published online July 16, 2024.
DOI: 10.1016/j.modpat.2024.100563
https://www.sciencedirect.com/science/article/pii/S0893395224001431
Citations: 0
Abstract
The biopsy Gleason score is an important prognostic marker for prostate cancer patients. It is, however, subject to substantial variability among pathologists. Artificial intelligence (AI)-based algorithms employing deep learning have shown their ability to match pathologists’ performance in assigning Gleason scores, with the potential to enhance pathologists’ grading accuracy. The performance of Gleason AI algorithms in research is mostly reported on common benchmark data sets or within public challenges. In contrast, many commercial algorithms are evaluated in clinical studies, for which data are not publicly released. As commercial AI vendors typically do not publish performance on public benchmarks, comparison between research and commercial AI is difficult. The aims of this study are to evaluate and compare the performance of top-ranked public and commercial algorithms using real-world data. Through crowdsourcing, we curated a data set of whole-slide prostate biopsy images spanning a range of Gleason scores and originating from diverse sources. Predictions were obtained from 5 top-ranked public algorithms from the Prostate cANcer graDe Assessment (PANDA) challenge and 2 commercial Gleason grading algorithms. Additionally, 10 pathologists (A.C., C.R., J.v.I., K.R.M.L., P.R., P.G.S., R.G., S.F.K.J., T.v.d.K., X.F.) evaluated the data set in a reader study. Overall, the pairwise quadratic weighted kappa among pathologists ranged from 0.777 to 0.916. Both public and commercial algorithms showed high agreement with pathologists, with quadratic weighted kappa ranging from 0.617 to 0.900. Commercial algorithms performed on par with, or outperformed, the top public algorithms.
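The agreement statistic reported throughout the abstract is the pairwise quadratic weighted kappa (QWK), which scores 1 for perfect agreement and 0 for chance-level agreement, penalizing disagreements by the squared distance between grades. The snippet below is a minimal sketch of how such pairwise QWK values can be computed with scikit-learn; the rater names and grade-group labels are purely illustrative assumptions, not the study's data or pipeline.

```python
# Minimal sketch of pairwise quadratic weighted kappa (QWK).
# Labels are hypothetical ISUP grade groups (0 = benign, 1-5 = grade group);
# they are NOT taken from the study.
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings: each key is a rater (pathologist or algorithm),
# each value is that rater's grade group per biopsy.
ratings = {
    "pathologist_A": [0, 2, 3, 5, 1, 4],
    "pathologist_B": [0, 2, 2, 5, 1, 4],
    "algorithm_X":   [0, 3, 3, 5, 1, 5],
}

# weights="quadratic" makes disagreements cost the squared distance between
# grades, so a 1-step disagreement is penalized far less than a 4-step one.
for (name_a, y_a), (name_b, y_b) in combinations(ratings.items(), 2):
    qwk = cohen_kappa_score(y_a, y_b, weights="quadratic")
    print(f"{name_a} vs {name_b}: QWK = {qwk:.3f}")
```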
Journal Introduction:
Modern Pathology, an international journal under the ownership of The United States & Canadian Academy of Pathology (USCAP), serves as an authoritative platform for publishing top-tier clinical and translational research studies in pathology.
Original manuscripts are the primary focus of Modern Pathology, complemented by impactful editorials, reviews, and practice guidelines covering all facets of precision diagnostics in human pathology. The journal's scope includes advancements in molecular diagnostics and genomic classifications of diseases, breakthroughs in immuno-oncology, computational science, applied bioinformatics, and digital pathology.