Stress testing deep learning models for prostate cancer detection on biopsies and surgical specimens.

Brennan T Flannery, Howard M Sandler, Priti Lal, Michael D Feldman, Juan C Santa-Rosario, Tilak Pathak, Tuomas Mirtti, Xavier Farre, Rohann Correa, Susan Chafe, Amit Shah, Jason A Efstathiou, Karen Hoffman, Mark A Hallman, Michael Straza, Richard Jordan, Stephanie L Pugh, Felix Feng, Anant Madabhushi
{"title":"Stress testing deep learning models for prostate cancer detection on biopsies and surgical specimens.","authors":"Brennan T Flannery, Howard M Sandler, Priti Lal, Michael D Feldman, Juan C Santa-Rosario, Tilak Pathak, Tuomas Mirtti, Xavier Farre, Rohann Correa, Susan Chafe, Amit Shah, Jason A Efstathiou, Karen Hoffman, Mark A Hallman, Michael Straza, Richard Jordan, Stephanie L Pugh, Felix Feng, Anant Madabhushi","doi":"10.1002/path.6373","DOIUrl":null,"url":null,"abstract":"<p><p>The presence, location, and extent of prostate cancer is assessed by pathologists using H&E-stained tissue slides. Machine learning approaches can accomplish these tasks for both biopsies and radical prostatectomies. Deep learning approaches using convolutional neural networks (CNNs) have been shown to identify cancer in pathologic slides, some securing regulatory approval for clinical use. However, differences in sample processing can subtly alter the morphology between sample types, making it unclear whether deep learning algorithms will consistently work on both types of slide images. Our goal was to investigate whether morphological differences between sample types affected the performance of biopsy-trained cancer detection CNN models when applied to radical prostatectomies and vice versa using multiple cohorts (N = 1,000). Radical prostatectomies (N = 100) and biopsies (N = 50) were acquired from The University of Pennsylvania to train (80%) and validate (20%) a DenseNet CNN for biopsies (M<sup>B</sup>), radical prostatectomies (M<sup>R</sup>), and a combined dataset (M<sup>B+R</sup>). On a tile level, M<sup>B</sup> and M<sup>R</sup> achieved F1 scores greater than 0.88 when applied to their own sample type but less than 0.65 when applied across sample types. On a whole-slide level, models achieved significantly better performance on their own sample type compared to the alternative model (p < 0.05) for all metrics. This was confirmed by external validation using digitized biopsy slide images from a clinical trial [NRG Radiation Therapy Oncology Group (RTOG)] (NRG/RTOG 0521, N = 750) via both qualitative and quantitative analyses (p < 0.05). A comprehensive review of model outputs revealed morphologically driven decision making that adversely affected model performance. M<sup>B</sup> appeared to be challenged with the analysis of open gland structures, whereas M<sup>R</sup> appeared to be challenged with closed gland structures, indicating potential morphological variation between the training sets. These findings suggest that differences in morphology and heterogeneity necessitate the need for more tailored, sample-specific (i.e. biopsy and surgical) machine learning models. © 2024 The Author(s). The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.</p>","PeriodicalId":232,"journal":{"name":"The Journal of Pathology","volume":" ","pages":""},"PeriodicalIF":5.6000,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Journal of Pathology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1002/path.6373","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ONCOLOGY","Score":null,"Total":0}
Abstract
The presence, location, and extent of prostate cancer are assessed by pathologists using H&E-stained tissue slides. Machine learning approaches can accomplish these tasks for both biopsies and radical prostatectomies. Deep learning approaches using convolutional neural networks (CNNs) have been shown to identify cancer in pathology slides, some of which have secured regulatory approval for clinical use. However, differences in sample processing can subtly alter morphology between sample types, making it unclear whether deep learning algorithms will work consistently on both types of slide images. Our goal was to investigate, using multiple cohorts (N = 1,000), whether morphological differences between sample types affected the performance of biopsy-trained cancer detection CNN models when applied to radical prostatectomies, and vice versa. Radical prostatectomies (N = 100) and biopsies (N = 50) were acquired from The University of Pennsylvania to train (80%) and validate (20%) a DenseNet CNN for biopsies (M^B), radical prostatectomies (M^R), and a combined dataset (M^B+R). At the tile level, M^B and M^R achieved F1 scores greater than 0.88 when applied to their own sample type but less than 0.65 when applied across sample types. At the whole-slide level, each model achieved significantly better performance on its own sample type than the alternative model did (p < 0.05) for all metrics. This was confirmed by external validation on digitized biopsy slide images from a clinical trial, NRG Radiation Therapy Oncology Group (RTOG) 0521 (N = 750), via both qualitative and quantitative analyses (p < 0.05). A comprehensive review of model outputs revealed morphologically driven decision making that adversely affected model performance. M^B appeared to be challenged by open gland structures, whereas M^R appeared to be challenged by closed gland structures, indicating potential morphological variation between the training sets. These findings suggest that differences in morphology and heterogeneity necessitate more tailored, sample-specific (i.e. biopsy and surgical) machine learning models. © 2024 The Author(s). The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.
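For readers who want a concrete picture of the tile-level setup the abstract describes (a DenseNet CNN trained on tiles from one sample type and scored by tile-level F1), the following is a minimal sketch of how such a pipeline is commonly assembled in PyTorch. It is not the authors' code: the abstract does not specify the DenseNet variant, tile size, label scheme, optimizer, or framework, so DenseNet-121, 256 x 256 binary-labeled tiles, and the helper names (build_tile_classifier, train_one_epoch, tile_level_f1) are assumptions for illustration only.

# Minimal sketch (not the authors' implementation): fine-tuning a DenseNet tile
# classifier and computing tile-level F1. DenseNet-121, 256x256 RGB tiles, and
# binary cancer/benign labels are assumptions not stated in the abstract.
import torch
import torch.nn as nn
from torchvision import models
from sklearn.metrics import f1_score

def build_tile_classifier(num_classes: int = 2) -> nn.Module:
    """DenseNet backbone with a new classification head for H&E tiles."""
    model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    return model

def train_one_epoch(model, loader, optimizer, device="cuda"):
    """One pass over a DataLoader yielding (tiles, labels) batches."""
    model.train()
    loss_fn = nn.CrossEntropyLoss()
    for tiles, labels in loader:  # tiles: (B, 3, 256, 256), labels: (B,)
        tiles, labels = tiles.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(tiles), labels)
        loss.backward()
        optimizer.step()

@torch.no_grad()
def tile_level_f1(model, loader, device="cuda") -> float:
    """F1 over individual tiles, the per-sample-type metric reported above."""
    model.eval()
    preds, truth = [], []
    for tiles, labels in loader:
        logits = model(tiles.to(device))
        preds.extend(logits.argmax(dim=1).cpu().tolist())
        truth.extend(labels.tolist())
    return f1_score(truth, preds)

Under this kind of setup, the cross-sample-type degradation reported in the abstract would correspond to evaluating tile_level_f1 with a model trained on one sample type (e.g. biopsies) against a loader drawn from the other (e.g. radical prostatectomies).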