NCT-CRC-HE: Not All Histopathological Datasets Are Equally Useful

Andrey Ignatov, Grigory Malivenko
{"title":"NCT-CRC-HE: Not All Histopathological Datasets Are Equally Useful","authors":"Andrey Ignatov, Grigory Malivenko","doi":"arxiv-2409.11546","DOIUrl":null,"url":null,"abstract":"Numerous deep learning-based solutions have been proposed for\nhistopathological image analysis over the past years. While they usually\ndemonstrate exceptionally high accuracy, one key question is whether their\nprecision might be affected by low-level image properties not related to\nhistopathology but caused by microscopy image handling and pre-processing. In\nthis paper, we analyze a popular NCT-CRC-HE-100K colorectal cancer dataset used\nin numerous prior works and show that both this dataset and the obtained\nresults may be affected by data-specific biases. The most prominent revealed\ndataset issues are inappropriate color normalization, severe JPEG artifacts\ninconsistent between different classes, and completely corrupted tissue samples\nresulting from incorrect image dynamic range handling. We show that even the\nsimplest model using only 3 features per image (red, green and blue color\nintensities) can demonstrate over 50% accuracy on this 9-class dataset, while\nusing color histogram not explicitly capturing cell morphology features yields\nover 82% accuracy. Moreover, we show that a basic EfficientNet-B0 ImageNet\npretrained model can achieve over 97.7% accuracy on this dataset, outperforming\nall previously proposed solutions developed for this task, including dedicated\nfoundation histopathological models and large cell morphology-aware neural\nnetworks. The NCT-CRC-HE dataset is publicly available and can be freely used\nto replicate the presented results. The codes and pre-trained models used in\nthis paper are available at\nhttps://github.com/gmalivenko/NCT-CRC-HE-experiments","PeriodicalId":501289,"journal":{"name":"arXiv - EE - Image and Video Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Image and Video Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11546","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Numerous deep learning-based solutions have been proposed for histopathological image analysis over the past years. While they usually demonstrate exceptionally high accuracy, one key question is whether their precision might be affected by low-level image properties not related to histopathology but caused by microscopy image handling and pre-processing. In this paper, we analyze a popular NCT-CRC-HE-100K colorectal cancer dataset used in numerous prior works and show that both this dataset and the obtained results may be affected by data-specific biases. The most prominent revealed dataset issues are inappropriate color normalization, severe JPEG artifacts inconsistent between different classes, and completely corrupted tissue samples resulting from incorrect image dynamic range handling. We show that even the simplest model using only 3 features per image (red, green and blue color intensities) can demonstrate over 50% accuracy on this 9-class dataset, while using color histogram not explicitly capturing cell morphology features yields over 82% accuracy. Moreover, we show that a basic EfficientNet-B0 ImageNet pretrained model can achieve over 97.7% accuracy on this dataset, outperforming all previously proposed solutions developed for this task, including dedicated foundation histopathological models and large cell morphology-aware neural networks. The NCT-CRC-HE dataset is publicly available and can be freely used to replicate the presented results. The codes and pre-trained models used in this paper are available at https://github.com/gmalivenko/NCT-CRC-HE-experiments
NCT-CRC-HE:并非所有组织病理学数据集都同样有用
在过去几年中,针对组织病理学图像分析提出了许多基于深度学习的解决方案。虽然它们通常表现出极高的准确性,但一个关键问题是,它们的准确性是否会受到与组织病理学无关、但由显微镜图像处理和预处理引起的低层次图像属性的影响。在本文中,我们分析了之前许多研究中使用的流行的 NCT-CRC-HE-100K 大肠癌数据集,结果表明该数据集和获得的结果都可能受到特定数据偏差的影响。数据集暴露出的最突出问题是色彩归一化不当、不同类别之间存在严重的 JPEG 伪影,以及图像动态范围处理不当导致组织样本完全损坏。我们的研究表明,即使是最简单的模型,每幅图像只使用 3 个特征(红、绿、蓝颜色密度),在这个 9 类数据集上的准确率也能超过 50%,而使用不明确捕捉细胞形态特征的颜色直方图,准确率也能超过 82%。此外,我们还表明,基本的 EfficientNet-B0 ImageNet 训练模型在该数据集上可以达到 97.7% 以上的准确率,优于之前针对该任务提出的所有解决方案,包括专用的基础组织病理学模型和大型细胞形态感知神经网络。NCT-CRC-HE数据集是公开的,可免费用于复制所展示的结果。本文中使用的代码和预训练模型可从以下网址获取:https://github.com/gmalivenko/NCT-CRC-HE-experiments。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信