NCT-CRC-HE: Not All Histopathological Datasets Are Equally Useful

arXiv - EE - Image and Video Processing. Pub Date: 2024-09-17. DOI: arxiv-2409.11546

Andrey Ignatov, Grigory Malivenko


Abstract

Numerous deep learning-based solutions have been proposed for histopathological image analysis in recent years. While they usually demonstrate exceptionally high accuracy, a key question is whether that accuracy is affected by low-level image properties that are unrelated to histopathology and are caused instead by microscopy image handling and pre-processing. In this paper, we analyze the popular NCT-CRC-HE-100K colorectal cancer dataset, used in numerous prior works, and show that both the dataset and the results obtained on it may be affected by data-specific biases. The most prominent issues revealed are inappropriate color normalization, severe JPEG artifacts that are inconsistent between classes, and completely corrupted tissue samples resulting from incorrect handling of image dynamic range. We show that even the simplest model, using only 3 features per image (red, green and blue color intensities), can achieve over 50% accuracy on this 9-class dataset, while a color histogram, which does not explicitly capture cell morphology features, yields over 82% accuracy. Moreover, we show that a basic ImageNet-pretrained EfficientNet-B0 model can achieve over 97.7% accuracy on this dataset, outperforming all previously proposed solutions developed for this task, including dedicated histopathological foundation models and large cell morphology-aware neural networks. The NCT-CRC-HE dataset is publicly available and can be freely used to replicate the presented results. The code and pre-trained models used in this paper are available at https://github.com/gmalivenko/NCT-CRC-HE-experiments
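To make the color-only baselines concrete, the following is a minimal sketch of how such features could be computed. It is not the authors' implementation (their code is in the repository linked above). It assumes that the 3 features are per-channel mean intensities, that the histogram is a per-channel histogram with an arbitrary bin count, and that a nearest-centroid rule stands in for the classifier, which the abstract does not specify. All function names are hypothetical.

```python
import numpy as np


def mean_rgb_features(image: np.ndarray) -> np.ndarray:
    """Return 3 features: mean R, G, B intensity scaled to [0, 1].

    Assumes an H x W x 3 uint8 RGB array (an assumption, not from the paper).
    """
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an H x W x 3 RGB array")
    return image.reshape(-1, 3).mean(axis=0) / 255.0


def color_histogram_features(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Return concatenated per-channel histograms, each normalized to sum to 1.

    The bin count is an illustrative choice; no spatial or morphological
    information is retained, only the distribution of pixel intensities.
    """
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an H x W x 3 RGB array")
    parts = []
    for channel in range(3):
        hist, _ = np.histogram(image[..., channel], bins=bins, range=(0, 256))
        parts.append(hist / hist.sum())
    return np.concatenate(parts)


def fit_centroids(features: np.ndarray, labels: np.ndarray):
    """Compute one mean feature vector per class (stand-in classifier)."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict_nearest_centroid(features, classes, centroids):
    """Assign each sample to the class with the closest centroid."""
    diffs = features[:, None, :] - centroids[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    return classes[distances.argmin(axis=1)]
```

If a classifier built on features like these separates the nine tissue classes well above the 1-in-9 chance level, the class labels are at least partly predictable from color statistics alone, which is the kind of dataset bias the abstract describes.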

