NCT-CRC-HE: Not All Histopathological Datasets Are Equally Useful
Andrey Ignatov, Grigory Malivenko
arXiv:2409.11546 (arXiv - EE - Image and Video Processing), 2024-09-17
Numerous deep learning-based solutions have been proposed for
histopathological image analysis in recent years. While they usually
demonstrate exceptionally high accuracy, one key question is whether this
accuracy is affected by low-level image properties that are unrelated to
histopathology and instead stem from microscopy image handling and pre-processing. In
this paper, we analyze the popular NCT-CRC-HE-100K colorectal cancer dataset used
in numerous prior works and show that both the dataset and the results obtained
on it may be affected by data-specific biases. The most prominent issues we
found are inappropriate color normalization, severe JPEG artifacts that are
inconsistent between classes, and completely corrupted tissue samples
resulting from incorrect handling of the image dynamic range. We show that even the
simplest model, using only 3 features per image (red, green and blue color
intensities), can achieve over 50% accuracy on this 9-class dataset, while
a color histogram, which does not explicitly capture cell morphology features, yields
over 82% accuracy. Moreover, we show that a basic ImageNet-pretrained
EfficientNet-B0 model can achieve over 97.7% accuracy on this dataset, outperforming
all previously proposed solutions developed for this task, including dedicated
histopathology foundation models and large cell morphology-aware neural
networks. The NCT-CRC-HE dataset is publicly available and can be freely used
to replicate the presented results. The code and pre-trained models used in
this paper are available at
https://github.com/gmalivenko/NCT-CRC-HE-experiments
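
To make the two color-only baselines mentioned in the abstract concrete, the following is a minimal sketch of how such features could be computed from a single RGB tile. It is not the authors' implementation (their code is in the repository linked above). The function names, the bin count, and the nearest-centroid classifier are all assumptions chosen for illustration, since the abstract does not say which classifier was trained on these features.

```python
import numpy as np


def mean_rgb_features(image: np.ndarray) -> np.ndarray:
    """Return the 3 per-channel mean intensities of an HxWx3 uint8 image, scaled to [0, 1]."""
    pixels = image.reshape(-1, 3).astype(np.float64) / 255.0
    return pixels.mean(axis=0)


def color_histogram_features(image: np.ndarray, bins: int = 16) -> np.ndarray:
    """Return concatenated per-channel histograms (3 * bins values), each summing to 1.

    The bin count is an assumption; the abstract does not specify one.
    """
    feats = []
    for c in range(3):
        hist, _ = np.histogram(image[..., c], bins=bins, range=(0, 256))
        feats.append(hist / hist.sum())
    return np.concatenate(feats)


def fit_centroids(features: np.ndarray, labels: np.ndarray) -> dict:
    """Hypothetical baseline classifier: store one mean feature vector per class."""
    return {int(k): features[labels == k].mean(axis=0) for k in np.unique(labels)}


def predict(centroids: dict, feature: np.ndarray) -> int:
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda k: np.linalg.norm(centroids[k] - feature))
```

Neither feature set encodes any spatial structure: shuffling the pixels of a tile leaves both vectors unchanged. This is why high accuracy from such features would point to color or pre-processing differences between classes rather than to learned cell morphology.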