Automatic labels are as effective as manual labels in digital pathology images classification with deep learning
Authors: Niccolo Marini, Stefano Marchesin, Lluis Borras Ferris, Simon Püttmann, Marek Wodzinski, Riccardo Fratti, Damian Podareanu, Alessandro Caputo, Svetla Boytcheva, Simona Vatrano, Filippo Fraggetta, Iris Nagtegaal, Gianmaria Silvello, Manfredo Atzori, Henning Müller
Journal: Journal of Pathology Informatics, volume 18, Article 100462 (Journal Article)
Publication date: 2025-08-01
DOI: 10.1016/j.jpi.2025.100462
URL: https://www.sciencedirect.com/science/article/pii/S2153353925000483
Citations: 0
Abstract
The increasing availability of biomedical data is helping to design more robust deep learning (DL) algorithms to analyze biomedical samples. Currently, one of the main limitations to training DL algorithms to perform a specific task is the need for medical experts to manually label the data. Automatic methods to label data exist; however, automatic labels can be noisy, and it is not completely clear in which situations they can be used to train DL models.
This paper investigates the circumstances under which automatic labels can be used to train a DL model for the classification of whole slide images (WSIs). The analysis involves multiple architectures, such as convolutional neural networks and vision transformers, and 10,604 WSIs as training data, collected from three use cases: celiac disease, lung cancer, and colon cancer, which involve binary, multiclass, and multilabel data, respectively. The results identify 10% as the percentage of noisy labels that can be tolerated before performance drops off; up to that level, effective WSI classification models can still be trained, reaching F1-scores of 0.906, 0.757, and 0.833 for the three use cases, respectively. An algorithm that generates automatic labels therefore needs to stay within this noise range to be adopted, as shown by applying the Semantic Knowledge Extractor Tool to automatically extract concepts and use them as labels. In this case, automatic labels are as effective as manual ones, yielding performance comparable to that of models trained with manual labels.
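To make the notion of a noisy-label percentage concrete, the following is a minimal Python sketch of corrupting a controlled fraction of binary slide-level labels and of computing an F1-score. It is an illustration only: the abstract does not say how label noise was introduced or measured, so uniform random flipping, the helper names `flip_binary_labels` and `f1_score`, and the toy label list are all assumptions rather than the authors' procedure.

```python
import random


def flip_binary_labels(labels, noise_rate, seed=0):
    """Return a copy of `labels` (0/1) with round(noise_rate * n) entries flipped.

    Assumption: noise is uniform random flipping; the paper may use another scheme.
    """
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def f1_score(y_true, y_pred):
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: 100 slide-level labels, half positive.
clean = [0, 1] * 50
noisy = flip_binary_labels(clean, noise_rate=0.10)

# Count how many labels differ from the clean reference.
n_changed = sum(c != n for c, n in zip(clean, noisy))
print(f"{n_changed} of {len(clean)} labels flipped")
```

In an experiment of the kind the abstract describes, one would presumably train a model on `noisy` at several noise rates and evaluate each on a test set with clean labels, using a metric such as `f1_score`, to locate the point where performance starts to drop.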
About the journal:
The Journal of Pathology Informatics (JPI) is an open access peer-reviewed journal dedicated to the advancement of pathology informatics. This is the official journal of the Association for Pathology Informatics (API). The journal aims to publish broadly about pathology informatics and freely disseminate all articles worldwide. This journal is of interest to pathologists, informaticians, academics, researchers, health IT specialists, information officers, IT staff, vendors, and anyone with an interest in informatics. We encourage submissions from anyone with an interest in the field of pathology informatics. We publish all types of papers related to pathology informatics including original research articles, technical notes, reviews, viewpoints, commentaries, editorials, symposia, meeting abstracts, book reviews, and correspondence to the editors. All submissions are subject to rigorous peer review by the well-regarded editorial board and by expert referees in appropriate specialties.